Things that Make Robots Go HMMM:
Heterogeneous Multilevel Multimodal Mixing to
Realise Fluent, Multiparty, Human-Robot Interaction
Daniel Davison1, Binnur Görer2, Jan Kolkmeier1, Jeroen Linssen1, Bob Schadenberg1, Bob van de Vijver1,4,
Nick Campbell3, Edwin Dertien1, Dennis Reidsma1
1 University of Twente, Enschede, The Netherlands
2 Boğaziçi University, Istanbul, Turkey
3 Trinity College Dublin, Dublin, Ireland
4 Part of the work on behaviour mixing was previously reported as part of the MSc Thesis ‘A Human Robot Interaction Toolkit with Heterogeneous Multilevel Multimodal Mixing’ by Bob van de Vijver.
Abstract—Fluent, multi-party, human-robot interaction calls
for the mixing of deliberate conversational behaviour and re-
active, semi-autonomous behaviour. In this project, we worked
on a novel, state-of-the-art setup for realising such interactions.
We approach this challenge from two sides. On the one hand, a dialogue manager requests deliberative behaviour and sets parameters on ongoing (semi)autonomous behaviour. On the other hand, robot control software needs to translate and mix these deliberative and bottom-up behaviours into consistent and coherent motion. The two need to collaborate to create behaviour that is fluent, naturally varied, and well-integrated. The resulting challenge is that this behaviour needs to conform simultaneously to high-level requirements and to the content and timing set by the dialogue manager. We tackled this
challenge by designing a framework which can mix these two
types of behaviour, using AsapRealizer, a Behaviour Markup
Language realiser. We call this Heterogeneous Multilevel Mul-
timodal Mixing (HMMM). Our framework is showcased in a
scenario which revolves around a robot receptionist which is
able to interact with multiple users.
Index Terms—Social robotics, human-robot interaction, multi-
party interaction, multi-modal interaction, Behaviour Markup
Language.
I. INTRODUCTION
THE main objective of this project is to bring forward the
state of the art in fluent human-robot dialogue by improv-
ing the integration between deliberative and (semi)autonomous
behaviour control. The interaction setting in which this has
been done is one of multi-party interaction between one
robot and several humans. The project builds upon interac-
tion scenarios with collaborative educational tasks, as used
in the context of the EU EASEL project [1], and uses and
extends the state-of-the-art BML realiser AsapRealizer [2].
Fluent interaction plays an important role in effective human-
robot teamwork [3], [4]. A robot should be able to react to
a human’s current actions, to anticipate the user’s next action
and pro-actively adjust its behaviour accordingly. Factors such
as inter-predictability and common ground are required for
establishing such an alignment [5], [6]. Regulation of (shared)
attention, which to a large extent builds upon using the right
gaze and head behaviours [7], plays an important role in
maintaining the common ground. In a multi-party setting, the
matter becomes more complex. A mixture of conversational
behaviours directed at the main interaction partner, behaviours
directed at other people nearby to keep them included in the
conversation, and behaviours that show general awareness of
the surrounding people and environment need to be seamlessly
mixed and fluently coordinated with each other and with the actions
and utterances of others.
For a robot that is designed to be used in such a social
conversational context, the exact control of its motion ca-
pabilities is determined on multiple levels. The autonomous
level controls behaviours such as idle motions and breathing.
Secondly, the semi-autonomous level governs behaviours such
as the motions required to keep the gaze focused on a certain
target. Thirdly, there is a level for reactive behaviours such as
reflex responses to visual input. Finally, the top level consists
of deliberative behaviours such as speech or head gestures
that make up the utterances of the conversation. Part of the
expressions, especially the deliberative ones, are triggered
by requests from a dialogue manager. Other parts may be
more effectively carried out by modules running in the robot
hardware itself. This is especially true for modules that require
high frequency feedback loops such as tracking objects with
gaze or making a gesture towards a moving object.
A dialogue manager for social dialogue orchestrates the
progress of the social conversation between human and robot.
Based on this progress, the manager requests certain delibera-
tive behaviours to be executed and certain changes to be made
to parameters of the autonomous behaviour of the robot. Such
requests are typically specified using a high level behaviour
script language such as the Behaviour Markup Language
(BML), which is agnostic of the details of the robot platform
and its controls and capabilities for autonomous behaviours.
The BML scripts are then communicated to the robot platform
by a Behaviour Realiser (in this project: AsapRealizer [8]),
which interprets the BML in terms of the available controls
of the robotic embodiment. Behaviours, both autonomous and
semi-autonomous, may then be mixed into the deliberative
behaviours, either by AsapRealizer or by the robot platform
itself. Since the behaviour should respond fluently to changes
in the environment, the dialogue models as well as the robot
control mechanisms must be able to adapt on-the-fly, always
being ready to change on a moment’s notice. Any running
behaviour could be altered, interrupted or cancelled by any
of the control mechanisms to ensure the responsive nature
of the interaction. This multi-level control can include social
commands like maintaining eye contact during conversations,
as well as reactive commands like looking at sudden visually
salient movements.
In this project we worked on such seamless integration of
deliberative, (semi)autonomous behaviours for a social robot.
This introduced a challenge for an architecture for human-robot interaction. On the one hand, the robot embodiment
continuously carries out its autonomous and reactive behaviour
patterns. The parameters of these may be modified on the fly
based on requests by the dialogue manager. On the other hand,
the dialogue manager may request deliberative behaviours that
actually conflict with these autonomous behaviours, since the
dialogue manager does not know the exact current state of
the autonomous behaviours. The control architecture therefore
contains intelligence to prioritise, balance and mix these
multilevel requests before translating them to direct robot
controls. We call this Heterogeneous Multilevel Multimodal
Mixing (HMMM). In addition, the robotic embodiment sends
updates and predictions about the (expected) timing with
which behaviour requests from the dialogue manager will be carried out, so the dialogue manager can manage adaptive
dialogue [9]. The resulting system has been showcased in a
context in which fluent and responsive behaviour are shown
off to good advantage. To this end we have set up a robot
receptionist scenario centred around multi-party interaction
with dynamic and responsive gaze behaviour.
The remainder of this paper is structured as follows. In
Section II, we address work related to HMMM. We outline the
scenario we chose to showcase our approach in Section III.
Section IV describes the architecture of our system. This
is followed up in Section V with the requirements of our
approach. In Section VI, we describe the results we obtained
with our work on HMMM. We present our conclusions in
Section VII.
II. RELATED WORK
Many approaches to designing, implementing and evalu-
ating social robots exist, see [10]–[12]. As explained in the
introduction, for HMMM, we specifically looked at how to
realise fluent, multi-party, human-robot interaction. In this
section, we provide a high-level overview of existing work
related to the different facets of our work.
According to Bohus & Horvitz, the challenges for open-world
dialogues with both robots and virtual agents originate in their
dynamic, multi-party nature, and their situatedness in the phys-
ical world [13]. Bohus & Horvitz address these challenges by
developing a system with four core competencies: situational
awareness through computer vision; estimation of engagement
of users; multi-party turn-taking; and determination of users’
intentions. In his review of verbal and non-verbal human-
robot communication, Mavridis proposes a list of desiderata
for this field’s state-of-the-art [11]. This list supports the
importance of the challenges addressed by Bohus & Horvitz,
but also emphasises the necessity of affective interactions,
synchronicity of verbal and non-verbal behaviour, and mixed-
initiative dialogue. Whereas Mavridis focuses on requirements
for interpersonal behaviour, functional open-world dialogues
also require correct intrapersonal behaviour. In this paper, we
address this necessity of the mixing of behaviour that is gen-
erated top-down and bottom-up, as argued in the introduction.
Our approach builds on the challenges Bohus & Horvitz erected
as pillars of the field of human-robot dialogues in the wild.
Gaze behaviours can be utilised by a robot to shape en-
gagement and facilitate multi-party turn taking [14]. In a con-
versation, gaze behaviours serve various important functions,
such as enabling speakers to signal conversational roles (speaker, addressee, or side participant) [15], facilitating
turn-taking, and providing information on the structure of the
speaker’s discourse [16]. Endowing robots with the capacity to
direct their gaze at the appropriate interlocutor combined with
the capability of doing this with the correct timing leads to
more fluent conversations [17] and improves the interlocutors’
evaluation of the robot [18].
In conversations with multiple interlocutors, it is important
that the robot can accommodate to the various conversational
roles, and the shifting of these roles over the course of the
conversation. It should be clear to the interlocutors who the
robot is addressing. In multi-party interaction between hu-
mans, a speaker’s gaze behaviour can signal whom the speaker
is addressing and who is considered a side participant of the
conversation [19]. Mutlu et al. [14] found that a robot can
also utilise gaze behaviours to successfully cue conversational
roles in human participants. The shifting of roles during
conversation is accomplished through turn-taking mechanisms.
For example, the addressee at whom the speaker looks at the end of a remark is more likely to take up the role of speaker
afterwards. In turn, by looking at the speaker at the end of the
speaker’s turn, an addressee can signal that he or she can take
over the turn.
Gaze behaviours are partly (semi)autonomous, but can also
be used for deliberate and reactive behaviour. For instance, a deliberate use of gaze is directing your gaze at a cookie and staring intently at it to communicate that you desire it, whereas directing your gaze in reaction to a salient event in your vicinity is a reactive use of gaze. For this project we therefore chose
to focus on gaze behaviour as one of the modalities for
exploring heterogeneous multi-modal mixing. As explained in
the introduction, AsapRealizer already provides the necessary
functionality to incorporate gaze behaviour based on deliberate
and (semi)autonomous behaviour [2].
III. SCENARIO
We chose to let our human-robot interactions take place
in a real-life context with the robot being a receptionist for
a doctor’s appointment. Users are given the goal of visiting
one of two available doctors. The robot should be able to
draw users’ attention, welcome them, instruct them on which
way to go to visit their doctor, and bid them farewell. When
another user enters the detection range of the robot during its
interaction with the first user, it should be able to recognise and
acknowledge the second, possibly shifting its attention to that
person. This conversational setting satisfies the prerequisites
for multi-party capabilities, fluent behaviour generation, and mixing of deliberate and autonomous behaviour.
Fig. 1: An overview of the eNTERFACE system architecture, highlighting four distinct components: (1) the signal acquisition module SceneAnalyzer; (2) the dialogue manager Flipper; (3) the behaviour realiser AsapRealizer; (4) the agent controllers, for example, the Zeno or EyePi robots, or a virtual agent created in Unity.
Our first working prototype incorporated a scenario for
the receptionist robot interacting with a single user. The
dialogue with the robot revolves around users having a goal
of visiting one of two doctors. The robot assists users in
finding their way to their appointments. Below, we discuss
the conversational phases of the dialogue users can have with
the robot. Appendix A discusses our work on the dialogue
part of this project in more detail and shows the setup of the
interaction (see Fig. 5).
We chose to let the robot take initiative during the largest
part of the interaction, letting it guide users through the
dialogue in order to limit their agency and keep the in-
teraction straight and simple. Building on our ideas of a
suitable scenario for HMMM, our scenario consists of several
conversational phases: Initialise, Welcome, Instruct, Direct,
Farewell. These phases follow each other sequentially. In the
first phase, the system is initialised and parameters are set.
When a user enters the interaction range of the robot, the
Welcome phase is started: the robot acknowledges the user with
a short gaze and, if the user approaches even closer, the robot
will say ‘Hi!’ to welcome her. During the Instruct phase, the
robot will instruct the user to point at one of two nameplates
showing the doctor she wants to visit. It does so by uttering
the sentences ‘Please point at the sign with your doctor’s name
on it. Either this sign on your left, or this sign on your right.’
and by gazing and pointing at both nameplates in sync with
this verbal utterance. If the user does not seem to comply with
these instructions, the robot will try to instruct her again. If this
fails again, the robot directs her to a nearby human for further
assistance. When the user has pointed at the nameplate of a
doctor, the dialogue enters the Direct phase. In this phase, the
robot directs the user in the correct direction for her doctor,
again talking, gazing and pointing. Similar to the previous
phase, if the user does not seem to understand these directions,
the robot will direct her again before finally directing her to
a nearby human if she fails to respond for a third time. If it
turns out that the user walks off in the wrong direction after
the robot’s directions, it will call her back, directing her once
more in the correct direction. Finally, when the user walks off
in the correct direction, the robot offers her a friendly smile
and waves at her, saying ‘Goodbye!’ in the Farewell phase.
Thereafter, the system returns to the Initialise phase, ready for
a new user.
IV. GLOBAL ARCHITECTURE
The fluent behaviour generation system described in this
report uses a layered modular architecture. This architecture
is designed to separate the various processes required for
generating appropriate behaviour into standalone components,
which communicate through a common middleware. A more
comprehensive version of such an architecture is described by
Vouloutsi et al. in the EASEL project [20]. In this section we
present a streamlined version of the architecture developed
for eNTERFACE ’16, focusing primarily on the components
involved in generating fluent dialogues and behaviour. The
global architecture consists of four layers, see Fig. 1.
A. Perception
The perception module provides information about the
state of the world and the actions of an interlocutor. Such
information is crucial for making informed decisions about
which appropriate behaviours to execute in the current state of
the dialogue. The SceneAnalyzer [21] application uses a Kinect
sensor to detect persons within interaction distance. Amongst other
data, it estimates the probability that a person is speaking,
and it extracts the location of the person’s head, spine and
hands. This data is further processed to extract features such
as proxemics and (hand) gestures.
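To make this feature-extraction step concrete, the following Python sketch derives a proxemics label and a coarse left/right pointing cue from head and hand positions. The field names, coordinate conventions and thresholds are illustrative assumptions and do not mirror the actual SceneAnalyzer output; only the 3.7 m and 2.1 m zone boundaries are taken from the scenario described later.

import math
from dataclasses import dataclass

@dataclass
class PersonObservation:
    # Illustrative fields; the real SceneAnalyzer output differs.
    person_id: int
    head: tuple        # (x, y, z) in metres, robot-centred
    left_hand: tuple
    right_hand: tuple
    speaking_prob: float

def proxemics_zone(head, social=3.7, personal=2.1):
    """Map the distance to the robot onto the zones used in the scenario."""
    distance = math.sqrt(sum(c * c for c in head))
    if distance <= personal:
        return "PERSONAL"
    if distance <= social:
        return "SOCIAL"
    return "OUT_OF_RANGE"

def pointing_direction(obs, min_offset=0.4):
    """Very coarse pointing cue: a hand extended well to one side of the head."""
    if obs.left_hand[0] - obs.head[0] < -min_offset:
        return "LEFT"
    if obs.right_hand[0] - obs.head[0] > min_offset:
        return "RIGHT"
    return None

obs = PersonObservation(1, head=(0.2, 1.6, 2.0), left_hand=(-0.7, 1.2, 1.9),
                        right_hand=(0.3, 1.0, 1.9), speaking_prob=0.1)
print(proxemics_zone(obs.head), pointing_direction(obs))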
B. Dialogue Manager
The role of the Flipper [22] dialogue manager component
is to specify, monitor, and manage the flow of a dialogue.
By interpreting actions of the user, and taking into account
the current context of the interaction, the dialogue manager
selects an appropriate behavioural intent to convey to the user.
This behavioural intent is then translated to BML behaviour
commands suitable for an agent embodiment. This two-step
abstraction of interaction context to behavioural intent to BML
allows us to define a high-level flow of a dialogue that is
independent from the low-level agent controls. More details of
the high-level dialogue implementation are given in Section III
and Appendix A. Using this method, a high-level dialogue will
be able to generate behaviour for any agent platform, as long
as the agent is able to express the behavioural intents using
its platform-specific modalities.
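As a minimal illustration of this two-step abstraction, the sketch below maps one behavioural intent onto platform-specific BML strings. The lookup table and placeholder syntax are simplified assumptions; the actual Flipper templates and the AsapRealizer interface are richer (see Appendix C for the real BML).

# Hypothetical intent-to-BML lookup, keyed by (intent, platform).
BML_TEMPLATES = {
    ("acknowledgeInterlocutor", "zeno"):
        '<sze:lookAt id="ack" x="{x}" y="{y}" start="0" end="0.2"/>',
    ("acknowledgeInterlocutor", "eyepi"):
        '<epe:eyePiGaze id="ack" x="{x}" y="{y}" start="0" end="0.1"/>',
}

def intent_to_bml(intent, platform, **params):
    """Translate a high-level behavioural intent into platform-specific BML."""
    body = BML_TEMPLATES[(intent, platform)].format(**params)
    return '<bml id="bml1">' + body + '</bml>'

# The same dialogue-level intent yields different BML for each embodiment.
print(intent_to_bml("acknowledgeInterlocutor", "zeno", x=0.4, y=0.6))
print(intent_to_bml("acknowledgeInterlocutor", "eyepi", x=0.4, y=0.6))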
Fig. 2: The robotic platforms used during the project: (a) the EyePi robot; (b) the Zeno R25 robot.
C. Behaviour Realiser
AsapRealizer is a BML behaviour realiser engine that takes behaviour specifications and translates these to agent-specific control primitives [8], [23]. The realiser is capable of resolving inter-behaviour synchronisation, resulting in a detailed schedule of planned behaviour fragments, such as speech, gaze and animations [9]. These behaviour fragments are mapped to agent-specific control primitives, each of
which might have different timing constraints. Such control
primitives can include joint- or motor-rotations, text-to-speech
requests, or animation sequences. To determine the exact
timings for a specific agent, AsapRealizer relies on a nego-
tiation process with the agent embodiment. During execution
of the behaviours, AsapRealizer receives feedback from the
agent embodiment about the progress of execution, which is
necessary for planning on-the-fly interruptions and adaptations
of the planned behaviour [9]. This process is described in more
detail in Section VI-A.
D. Agent Control
We focused on controlling two specific robot agents: EyePi
(Fig. 2a) and Zeno R25 (Fig. 2b). The EyePi is a minimalistic
representation of a robotic head and eyes, offering fluent
control over gaze direction, emotional expressions, and a
collection of animation sequences. The Zeno R25 is a small
humanoid robot, offering control over gaze direction, facial
expressions, animations and speech. Whereas the EyePi has
very responsive and fluent control, the Zeno R25 offers more
modalities, such as hands and a fully expressive face. The
process of extending AsapRealizer with new embodiments
based on these control primitives is described in more detail
by Reidsma et al. [8].
V. REQUIREMENTS
Within the context of the ‘pillars’ of Bohus & Horvitz [13]
as discussed in Section II, we constructed a demonstration of
a fluent multi-party interaction. In this section we describe the
specific, additional requirements for achieving our global aim.
These revolve around three main themes: fluent behaviour gen-
eration (Section V-A), multi-party capabilities (Section V-B),
and behaviour mixing (Section V-C).
A. Fluent Behaviour Generation
Generating behaviour that can adapt fluently to external
influences introduces several requirements for our system
architecture. Figure 3 shows an abstract overview of the be-
haviour generation pipeline, consisting of a dialogue manager,
a behaviour realiser and one or more agent control engines.
Generally, the dialogue manager runs a high-level dialogue
model, that specifies how an agent should respond when inter-
acting with a user. The dialogue model sends BML behaviours
to a behaviour realiser (1), which then translates these to agent-
specific commands. These low-level commands are sent to an
agent control engine (2), which executes the actual behaviour
on an agent embodiment (for instance, a virtual human or
a robot). Van Welbergen et al. give a detailed explanation of the
processes required for performing behaviour realisation on an
agent [23]. In Section IV, we give an architectural overview
of our dialogue manager, behaviour realiser and agent control
engines.
Not all agents are identical in the way they handle behaviour
requests. Typically, a virtual character offers very predictable
controls in terms of motion and timing. For example, gazing
at a ‘red ball’ object in 0.2 seconds will be executed without
much problem by the 3D rendering engine. However, a phys-
ically embodied agent, such as a robot, might have physical
limitations to its movements, which make it more difficult to
accurately predict its movements and timings. For example,
gazing at the ‘red ball’ in 0.2 seconds might be physically
impossible due to limitations in the actuators. Depending on
what the current gaze direction is, it could instead take 0.5
seconds. This delay needs to be communicated back to the
behaviour realiser to ensure correct synchronisation with other
planned behaviours. Additionally, dynamic environmental fac-
tors such as temperature or battery level might play a role in
predicting and executing physical behaviours.
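The sketch below illustrates this kind of timing prediction for a physical gaze actuator: given the current gaze direction and a maximum angular velocity, the achievable completion time is estimated and, if it exceeds the requested time, reported back so that related behaviours can be re-timed. The velocity limit and function names are illustrative assumptions, not part of any robot's actual API.

def predict_gaze_duration(current_angle_deg, target_angle_deg, max_speed_deg_s=90.0):
    """Estimate how long a physical head/eye actuator needs to reach the target."""
    return abs(target_angle_deg - current_angle_deg) / max_speed_deg_s

def negotiate_gaze(requested_duration_s, current_angle_deg, target_angle_deg):
    """Return the duration that should be fed back to the behaviour realiser."""
    achievable = predict_gaze_duration(current_angle_deg, target_angle_deg)
    if achievable <= requested_duration_s:
        return requested_duration_s   # the request can be met as planned
    return achievable                 # the realiser must re-time related behaviour

# Gazing at the 'red ball' in 0.2 s is not achievable from 45 degrees away:
print(negotiate_gaze(0.2, current_angle_deg=0.0, target_angle_deg=45.0))  # -> 0.5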
Concretely, this means that the behaviour realiser not only
needs to negotiate in advance with the agent control engine
about the expected timing of certain gestures and actions, but that it also needs to be kept up to date about the actual execution progress. This way, it can adapt the timing of other, related behaviours, such as speech. Specifically, feedback from the agent control engine about command planning and execution (3) is required to perform inter-behaviour synchronisation. Feedback from the behaviour realiser about BML behaviour progress (4) is used to perform dialogue synchronisation and validation. This is discussed in more detail in [23].
Fig. 3: Abstract overview of the fluent behaviour generation pipeline: (1) the dialogue manager generates BML behaviour; (2) the behaviour realiser generates agent-specific commands; (3) the agent control engine delivers feedback about planning and execution of these commands; (4) the behaviour realiser provides feedback about the behaviour progress.
Specifying and implementing adequate feedback mecha-
nisms are important requirements for fluent behaviour gen-
eration and adaptation, on both the dialogue level and the
behaviour realisation level. In Section VI we discuss our
approach and give several examples where this is used to
generate more fluent behaviour patterns.
B. Multiparty Capabilities
An interaction with a user often does not take place in an
isolated, controlled environment. There is always a possibility
for distractions or interruptions, which might require an agent
to adapt its running or scheduled behaviour. Resynchronising,
rescheduling and interrupting individual behaviours is typi-
cally handled by the behaviour realiser. However, the decision
to perform these behaviour modification actions is driven
by the agent’s dialogue model, based on an interpretation
of the environment and the current interaction: ‘Is there an
interruption that is relevant? What am I doing at the moment?
Does it make sense to stop what I am doing and do something
else instead?’
Assuming that we have an agent control architecture that can
perform fluent behaviour generation, as described in the previ-
ous section, we can use the feedback about behaviour progress
to plan and execute behaviour interrupts and reschedule future
behaviours on a dialogue level. We use this functionality
to incorporate multiparty capabilities in a dialogue. For a
fluent integration of other interlocutors in an interaction, the
multiparty capabilities should include: (1) tracking of multiple
interlocutors; (2) acknowledgement of each (new) interlocutor,
well-coordinated with the ongoing interaction with the main
interlocutor; (3) assessment of each interlocutor’s priority for
gaining the focus of attention; (4) dialogue mechanisms for
interrupting and switching between interlocutors.
C. Behaviour Mixing
The final main requirement for our system concerns the
Heterogeneous Multilevel Multimodal Mixing, the necessity of
which we argued in the introduction. Autonomous behaviour,
such as breathing motions, eye blinking and temporarily
gazing at interesting objects, must be combined with deliberate
behaviours in a seamless way. We focus on head behaviours
as a use case for different types of behaviour mixing. More
specifically, we look at three types of head behaviour: gaze
direction, based on a combination of visual saliency maps;
emotional expressions, based on valence/arousal space; and
head gestures such as nodding, shaking, or deictive gaze
(pointing at an object using the head). Any robot platform
that implements these high level behaviours can be controlled
in a transparent manner by AsapRealizer. We focus on the
EyePi as a platform that additionally can mix conflicting
or complementing requests before actually executing them.
Section VI describes how we implemented these capabilities.
VI. RESULTS
In order to achieve fluent, multi-party human-robot inter-
action, we extended AsapRealizer, and implemented a design
pattern in Flipper. In this section we first describe how we
implemented fluent robot control, followed by the design
pattern through which we achieved multi-party interaction.
The remaining subsections describe how we mix various
behavioural modalities.
A. Fluent Robot Control
To realise fluent behaviour for our robots, we implemented
feedback mechanisms between them and AsapRealizer. This
involved aligning the control primitives of both robot platforms
with AsapRealizer’s BML commands and Flipper’s intents as
incorporated in the dialogue templates. Feedback is provided
on several levels: (1) feedback on whether the behaviour
has been performed or an estimation of its duration; (2)
an estimation of its duration before execution, with real-
time updates when running; (3) a combination of the former
two, including real-time adjustment of running synchronisation
points. For further detail on these levels of feedback, we refer
to [23]. We implemented these feedback mechanisms in the
EyePi and Zeno platforms.
Specifically, for the EyePi platform, these are the following:
1) It is impossible to plan this request (nack)
It is not possible to execute the requested sequence. This
might be the case if the requested time is too soon or has already passed, or if the actuators are not available.
2) Exact negotiation (–)
This feedback type will be used when the requester
wants to know when a specific sequence can be planned
such that it will be executed. The requester will need to
send a new request with the required timing based on
the negotiation result.
3) Negotiation (ack)
This feedback will be used if the requester specified that
the sequence should have a start on or after the requested
time. This is a weak request, on which the feedback will
contain the computed planning.
4) Try to execute, but motion parameters are updated (ack)
If it is possible to achieve the timing by updating
the motion parameters (within configured bounds), the
parameters will be updated and will also be sent as
feedback.
5) Will execute, but it will be late (ack)
If the requested timing cannot be met exactly, but the deviation is within the configured flexibility limit (for example, 50-100 ms), the sequence will be executed.
6) Will execute on time (ack)
If the requested timing can be met without problems.
Timing requests can be made on the start, stroke and end synchronisation points, and all of the above feedback types apply to each of these. To be able to handle stroke and end timing requests, the sequence mixer calculates the expected duration of every sequence request on arrival, based on the currently active motion parameters.
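A condensed sketch of how a timing request might be classified into the feedback types listed above is given below. The thresholds, arguments and messages are illustrative and do not mirror the actual EyePi implementation; in particular, the 'exact negotiation' type is omitted.

def classify_timing_request(requested_start, earliest_possible_start, now,
                            flexibility_s=0.1, can_adjust_parameters=False):
    """Map a sequence timing request onto one of the EyePi-style feedback types."""
    if requested_start < now:
        return ("nack", "impossible to plan: requested time has already passed")
    lateness = earliest_possible_start - requested_start
    if lateness <= 0:
        return ("ack", "will execute on time")
    if can_adjust_parameters:
        return ("ack", "will execute; motion parameters updated to meet the timing")
    if lateness <= flexibility_s:
        return ("ack", "will execute, but %.2f s late" % lateness)
    return ("nack", "impossible to plan: timing cannot be met")

print(classify_timing_request(requested_start=1.0, earliest_possible_start=1.05, now=0.0))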
B. Multiparty Interaction
In order to accommodate interrupts from a bystander, we de-
veloped general patterns for the dialogue management scripts
that are independent of the actual contents of the ongoing
dialogue. We implemented a priority system that signals the importance of the current discourse and of any event that may occur during a conversation. A priority, ranging from 1 (low) to 3 (high), is assigned to each dialogue template in Flipper (see Fig. 6), which represents the importance of the continuity of the behaviour that is linked to
the template. For example, when the robot is giving directions
to the addressee, an interruption would severely disrupt the
interaction. Therefore, the dialogues in which the robot gives
directions are given a high priority. Behaviours generated as
part of this should not be interrupted for the sake of relatively
unimportant additional events. When the robot has completed
an action, the priority threshold is lowered again.
Next to the dialogues, each additional person recognised in the scene, who is considered a bystander, also receives a priority. When a bystander is recognised for the first time,
a low priority is assigned to the bystander. The priority is
increased when the bystander actively tries to get the attention
of the robot by either talking or waving with the arms. When
either is recognised, the bystander’s priority will increase to
a medium priority. When the bystander is both talking and
waving, he or she is given a high priority.
Whether or not the robot responds to the bystander depends
on matching the priority of the bystander with the dialogue’s
priority. When the bystander’s priority is smaller than the
priority of the dialogue, the bystander will be ignored, until
the bystander’s priority is equal to or larger than the dialogue’s
priority.
When the robot responds to a bystander, the actual form of
this response is determined by the priority of the bystander.
When the bystander has a low priority, the robot switches its
gaze to the bystander to acknowledge their presence, and then
returns its gaze to the interlocutor. In case the bystander has a
medium priority, the robot will address the bystander by gazing
at the bystander and telling him or her to wait. Alternatively,
when the bystander has a high priority, the robot will tell the
main interlocutor to wait, and start a conversation with the
bystander; the main interlocutor and bystander switch roles.
After finishing the conversation with the new interlocutor, the
robot will continue with the conversation that was put on hold.
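A minimal sketch of this priority mechanism is shown below. The priority levels and responses follow the description above, while the function names and the way events are encoded are illustrative assumptions rather than the actual Flipper templates.

LOW, MEDIUM, HIGH = 1, 2, 3

def bystander_priority(is_talking, is_waving):
    """Priority of a bystander, based on how actively they seek attention."""
    if is_talking and is_waving:
        return HIGH
    if is_talking or is_waving:
        return MEDIUM
    return LOW

def respond_to_bystander(bystander_prio, dialogue_prio):
    """Decide whether and how to react, mirroring the behaviour described above."""
    if bystander_prio < dialogue_prio:
        return "ignore"
    if bystander_prio == LOW:
        return "brief gaze at bystander, then back to the interlocutor"
    if bystander_prio == MEDIUM:
        return "gaze at bystander and say 'Please wait.'"
    return "ask the interlocutor to wait and switch roles with the bystander"

# While the robot is giving directions (priority 3), only a talking and
# waving bystander gets the floor:
print(respond_to_bystander(bystander_priority(True, False), dialogue_prio=3))  # ignore
print(respond_to_bystander(bystander_priority(True, True), dialogue_prio=3))   # switch roles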
C. Behaviour Mixing
In our system, the behaviour mixing is divided into three
parts: emotion, gaze and sequence. All three generate an output, which is handled by the robot animator that converts those commands directly into movement. While the animator is robot-specific, the mixing can be reused for every robot that wants to support emotion, gaze and sequence commands.
1) Emotion Mixing: The emotion mixing part of HMMM
(Fig. 7) can be considered the simplest mixing part. It pro-
cesses input from both external requests and requests from
the gaze part. The requests are directly mixed and new output
values are calculated based on the previous state and the
requested values.
Requests that describe large sudden changes in emotion will
be processed instantly. For other requests, the emotion state
will gradually change into the requested state. Two outputs
for the robot animator are generated: motion parameters and
the current emotion. The motion parameters are generated in
the emotion mixing part as they are directly related to the
current emotion. For example, a cheerful person has sharper
and faster motions than a sleepy person. Due to time constraints, the connection with the motion parameters has not been implemented.
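The sketch below gives a minimal version of this mixing step in valence/arousal space: large jumps are applied instantly, smaller ones are approached gradually. The jump threshold and blending rate are illustrative parameters, not values from the EyePi implementation.

def mix_emotion(current, requested, jump_threshold=0.6, rate=0.1):
    """Blend the current (valence, arousal) state towards a requested state.

    Large, sudden changes are applied instantly; smaller changes are approached
    gradually, so the expression drifts rather than snaps to the new emotion.
    """
    new_state = []
    for cur, req in zip(current, requested):
        delta = req - cur
        if abs(delta) >= jump_threshold:
            new_state.append(req)                 # sudden change: apply instantly
        else:
            new_state.append(cur + rate * delta)  # small change: approach gradually
    return tuple(new_state)

state = (0.0, 0.0)                      # neutral valence/arousal
state = mix_emotion(state, (0.9, 0.8))  # 'shocked': large jump, applied at once
state = mix_emotion(state, (0.8, 0.5))  # small correction: approached gradually
print(state)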
2) Gaze Mixing: For the gaze mixing part of HMMM,
implemented on the EyePi platform (Fig. 8), two mixing
types are used: single-modal and multimodal. The single-
modal mixing processes multiple saliency maps that may come
from various sources such as low level autonomous perceptual
attention models, and high level deliberate attention in context
of the dialogue. All maps are combined into a single map,
keeping their original data intact, and fed into the mixer. The
mixer chooses the most salient point from the map, but it uses
the current state to update the map first.
Before the selection of the most salient point is done, the
mixer computes the difference with the previous map to find
related points. Due to small camera movements, but also movements of the object itself, the centres of the salient points move from frame to frame. In order not to treat every shifted centre as a new independent salient point, small changes are detected and the data from the previous point is combined with the updated data on the new point. This method prevents the ghosting of the different points when there is a moving body in front of the camera, and it also smooths the final EyePi movement.
To create more lifelike behaviour, the mixer will lose inter-
est in active salient spots by applying a logarithmic penalty
that is configurable during runtime. Whenever a spot loses interest because another spot is more interesting, the first spot receives an instant penalty for losing interest, to prevent fast interest flipping between two spots. After losing interest and receiving this penalty, another logarithmic process will reward the spot to make it more interesting again. The rate is again reconfigurable, and the total amount of interest will never exceed the originally calculated amount. The last instant reward in the system is given when a spot gains interest, again to prevent fast switching between interests. Finally, there is a threshold value that needs to be met before a spot can regain interest. The described behaviour is sketched in Fig. 4; the
exact behaviour is configurable with parameters.
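The interest dynamics sketched in Fig. 4 could be implemented along the following lines; the logarithmic decay and recovery, the instant penalty and reward, and the threshold are modelled with illustrative constants rather than the actual EyePi parameters.

import math

class InterestPoint:
    """Tracks the mixer's interest in one salient spot over time."""
    def __init__(self, salience):
        self.base = salience      # originally calculated amount of interest
        self.interest = salience
        self.active = False       # is this the spot currently being gazed at?

    def step(self, dt, decay=0.4, recovery=0.2):
        if self.active:
            # Gradually lose interest in the spot that currently has attention.
            self.interest -= decay * math.log1p(dt)
        else:
            # Gradually regain interest, but never above the original amount.
            self.interest = min(self.base, self.interest + recovery * math.log1p(dt))

    def switch(self, gains_interest, instant=0.3):
        # Instant reward on gaining interest and instant penalty on losing it,
        # both to prevent fast flipping between two spots.
        if gains_interest:
            self.interest = min(self.base, self.interest + instant)
        else:
            self.interest -= instant
        self.active = gains_interest

def most_interesting(points, threshold=0.2):
    candidates = [p for p in points if p.interest >= threshold]
    return max(candidates, key=lambda p: p.interest, default=None)

a, b = InterestPoint(1.0), InterestPoint(0.8)
a.switch(True)
for _ in range(10):          # after gazing at point a for a while...
    a.step(0.1)
    b.step(0.1)
print(most_interesting([a, b]) is b)   # ...point b has become the more interesting spot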
The multimodal mixing of gaze works as follows. After
selecting the most salient point, which is sent onwards to be used as gaze target, this point may also interact with the emotion models and the head gesture module. The autonomously generated map from the internal camera can induce ‘shocked behaviour’ by the robot, which leads to an emotional response
and a small expressive head movement. Finally, execution of
gaze behaviour can be blocked in case specific gestures are
active that would not be understandable when combined with
gaze behaviour (see also Section VI-C3).
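A compact sketch of this multimodal decision step follows: gaze output is suppressed while a blocking sequence is active, and a winning point from an autonomous map may trigger the 'shock' reaction. The field names, set representation and salience threshold are illustrative assumptions.

def multimodal_gaze_step(winning_point, active_sequences, blocking_sequences,
                         shock_salience=0.9):
    """Decide what the gaze mixer forwards to the animator and the other mixers.

    winning_point is a dict with 'salience', 'direction' and 'source'
    ('autonomous' or 'deliberate'); these names are assumptions for this sketch.
    """
    outputs = {"gaze": None, "emotion": None, "sequence": None}

    # Block gaze output while a gesture is running that would clash with it.
    if active_sequences & blocking_sequences:
        return outputs

    outputs["gaze"] = winning_point["direction"]

    # Only autonomously generated maps may trigger the shock reaction.
    if (winning_point["source"] == "autonomous"
            and winning_point["salience"] >= shock_salience):
        outputs["emotion"] = "shocked"
        outputs["sequence"] = "small_expressive_head_movement"
    return outputs

point = {"salience": 0.95, "direction": (0.1, 0.4), "source": "autonomous"}
print(multimodal_gaze_step(point, active_sequences={"nod"}, blocking_sequences={"shake"}))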
3) Sequence Mixing: The final mixing part of HMMM,
sequence mixing (Fig. 9), handles both external requests and
request from the gaze part. Sequences are pre-defined motions,
which have specific motion definitions and requirements for
every available actuator. Every robot platform will need its
own definitions for all sequences in order to complete the
mixing step. The definitions are specified on actuator level
and they have one of the following classifications:
• Required absolute motion: absolute motion is required to complete the sequence. If this is not possible, the sequence request must be rejected. It is impossible to mix this actuation with any other that controls the required actuator.
• Not-required absolute motion: this motion is still an absolute motion, but on conflicts it can be dropped.
• Relative motion: as this motion is relative, it can be added to almost every other movement by adding its value. When an actuator is near its limit, the actuation can be declined.
• Don't care: the actuator is not used, so the sequence does not care about it.
Every sequence request has its own identifier, which is used
in the feedback message in order to identify the feedback for
the external software.
The first possible rejection is done based on these classifica-
tions. The current queue will be checked and the information
from the requested sequence is retrieved from the database. If
there are any conflicts in actuator usage that can not be solved,
the request will be rejected. The second possible rejection is
based on the timing of the requested sequence. If the timing
can not be met, the request will be rejected. Both rejections
will be sent back to the requester using a feedback message.
If the sequence has passed both the actuator and timing check, the sequence planner will put it in the queue for execution. An acknowledgement is sent back to the requester and the processing of this specific request stops for
the moment.
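The two rejection steps could look roughly as follows. The classification names match the list above, while the request structure, queue representation and timing test are illustrative simplifications of the actual implementation.

REQUIRED, OPTIONAL, RELATIVE, DONT_CARE = "required", "optional", "relative", "dont_care"

def has_unsolvable_conflict(new_usage, queued_usage):
    """Two required absolute motions on the same actuator cannot be mixed;
    optional absolute motions can be dropped and relative motions added."""
    for actuator, cls in new_usage.items():
        if cls == REQUIRED and queued_usage.get(actuator, DONT_CARE) == REQUIRED:
            return True
    return False

def accept_sequence(request, queue, now):
    """Mimic the two rejection steps: actuator check first, then timing check."""
    if any(has_unsolvable_conflict(request["actuators"], q["actuators"]) for q in queue):
        return ("reject", "actuator conflict")
    if request["start"] < now:
        return ("reject", "timing cannot be met")
    queue.append(request)
    return ("ack", "queued for execution")

queue = [{"actuators": {"neck_pan": REQUIRED, "neck_tilt": RELATIVE}, "start": 0.5}]
nod = {"actuators": {"neck_tilt": REQUIRED, "neck_pan": DONT_CARE}, "start": 1.0}
print(accept_sequence(nod, queue, now=0.2))   # required vs relative on neck_tilt mixes fine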
The second part of the sequence mixing is no longer directly
part of the mixing process itself. There is a constantly running
process which will activate sequences when they are allowed to start. When a sequence is started, feedback is sent to the original requester that the sequence has started. The output to
the animator contains both the sequence and possibly adjusted
parameters in order to meet the timing. The sequence is also
transported to the gaze mixing part to operate the blocking
behaviour there. The animator itself also sends feedback on animation strokes. Once a sequence is stopped, the sequence executor reports this as feedback.
4) Animator: All mixing parts have one output in common:
an output to an animator part. The animator is implementation-specific and will differ per robot, but it needs to take the generated output from HMMM as input. These outputs are the same as the inputs of the HMMM part, yet mixed, so they should not clash. An extra data channel is added: motion parameters to adjust the speed of the movements. Note that the animator also has a feedback output, required for progress feedback.
Figure 10 provides a schematic overview of the animator as
implemented for the EyePi.
VII. DISCUSSION AND CONCLUSION
In this project, we set out to mix deliberative and
(semi)autonomous behaviour, in order to achieve fluent, multi-
party, human-robot interaction. By extending the state-of-the-
art BML realiser AsapRealizer and implementing the priority
design pattern in the dialogue manager Flipper, we were
able to achieve this. In the receptionist scenario, the robot
showed fluent behaviour when assisting one interlocutor, and, at some point during the conversation, switched to assisting the bystander instead when it recognised that the bystander was trying to attract its attention.
With our implementation, the traditional role of the robot is transformed from that of a puppet, which always needs a puppeteer, into that of an actor which tries to follow its director. It interprets the requests and tries to execute them as well as possible. This way, autonomous behaviour, such as breathing motions, eye blinking and (temporarily) gazing at interesting objects, is combined with the requests but can also override them, resulting in fluent and lifelike robot behaviour.
With the extension of AsapRealizer and the design pat-
tern implementation in Flipper to handle interrupts during a
conversation, a dialogue designer can now create responsive,
lifelike and non-static dialogues, while only having to specify the deliberate behaviours.
Fig. 4: A sketch of the interest for two interest points over time, showing the switching of interest between the points, a lost point, and the interest threshold.
This work was presented in the context of social robots.
However, by virtue of the architecture of modern BML realisers, the approach will also benefit interaction with other embodied agents, such as Virtual Humans. To ensure that AsapRealizer will stay relevant for use with Virtual Humans, we have started working on the coupling with the Unity3D game engine and editor1, which is a popular, state-of-
the-art choice for virtual and mixed reality applications, both
in research and industry. In the R3D3 project, the approach
presented here will be used to govern the interaction between
human users and a duo consisting of a robot and a virtual
human. This project revolves around having such a duo take
on receptionist and venue capabilities. HMMM will ensure that
the envisioned interactions run smoothly and will be able to
incorporate multiple users at the same time.
ACKNOWLEDGEMENTS
The authors would like to thank the eNTERFACE ’16
organisation, especially Dr. Khiet Truong, and the Design-
Lab personnel. This publication was supported by the Dutch
national program COMMIT, and has received funding from
the European Union’s Horizon 2020 research and innova-
tion programme under grant agreement No 688835 (DE-
ENIGMA), and the European Union Seventh Framework
Programme (FP7-ICT-2013-10) under grant agreement No
611971 (EASEL).
APPENDIX A
DIALOGUES WITH THE RECEPTIONIST ROBOT
Section III discussed the outline of the scenario we used
to demonstrate our work on multi-modal mixing. In this
Appendix, we explain the dialogues in more detail.
As explained in the subsection on the system architecture,
the cascading triggering of the Flipper templates eventually
leads to the dialogue templates being triggered (see Fig. IV).
Our method for realising the dialogue consists of two parts:
1http://unity3d.com/
firstly, dialogue management through conversational phases;
secondly, behaviour planning through behavioural intents.
A. Dialogue Management
The dialogue with the robot revolves around users having a
goal of visiting one of two doctors. The robot assists users in
finding their way to their appointments. Fig. 5 shows the setup
of the interaction. Building on our ideas of a suitable scenario
for HMMM, our scenario consists of several conversational
phases, see Fig. 6. Table I lists the realisations of the robot’s
behaviour during each of its actions.
In the interaction with the robot, the first phase is the
Initialisation phase, which is invisible to users. Here, when
the system is started, the initial world and user model are set
in Flipper’s information state. This happens internally, without
any behaviour of the robot being shown. The Welcome phase
consists of two actions: acknowledging and greeting the user.
Code Listing 1 shows the Flipper dialogue template which
governs the behaviour of the robot (see Appendix C). To be
triggered, it requires three preconditions to be met. Firstly,
the current conversational phase which the current user or
interlocutor is situated in must be Welcome. Additionally, the
conversational substate must not yet exist, as it has not
been created at the start of the scenario. This template can
only follow up on the first phase of the interaction and not
during any following phases, during which this substate does
exist. All of our dialogue templates use this construction to
order the steps in the interaction. Thirdly, the distance of the
current interlocutor to the robot is checked. When the system
has been started, the SceneAnalyzer continuously scans the
scene and updates the world model in the information state.
When an interlocutor is detected, she gets a unique ID and she
is tracked in the scene. We defined several zones of proximity
based on Hall’s interpersonal distances [24], with the outer
boundary of social space being 3.7 meters and that for personal
space being 2.1 meters from the robot.2 When the user comes
closer than 3.7 meters, a Flipper template triggers which sets
the user’s interpersonal distance to social.
2These zones correspond to Hall’s far phase and close phase in social
distance, respectively [24]; in our setup, we renamed them for clarity.
Fig. 5: Overview of the interaction setup: the Zeno R25 robot, the interlocutor and bystander, the Kinect, and two nameplates
for the doctors to the sides of the robot.
Together, these three preconditions trigger a number of
effects. As described above, the conversational phase (and
substate) are updated. To handle multiple users, the priority
of this action is set to a particular number. This is explained
in Section VI-B. Finally, a behavioural intent is added to a
queue of actions to be carried out by the robot. We explain
this functionality in the following subsection. When the user
steps into the personal distance of the robot, the next template
triggers, namely the one causing the robot to greet the user.
The remaining conversational phases follow a similar struc-
ture. The robot’s goal during the Instruct phase is to indicate
what users should do in order to reach their appointment.
Having welcomed a user, the robot instructs her to point to the
nameplate of the doctor with whom she has an appointment
(the Instruct template). The robot synchronises verbal and
non-verbal behaviour to both point at and gaze at each of the
nameplates in turn. The SceneAnalyzer detects whether one
of the user’s hands points either left or right. This information
is further processed and when the user has made a choice, the
next phase can be triggered. If this is not the case, the robot
waits a certain amount of time (20 seconds) before re-iterating
its instructions (InstructAgain). Again, if the user makes
a choice, the dialogue progresses to the next phase. If she
fails to express her choice within a certain amount of time
(20 seconds), the robot apologises for not being able to help
her out and directs her to a nearby human to further assist
her (DismissAfterInstruct). Then, the robot idly waits
until the user leaves and a new user enters.
When the user has indicated her choice, she enters the Direct
phase, receiving directions on how to get to her appointment
(Direct). Based on the user’s choice, the robot utters a
sentence and gazes and points in the direction in which the user
should head. Similar to the previous phase, the robot either
repeats its directions (DirectAgain) or redirects the user to
someone else (DismissAfterDirect) when the user fails
to move in the correct direction after a set amount of time (20
seconds). If, after being directed or being directed again, the
user walks into the wrong direction, the robot will call her
back to it (DirectAfterIncorrectDirection). This
happens when she leaves the ‘personal’ space of the robot
and exits the interaction range in the direction opposite to the one she should be heading in. Instead, if the user leaves the
robot’s personal space in the correct direction, the robot utters
a friendly goodbye and waves her off (Farewell).
B. Behaviour Planning
When Flipper dialogue templates trigger, their effects are
executed. As described in the previous subsection, behaviours
of the robot are triggered through behavioural intents, see
Code Listing 1. Previously, Flipper used behaviour-tags for
the specification of BML behaviour in these templates. We
replaced these tags with behavioural intents to accommodate
for different realisations of behaviour, e.g., by different robots
or virtual agents. To this end, the intents are a higher-order specification than the explicit BML commands. In Code List-
ing 1 (see Appendix C), the robot’s intent is to acknowledge
the current interlocutor. A request with this intent is added to
a queue of behaviours to be planned by AsapRealizer. Then,
it is up to the realiser to plan this behaviour for a specific robot
or virtual agent. The advantage of this approach is that the
dialogue remains realiser-agnostic: for a different entity, the
BML needs to be specified for each behaviour, separated from
the dialogue templates. This planner uses Flipper templates to
take the first intent in the queue of planned intents and checks
which type of behaviour should be planned. Based on this
information, it carries out optional translations of information
from the SceneAnalyzer. In the case of the Acknowledge
intent, the coordinate system of the SceneAnalyzer data is
translated to the coordinate system of the Zeno robot, so
that it is able to look at the correct position of the user’s
head. Then, this template calls AsapRealizer to realise this
behaviour using BML. Code Listing 2 (see Appendix C) shows the BML behaviours used by the Zeno and the EyePi robots, respectively, for the Acknowledge intent.
Fig. 6: Schematic overview of the receptionist scenario, showing the conversational phases (Initialise, Welcome, Instruct, Direct, Farewell), the transitions between them (user seen, user present, user choice, timers, correct or incorrect direction), and the responses to bystanders by priority (1: gaze; 2: gaze and ‘Please wait.’; 3: drop the current interlocutor, gaze, and start a new conversation with the new interlocutor).
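A minimal sketch of this planning step is shown below: the first intent is taken from the queue, SceneAnalyzer coordinates are converted to the robot's normalised gaze range, and a BML string is produced. The coordinate ranges and the conversion are illustrative assumptions, not the actual Flipper templates.

from collections import deque

def scene_to_zeno_gaze(head_x_m, head_y_m, x_range=(-2.0, 2.0), y_range=(0.0, 2.0)):
    """Map SceneAnalyzer metric head coordinates onto Zeno's gaze range [0, 1].

    The tracked volume used here is an assumption for illustration only."""
    norm = lambda v, lo, hi: min(1.0, max(0.0, (v - lo) / (hi - lo)))
    return norm(head_x_m, *x_range), norm(head_y_m, *y_range)

def plan_next_intent(intent_queue):
    """Take the first queued intent and turn it into a platform-specific BML behaviour."""
    request = intent_queue.popleft()
    if request["intent"] == "acknowledgeInterlocutor":
        x, y = scene_to_zeno_gaze(*request["head_position"])
        return ('<sze:lookAt id="lookAtCurrentInterlocutor" '
                'x="%.2f" y="%.2f" start="0" end="0.2"/>' % (x, y))
    raise ValueError("no BML mapping for intent %r" % request["intent"])

queue = deque([{"intent": "acknowledgeInterlocutor", "head_position": (0.4, 1.6)}])
print(plan_next_intent(queue))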
APPENDIX B
HETEROGENEOUS MULTILEVEL MULTIMODAL MIXING
Figures 7 (emotion), 8 (gaze) and 9 (sequence) contain
the schematic overviews of HMMM. The three parts handle
different behaviour requests and mix them into viable robot be-
haviour, while still using autonomous behaviour, as discussed
in Section VI-C.
APPENDIX C
CODE LISTINGS
This appendix contains code snippets from the project. Code
Listing 1 shows the dialogue template for the acknowledge-
ment behaviour of the robot. Code Listing 2 shows the BML
for the acknowledge behaviour for both the Zeno and eyePi
robots.
TABLE I: The Realisations of the Behaviour of the Zeno Robot, for Each of the Intents Shown in Fig. 6.
Intent | Behaviour realisation (verbal) | Behaviour realisation (non-verbal)
Acknowledge | (None.) | Look at user.
Greet | Hello, my name is Zeno. | Wave at user.
Instruct | Please point at the sign with your doctor’s name on it. Either this sign (1) on your left, or this sign (2) on your right. | At point (1), look at the sign on the left and point at it with the left arm; at (2), look at the sign on the right and point at it with the right arm.
InstructAgain | My apologies, maybe I was not clear. Please point at the sign with your doctor’s name on it. Either this sign (1) on your left, or this sign (2) on your right. | At point (1), look at the sign on the left and point at it with the left arm; at (2), look at the sign on the right and point at it with the right arm.
DismissAfterInstruct | (1) I’m sorry, I’m not able to help you out. My capabilities are still limited, so I was not able to understand you. (2) Please find a nearby human for further assistance. | At point (1), make a sad face; at (2), make a neutral face.
Direct | Please go to the left/right (a) for doctor Vanessa/Dirk (b). | Depending on the user’s choice, instruct the user, and look and point at the corresponding direction the user should head in (a, b).
DirectAgain | (1) My apologies. Maybe I was unclear. (2) Please go to the left/right (a) for doctor Vanessa/Dirk (b). | At point (1), make a sad face; at (2), make a neutral face. Depending on the user’s choice, instruct the user, and look and point at the corresponding direction the user should head in (a, b).
DismissAfterDirect | (1) I’m sorry, I’m not able to help you out. Please find a nearby human for further assistance. (2) | At point (1), make a sad face; at (2), make a neutral face.
DirectAfterIncorrectDirection | (1) Sorry, but you’re headed the wrong way! Please come back here. (2) | At point (1), make a confused face; at (2), make a neutral face.
Farewell | (1) That’s the way! (2) Goodbye! | At point (1), make a happy face; at (2), wave at the user.
Listing 1: The Flipper Dialogue Template for the Acknowledgement Behaviour of the Robot.
<!-- When a user has been detected, but is not yet within interaction range,
let the robot look at the user. -->
<template id="hmmmAcknowledge" name="hmmmAcknowledge">
<!-- These preconditions must be satisfied before the template triggers. -->
<preconditions>
<compare value1="$interactionContext.currentInterlocutor.cstate" value2="welcome"/>
<compare value1="$interactionContext.currentInterlocutor.csubstate" comparator="not_exists"/>
<compare value1="$interactionContext.currentInterlocutor.socialDistance" value2="SOCIAL"/>
</preconditions>
<!-- These effects result from the template triggering. -->
<effects>
<!-- Update the conversational substate so the template will not trigger again. -->
<update name="$interactionContext.currentInterlocutor.csubstate" value="acknowledged"/>
<update name="$interactionContext.currentInterlocutor.interactionStarted" value="TRUE"/>
<!-- The priority of this action is set. -->
<update name="$conversationalContext.priority" value="1"/>
<!-- The intent of the acknowledgement behaviour is added to a queue of planned behaviours. -->
<update name="$isTemp.newRequest.r.intent" value="acknowledgeInterlocutor"/>
<update name="$isTemp.newRequest.r.target" value="currentInterlocutor"/>
<update name="$isBehaviourPlanner.requests._addlast" value="$isTemp.newRequest.r"/>
<remove name="$isTemp.newRequest.r"/>
</effects>
</template>
Fig. 7: Schematic overview of the emotion mixing: external emotion requests (predefined emotions such as angry or happy, and valence/arousal values with weights) and input from the gaze part are mixed with the current state and predefined parameters (such as frequency and smoothing), producing motion parameters and the current emotion as output to the animator.
Fig. 8: Schematic overview of the gaze mixing: in the single-modal stage, multiple saliency maps (one autonomous, the others deliberate or autonomous) are combined top-down while remaining distinguishable, tracked against the previous combined map and most salient point, and a decider selects the most salient point as gaze target (reported as feedback); in the multimodal stage, the output can be blocked while specific sequences are active, and autonomous maps may trigger emotion and sequence changes (the ‘shock’ reaction), whereas deliberate maps do not. The animator handles the gaze direction actuation.
Fig. 9: Schematic overview of the sequence mixing: external sequence requests and requests from the gaze part pass an actuator check and a timing check against the sequence database (which contains timing and actuator information for predefined sequences such as nod and shake); rejected requests yield negative or negotiation feedback, accepted requests are queued by the sequence planner with positive feedback, and the execution loop's sequence executor starts them, sending sequence start/stroke/end feedback and passing the sequence and motion parameters to the animator and to the gaze part.
Fig. 10: Schematic overview of the animator: LED-display control is based on emotion and gaze direction; motor control is based on sequence, gaze direction and motion parameters; the animator returns sequence stroke feedback.
Listing 2: BML Behaviours for Realisation of the Acknowledge Intent by the Zeno Robot and by the eyePi.
<!-- BML for Zeno -->
<bml id="$id$" xmlns="http://www.bml-initiative.org/bml/bml-1.0"
xmlns:sze="http://hmi.ewi.utwente.nl/zenoengine">
<!-- Zeno will look at the user’s head position. -->
<sze:lookAt id="lookAtCurrentInterlocutor"
x="$interactionContext.currentInterlocutor.x$"
y="$interactionContext.currentInterlocutor.y$"
start="0" end="0.2"/>
<!-- After having looked at the user for two seconds, Zeno will look to the front again. -->
<sze:lookAt id="lookToTheFront" x="0.5" y="0.5" start="2" end="2.2"/>
</bml>
<!-- BML for eyePi -->
<bml id="$id$" xmlns="http://www.bml-initiative.org/bml/bml-1.0"
xmlns:epe="http://hmi.ewi.utwente.nl/eyepiengine">
<!-- eyePi will look at the user’s position. -->
<epe:eyePiGaze id="lookateyepi" x="$x$" y="$y$" start="0" end="0.1"/>
</bml>
REFERENCES
[1] V. Charisi, D. P. Davison, F. Wijnen, J. van der Meij, D. Reidsma,
T. Prescott, W. Joolingen, and V. Evers, “Towards a child-robot sym-
biotic co-development: a theoretical approach,” in Proceedings of the
Fourth International Symposium on ”New Frontiers in Human-Robot
Interaction”, Canterbury, UK, M. Salem, A. Weiss, P. Baxter, and
K. Dautenhahn, Eds. Society for the Study of Artificial Intelligence &
Simulation of Behaviour, April 2015, pp. 331–336.
[2] H. van Welbergen, D. Reidsma, and S. Kopp, “An incremental multi-
modal realizer for behavior co-articulation and coordination,” in 12th
International Conference on Intelligent Virtual Agents, IVA 2012, ser.
Lecture Notes in Computer Science, Y. Nakano, M. Neff, A. Paiva, and
M. Walker, Eds., vol. 7502. Berlin: Springer Verlag, 2012, pp. 175–188. ISBN 978-3-642-33196-1, ISSN 0302-9743.
[3] G. Hoffman, “Ensemble : fluency and embodiment for robots acting
with humans,” Ph.D. dissertation, Massachusetts Institute of Technology,
2007. [Online]. Available: http://dspace.mit.edu/handle/1721.1/41705
[4] G. Hoffman and C. Breazeal, “Effects of anticipatory perceptual
simulation on practiced human-robot tasks,” Autonomous Robots,
vol. 28, no. 4, pp. 403–423, Dec. 2009. [Online]. Available: http://link.springer.com/10.1007/s10514-009-9166-3
[5] S. Kopp, “Social resonance and embodied coordination in face-to-face
conversation with artificial interlocutors,” Speech Communication,
vol. 52, no. 6, pp. 587–597, jun 2010. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167639310000312
[6] G. Klein and P. Feltovich, “Common Ground and Coordination in Joint
Activity,” Organizational simulation, pp. 1–42, 2005. [Online]. Available: http://www.springerlink.com/index/V20U533483228545.pdf; http://csel.eng.ohio-state.edu/woods/distributed/CGfinal.pdf
[7] D. Heylen, “Head gestures, gaze, and the principles of conversational
structure,” International Journal of Humanoid Robotics, vol. 03, no. 03,
pp. 241–267, 2006. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0219843606000746
[8] D. Reidsma and H. van Welbergen, “AsapRealizer in practice – A
modular and extensible architecture for a BML Realizer,” Entertainment
Computing, vol. 4, no. 3, pp. 157–169, aug 2013. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1875952113000050
[9] H. van Welbergen, D. Reidsma, and J. Zwiers, “Multimodal plan
representation for adaptable bml scheduling,” Autonomous Agents and
Multi-Agent Systems, vol. 27, no. 2, pp. 305–327, September 2013.
[10] I. Leite, C. Martinho, and A. Paiva, “Social Robots for Long-Term
Interaction: A Survey,” International Journal of Social Robotics,
vol. 5, no. 2, pp. 291–308, Jan. 2013. [Online]. Available: http://link.springer.com/10.1007/s12369-013-0178-y
[11] N. Mavridis, “A review of verbal and non-verbal human–robot interac-
tive communication,” Robotics and Autonomous Systems, vol. 63, no. P1,
pp. 22–35, 2015.
[12] L. Riek, “Wizard of Oz studies in HRI: A systematic review and new
reporting guidelines,” Journal of Human-Robot Interaction, vol. 1, no. 1,
pp. 119–136, 2012.
[13] D. Bohus and E. Horvitz, “Dialog in the open world,” in Proceedings
of the 2009 international conference on Multimodal interfaces - ICMI-
MLMI ’09. New York, New York, USA: ACM Press, 2009, p. 31.
[14] B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita, “Footing
in human-robot conversations: how robots might shape participant roles
using gaze cues,” in Proceedings of the 4th ACM/IEEE international
conference on Human robot interaction. ACM, 2009, pp. 61–68.
[15] E. Goffman, “Footing,” Semiotica, vol. 25, no. 1-2, pp. 1–30, 1979.
[16] B. Mutlu, T. Kanda, J. Forlizzi, J. Hodgins, and H. Ishiguro, “Conver-
sational gaze mechanisms for humanlike robots,” ACM Transactions on
Interactive Intelligent Systems (TiiS), vol. 1, no. 2, p. 12, 2012.
[17] A. Yamazaki, K. Yamazaki, Y. Kuno, M. Burdelski, M. Kawashima, and
H. Kuzuoka, “Precision timing in human-robot interaction: coordination
of head movement and utterance,” in Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. ACM, 2008,
pp. 131–140.
[18] J. G. Trafton, M. D. Bugajska, B. R. Fransen, and R. M. Ratwani,
“Integrating vision and audition within a cognitive architecture to track
conversations,” in Proceedings of the 3rd ACM/IEEE international
conference on Human robot interaction. ACM, 2008, pp. 201–208.
[19] H. Sacks, E. A. Schegloff, and G. Jefferson, “A simplest systematics for
the organization of turn-taking for conversation,” Language, vol. 50, no. 4, pp. 696–735,
1974.
[20] V. Vouloutsi, M. Blancas, R. Zucca, P. Omedas, D. Reidsma,
D. Davison, V. Charisi, F. Wijnen, J. van der Meij, V. Evers, D. Cameron,
S. Fernando, R. Moore, T. Prescott, D. Mazzei, M. Pieroni, L. Cominelli,
R. Garofalo, D. De Rossi, and P. F. M. J. Verschure, Towards a
Synthetic Tutor Assistant: The EASEL Project and its Architecture.
Cham: Springer International Publishing, 2016, pp. 353–364. [Online].
Available: http://dx.doi.org/10.1007/978-3-319-42417-0_32
[21] A. Zaraki, D. Mazzei, N. Lazzeri, M. Pieroni, and D. De Rossi, “Prelim-
inary implementation of context-aware attention system for humanoid
robots,” in Conference on Biomimetic and Biohybrid Systems. Springer,
2013, pp. 457–459.
[22] M. ter Maat and D. Heylen, “Flipper: An Information State Component
for Spoken Dialogue Systems,” in Intelligent Virtual Agents. Reykjavik:
Springer Verlag, 2011, pp. 470–472.
[23] H. van Welbergen, D. Reidsma, Z. Ruttkay, and J. Zwiers, “Elckerlyc,”
J. Multimodal User Interfaces, vol. 3, no. 4, pp. 271–284, 2009.
[Online]. Available: http://dx.doi.org/10.1007/s12193-010-0051-3
[24] E. T. Hall, “The hidden dimension,” 1966.