Users’ Belief Awareness in Reinforcement
Learning-based Situated Human-Robot
Dialogue Management
Emmanuel Ferreira, Grégoire Milliez, Fabrice Lefèvre and Rachid Alami
Abstract Others can have a different perception of the world than ours. Understanding this divergence is an ability, known as perspective taking in developmental psychology, that humans exploit in daily social interactions. A recent trend in robotics aims at endowing robots with similar mental mechanisms, the goal being to enable them to naturally and efficiently plan tasks and communicate about them. In this paper we address this challenge by extending a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS). The new version makes use of the robot's awareness of the users' beliefs in a reinforcement learning-based situated dialogue management optimisation procedure. The proposed solution thus enables the system to cope not only with communication ambiguities due to a noisy channel, but also with possible misunderstandings due to divergences between the beliefs of the robot and its interlocutor in a Human-Robot Interaction (HRI) context. We show the relevance of the approach by comparing different handcrafted and learnt dialogue policies, with and without divergent belief reasoning, in an in-house Pick-Place-Carry scenario by means of user trials in a simulated 3D environment.
1 Introduction
When robots and humans share a common environment, previous works have shown how much enhancing the robot's perspective taking and intention detection abilities improves its understanding of the situation and leads to more appropriate and efficient task planning and interaction strategies [2, 3, 13]. As part of the theory of mind, perspective taking is a widely studied ability in the developmental psychology literature.
Emmanuel Ferreira and Fabrice Lefèvre
LIA - University of Avignon, France, e-mail: {emmanuel.ferreira,fabrice.lefevre}@univ-avignon.fr
Grégoire Milliez and Rachid Alami
CNRS LAAS - University of Toulouse, e-mail: {gregoire.milliez,rachid.alami}@laas.fr
This broad term encompasses 1) perceptual perspective taking, whereby humans can understand that other people see the world differently, and 2) conceptual perspective taking, whereby humans can go further and attribute thoughts and feelings to other people [1]. Tversky et al. [19] explain to what extent switching between perspectives, rather than staying in an egocentric position, can improve overall dialogue efficiency in a situated context. Therefore, to make robots more socially competent, some research aims to endow robots with this ability. Among others, Breazeal et al. [2] present a learning algorithm that takes into account information about a teacher's visual perspective in order to learn specific coloured-button activation/deactivation patterns, and Trafton et al. [18] use both visual and spatial perspective taking to find out the referent indicated by a human partner. In the present study, we specifically focus on a false belief task, as part of conceptual perspective taking. First formulated in [20], this kind of task requires the ability to recognize that others can have beliefs about the world that differ from the observable reality. Breazeal et al. [3] proposed one of the first human-robot implementations, along with more advanced goal recognition skills relying on this false belief detection. In [13], a Spatial Reasoning and Knowledge component (SPARK) is presented that manages separate belief models for each agent; it was used to pass the Sally and Anne test [1] on a robotic platform. This test is a standard instance of the false belief task, where an agent has to infer the belief state of another agent whose beliefs have diverged from reality. The divergence in this case arises from modifications of the environment which one agent is unaware of and which are not directly observable, for instance the displacement of objects hidden from this agent (behind another object, for instance).
Considering this, to improve human intention understanding and the overall dialogue strategy, we incorporate divergent belief management into the multimodal situated dialogue management problem. To do so, we rely on the Partially Observable Markov Decision Process (POMDP) framework. The latter has become a reference in the Spoken Dialogue System (SDS) field [21, 17, 14] as well as in the HRI context [15, 11, 12], due to its capacity to explicitly handle part of
the inherent uncertainty of the information which the system (the robot) has to deal
with (erroneous speech recognizer, falsely recognised gestures, etc.). In the POMDP
setup, the agent maintains a distribution over possible dialogue states, the belief
state, all along the dialogue course and interacts with its perceived environment
using a Reinforcement Learning (RL) algorithm so as to maximise some expected
cumulative discounted reward [16]. Our goal here is thus to introduce the notion of divergence into belief state tracking and to add means to deal with it in the control part.
The remainder of the paper is organised as follows. Section 2 gives some details about how an agent knowledge model can be maintained in a robotic system; in Section 3 our extension of a state-of-the-art goal-oriented POMDP dialogue management framework, the Hidden Information State (HIS), is presented to take into account the users' belief states; Section 4 introduces the proposed Pick-Place-Carry false belief scenario, used to exemplify the benefit of both the perspective taking ability and its integration in a machine learning scheme. In the same section, the current system architecture and the experimental setup employed are given. The
Fig. 1: (a) Real users in front of the robot (right) and the virtual representation built
by the system (left). (b) Divergent belief example with belief state.
user trial results obtained with a learnt and a handcrafted belief-aware system are
compared in Section 5 with systems lacking perspective taking ability. Finally, in
Section 6 we discuss some conclusions and give some perspectives.
2 Agent knowledge management
As mentioned in the introduction, the spatial reasoning framework SPARK is used
for situation assessment and spatial reasoning. We briefly recap here how it works; for further details, please refer to [13]. In our system, the robot collects data
about three different entities to virtually model its environment: objects, humans
and proprioceptions (its own position, posture, etc.). Concerning objects, a model
of the environment is loaded at startup to obtain the positions of static objects (e.g.
walls, furniture, etc.). Other objects (e.g. mug, tape, etc.) are considered movable; their positions are gathered using the robot's stereo vision. Posture sensors, such as a Kinect, are used to obtain the positions of humans. These perception data allow the system to use the generated virtual model for further spatio-temporal reasoning. As an example, the system can reason about why an object is no longer perceived by a participant, and decide to keep its last known position if it recognizes a situation of occlusion, or to remove the object from its model if there is none.
Figure 1 (a) shows a field experiment with the virtual environment built by the
system from the perception data collected and enriched by the spatial reasoner. The
latter component is also used to generate facts about the objects' relative positions and the agents' affordances. Relative positions such as isIn, isNextTo, and isOn are used in multimodal dialogue management to resolve referents in users' utterances, but also to produce more natural descriptions of object positions in the robot's responses. Agents' affordances come from their ability to perceive and reach objects. The robot computes its own perception capability from the data it gets from the object position and recognition modules. For reachability, the robot computes whether it is able to reach the object with its grasping joints.
To compute the human's affordances, the robot applies its perspective taking ability. In other words, the robot has to estimate what is visible and reachable for the human according to her current position. For visibility, it computes which objects are present in a view cone emerging from the human's head: if the object can be linked to the human's head by a straight line with no obstacle and lies within the field-of-view cone, then the human is assumed to see the object and hence to know its true position. If an obstacle occludes the object, it is considered not visible to the human. Concerning reachability, a threshold of one metre is used to determine whether the human can reach an object or not.
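For illustration, here is a minimal Python sketch of these two affordance tests. Only the one-metre reachability threshold comes from the text; the cone half-angle value, the sphere-shaped occlusion test and all names are our own simplifying assumptions, not the actual SPARK implementation.

```python
import numpy as np

REACH_THRESHOLD = 1.0  # metres, threshold given in the text
CONE_HALF_ANGLE = 0.7  # radians; assumed value, not specified in the paper

def blocks_segment(obstacle, p, q):
    """Assumed helper: the obstacle is a sphere (centre, radius); test
    whether it intersects the line segment from p to q."""
    centre, radius = obstacle
    d = q - p
    t = np.clip(np.dot(centre - p, d) / np.dot(d, d), 0.0, 1.0)
    closest = p + t * d
    return np.linalg.norm(centre - closest) < radius

def is_visible(head_pos, gaze_dir, obj_pos, obstacles):
    """Visible = inside the view cone emerging from the human's head
    and not occluded by any obstacle (all positions are np.array)."""
    to_obj = obj_pos - head_pos
    dist = np.linalg.norm(to_obj)
    cos_angle = np.dot(gaze_dir, to_obj) / (np.linalg.norm(gaze_dir) * dist)
    if cos_angle < np.cos(CONE_HALF_ANGLE):
        return False  # outside the field-of-view cone
    return not any(blocks_segment(o, head_pos, obj_pos) for o in obstacles)

def is_reachable(agent_pos, obj_pos):
    """Reachable for the human = within the one-metre threshold."""
    return np.linalg.norm(obj_pos - agent_pos) <= REACH_THRESHOLD
```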
The facts generation feature allows the robot to get information about the environment, its own affordances, and the human's affordances. In daily life, humans get information about the environment through perception and dialogue. Using the perspective taking abilities of our robot, we can compute a model of each human's belief state according to what she has perceived or what the robot has told her about the environment. Two different models of the world are thus considered: one for the world state from the robot's perception and reasoning, and one for each human's belief state (computed by the robot according to what the human perceived). Each of these models is independent and logically consistent. In some cases, the robot's and the human's models of the environment can diverge. As an example, consider an object O with a property P whose value is A. If P's value changes to B and the human had no way to perceive this when it occurred, the robot will have the value B in its model (P(O) = B) while the human will still have the value A for the property P (P(O) = A). This value should not be updated in the human model until the human is actually able to perceive the change or until the robot informs her. In our scenario, this reasoning is applied to the position property.
We introduce here an example of a false belief situation (Fig. 1 (b)). A human sees a red book (RED BOOK) on the bedside table BT. She will then have this property in her belief state: P(RED BOOK) = BT. Now, while this human is away (has no perception of BT), the book is swapped with another, brown one (BROWN BOOK) from the kitchen table KT. In this example, the robot explores the environment and is aware of the new position values. The human will keep her belief until she receives new information on the current position of RED BOOK. This could come from actually seeing RED BOOK at position KT, or from seeing that RED BOOK is no longer at BT (in which case the position property value will be updated to an unknown value). Another way to update this value is for the robot to explicitly inform the user of the new position.
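This book-swap example can be captured by a few lines of per-agent bookkeeping. The sketch below is ours (class and method names are hypothetical); it only encodes the update rule stated above: a believed position changes when the agent sees the object, sees that it is gone, or is explicitly informed.

```python
class AgentBeliefModel:
    """One independent, logically consistent world model per agent."""

    def __init__(self):
        self.position = {}  # object name -> believed position (None = unknown)

    def perceive(self, obj, actual_pos, visible_places):
        """Update only from what this agent can currently see."""
        if actual_pos in visible_places:
            self.position[obj] = actual_pos       # sees the object itself
        elif self.position.get(obj) in visible_places:
            self.position[obj] = None             # sees that it is gone

    def be_informed(self, obj, pos):
        """The robot explicitly informs the agent of the new position."""
        self.position[obj] = pos

robot, human = AgentBeliefModel(), AgentBeliefModel()
robot.position["RED_BOOK"] = human.position["RED_BOOK"] = "BT"
robot.position["RED_BOOK"] = "KT"   # the swap is observed by the robot only
divergent = robot.position["RED_BOOK"] != human.position["RED_BOOK"]  # True
```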
In our system we mainly focused on position properties but this reasoning could
be straightforwardly extended to other properties such as who manipulated an ob-
ject, its content, its temperature, etc. Obviously, while this setup generalises quite easily to false beliefs about individual properties of elements of the world, more complex divergence configurations that might arise in daily interactions, for instance due to prior individual knowledge, remain out of range and should be addressed by future complementary work.
3 Belief Aware Multimodal Dialogue Management
As mentioned earlier, an important aspect of the approach is to base our user belief state management on the POMDP framework [9]. It is a generalisation of the fully-observable Markov Decision Process (MDP), which was first employed to determine an optimal mapping between situations (dialogue states) and actions for the dialogue management problem in [10]. We recall hereafter the principles of this approach that pertain to the modifications that will be introduced; more comprehensive descriptions can be found in the cited papers. This framework maintains a probability distribution over dialogue states, called the belief state, assuming the true state is unobservable. By doing so, it explicitly handles part of the inherent uncertainty on the information conveyed inside the Dialogue Manager (DM) (e.g. error-prone speech recognition and understanding processes). Thus, a POMDP
can be cast as a continuous-space MDP. The latter is a tuple $\langle B, A, T, R, \gamma\rangle$, where $B$ is the (continuous) belief state space, $A$ is the discrete action space, $T$ is a set of Markovian transition probabilities, $R : B \times A \times B \to \mathbb{R}$ is the immediate reward function, and $\gamma \in [0,1]$ is the discount factor (discounting long-term rewards). The environment evolves at each time step $t$ to a belief state $b_t$, and the agent picks an action $a_t$ according to a policy mapping belief states to actions, $\pi : B \to A$. The belief state then changes to $b_{t+1}$ according to the Markovian transition probability, $b_{t+1} \sim T(\cdot \mid b_t, a_t)$, and, following this, the agent receives a reward $r_t = R(b_t, a_t, b_{t+1})$ from the environment. The overall problem of this continuous MDP is to derive an optimal policy maximising the reward expectation; typically the discounted sum over a potentially infinite horizon, $\sum_{t=0}^{\infty} \gamma^t r_t$, is used. For a given policy $\pi$ and start belief state $b$, this quantity is called the value function: $V^{\pi}(b) = \mathbb{E}[\sum_{t \geq 0} \gamma^t r_t \mid b_0 = b, \pi]$, $\forall b \in B$. $V^{*}$ corresponds to the value function of any optimal policy $\pi^{*}$. The Q-function may be defined as an alternative to the value function; it adds a degree of freedom on the first selected action: $Q^{\pi}(b,a) = \mathbb{E}[\sum_{t \geq 0} \gamma^t r_t \mid b_0 = b, a_0 = a, \pi]$, $\forall (b,a) \in B \times A$. Like $V^{*}$, $Q^{*}$ corresponds to the action-value function of any optimal policy $\pi^{*}$. If $Q^{*}$ is known, an optimal policy can be computed directly by acting greedily with respect to it: $\pi^{*}(b) = \operatorname{arg\,max}_a Q^{*}(b,a)$, $\forall b \in B$.
However, real-world POMDP problems are often intractable due to their dimensionality (large belief state and action spaces). Among other techniques, the HIS model [21] circumvents this scaling problem for dialogue management through two main principles. First, it factors the dialogue state into three components: the user goal, the dialogue history and the last user act (see Figure 2). The possible user goals are grouped together into partitions, on the assumption that all goals from the same partition are equally probable. These partitions are built using the dependencies defined in a domain-specific ontology and the information extracted all along the dialogue from both the user and the system communicative acts. In the standard HIS model, each partition is linked to matching database entities based on static and dynamic properties that correspond to the current state of the world (e.g. the colour of an object vs. spatial relations like isOn). The combination of
a partition, the associated dialogue history (which corresponds here to a finite state machine that keeps track of the grounding status of each conveyed piece of information, e.g. informed or grounded by the user), and a possible last user action forms a dialogue state hypothesis. A probability distribution b(hyp) over the most likely hypotheses is maintained during the dialogue, and this distribution constitutes the POMDP's belief state. Second, HIS maps both the belief space (hypotheses) and the action space into a much reduced summary space where RL algorithms are tractable.

Fig. 2: Overview of the HIS extension to take into account divergent beliefs.

The summary state space is the compound of two continuous and three discrete values. The continuous values are the probabilities of the two first hypotheses, b(hyp1) and b(hyp2), while the discrete ones, extracted from the top hypothesis, are the type of the last user act (noted last uact), a partition status (noted p-status) reflecting the database matching status of the corresponding goal, and a history status (noted h-status). Likewise, system dialogue acts are simplified into a dozen summary actions such as offer, execute, explicit-confirm and request. Once the summary actions are ordered by the policy in descending order of their Q(b,a) scores, a handcrafted process checks whether the best scored action is compatible with the current set of hypotheses (e.g. for the confirm summary act this compatibility test consists in checking whether there is something to confirm in the top hypothesis). If it is compatible, a heuristic-based method maps this action back to the master space as the next system response. If not, the process is pursued with the next best scored summary action until a possible action is found.
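The selection loop just described can be summarised as follows; this is a sketch under our own naming conventions, not the actual HIS code.

```python
def select_system_act(belief, summary_actions, q_function,
                      is_compatible, map_to_master):
    """Greedy HIS-style action selection in summary space (sketch).

    q_function(b, a): learnt Q-value of summary action a in belief b.
    is_compatible:    handcrafted check against the current hypotheses
                      (e.g. 'confirm' needs something to confirm).
    map_to_master:    heuristic mapping back to a full dialogue act.
    """
    ranked = sorted(summary_actions,
                    key=lambda a: q_function(belief, a), reverse=True)
    for action in ranked:
        if is_compatible(action, belief):
            return map_to_master(action, belief)
    raise RuntimeError("no compatible summary action found")
```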
The standard HIS framework can properly handle misunderstandings due to noise in the communicative channel. However, misunderstandings can also be introduced when the user has false beliefs, negatively impacting her communicative acts. HIS has no dedicated mechanism to deal with such a situation, so it reacts as it would to classical uncertainty, asking the user to confirm hypotheses until the request matches reality, even though the mismatch could have been resolved from the first turn. Having an appropriate mechanism should therefore improve the quality and efficiency of the dialogue, preventing the user from pursuing her goal on the basis of an erroneous statement.
So, as illustrated in Figure 2 and highlighted with the orange items, we propose
to extend the summary belief state with an additional status, the divergent belief
status (noted d-status), and an additional summary action, inform divergent belief.
The d-status is employed to detect the presence of false belief situations by matching the top partition against the user facts compiled by the system (see Sec. 2), thereby highlighting divergences between the user's and the robot's points of view. Both the user and the robot facts (from the belief models, not to be mistaken for the belief state related to the dialogue representation) are considered part of the dynamic knowledge resource and are maintained independently of the internal state of the system with the techniques described in Sec. 2. In Figure 2, the top partition is about a book located on the bedside table. In the robot's model of the world (i.e. the robot facts) this book is identified as a unique entity, RED BOOK, and p-status is set to unique accordingly. However, in the user model it is identified as BROWN BOOK. This situation is considered divergent, and d-status is set to unique because there is exactly one object corresponding to that description in the user model. In this preliminary study d-status can only be unique or non-unique; further studies may consider more complex cases. The new summary action is employed for appropriate resolution and removal of the divergence. The (real) communicative acts associated with this (generic) action rely on expert design. In this first version, if this action is compatible with the current hypotheses and thus picked by the system, it explicitly informs the user of the presence and the nature of the divergence. To do so, the system uses a deny dialogue act to inform the user about the existence of a divergent point of view and lets the user agree on the updated information. Consequently, the user may pursue her original goal with the correct property instead of the obsolete one. This process is also illustrated in Figure 2, where the inform divergent belief action is mapped back to the master space.
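To make the d-status computation concrete, here is a sketch of the matching step on the Figure 2 example. The fact representation and the match helper are our own assumptions; only the unique/non-unique distinction comes from the text.

```python
def match(partition, facts):
    """Assumed helper: entities whose properties satisfy every
    constraint of the partition."""
    return [e for e, props in facts.items()
            if all(props.get(k) == v for k, v in partition.items())]

def divergent_belief_status(partition, robot_facts, user_facts):
    """d-status of the top partition: 'unique' iff exactly one entity
    matches in the user model; divergent iff the two models resolve
    the same description to different entities (sketch)."""
    robot_matches = match(partition, robot_facts)
    user_matches = match(partition, user_facts)
    d_status = "unique" if len(user_matches) == 1 else "non-unique"
    return d_status, set(robot_matches) != set(user_matches)

# Figure 2 example: 'the book on the bedside table (BT)'
partition = {"type": "book", "isOn": "BT"}
robot_facts = {"RED_BOOK": {"type": "book", "isOn": "BT"},
               "BROWN_BOOK": {"type": "book", "isOn": "KT"}}
user_facts = {"RED_BOOK": {"type": "book", "isOn": "KT"},
              "BROWN_BOOK": {"type": "book", "isOn": "BT"}}
print(divergent_belief_status(partition, robot_facts, user_facts))
# -> ('unique', True): one match in the user model, but not the same entity
```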
4 Scenario & Experimental Setup
In order to illustrate the robot's ability to deal with the user's perspective, an adapted Pick-Place-Carry scenario is used as a test-bed. The robot and the user are in a virtual flat with three rooms, in which there are different kinds of objects varying in terms of colour, type, and position (e.g. a blue mug on the kitchen table, a red book on the living room table, etc.). The user interacts with the robot using unconstrained speech (Large Vocabulary Speech Recognition) and pointing gestures to ask the robot to perform specific object manipulation tasks (e.g. move the blue mug from the living room table to the kitchen table). Multimodal dialogue is used to solve ambiguities and to request missing information until task completion (i.e. full command execution) or failure (i.e. explicit user disengagement or wrong command
Fig. 3: Architecture of the multimodal and situated dialogue system.
execution). In this study, we specifically focus on tasks where divergent beliefs are prone to be generated, as in the Sally and Anne test: a previous interaction has led the user to think that a specific object O is located at A, which is out of her view, and an event has since changed the object's position from A to B without the user's awareness, for example a change performed by another user (or by the robot) in the absence of the first one. Thereby, if the user currently wants to perform a manipulation involving O, she may do so using her own believed value (A) of the position property in her communicative act.
Concerning the simulation, the setup of [12] is applied to enable rich multimodal HRI. The open-source robotics simulator MORSE [5] is used: it provides realistic rendering through the Blender Game Engine, wide support of middleware (e.g. ROS, YARP), and reliable implementations of realistic sensors and actuators, which ease integration on real robotic platforms. It also provides the operator with immersive control of a virtual human avatar in terms of displacement, gaze, and interactions with the environment, such as object manipulation (e.g. grasping/releasing an object). This simulator is tightly coupled with the multimodal dialogue system, with the overall architecture given in Figure 3.
In the chosen architecture, the Google Web Speech API (https://www.google.com/intl/en/chrome/demos/speech.html) for Automatic Speech Recognition (ASR) is combined with a custom-defined grammar parser for Spoken Language Understanding (SLU). The spatial reasoning module, SPARK, is responsible both for detecting the user's gestures and for generating the per-agent spatial facts (see Sec. 2) used to dynamically feed the contextual knowledge base, allowing the robot to reason over different perspectives of the world. Furthermore, we also make use of a static knowledge base containing the list of all available objects (even those not perceived) and their related static properties (e.g. colour). The Gesture Recognition and Understanding (GRU) module catches the gesture events generated by SPARK during the course of the interaction. Then, a rule-based fusion engine, close to the one presented in [8], temporally aligns the monomodal inputs (speech and gesture) and merges them to convey the list of possible fused inputs to the POMDP-based DM, with speech considered as the primary modality.
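A toy version of this fusion step is sketched below; the two-second alignment window and all names are assumptions of ours, and the actual rules of [8] are more elaborate.

```python
FUSION_WINDOW = 2.0  # seconds; assumed temporal alignment window

def fuse(speech_hyps, gesture_events):
    """Attach temporally aligned pointing gestures to each speech
    hypothesis; speech stays the primary modality (sketch)."""
    fused = []
    for hyp in speech_hyps:  # hyp: {'acts': ..., 'time': t, 'score': s}
        aligned = [g for g in gesture_events
                   if abs(g["time"] - hyp["time"]) <= FUSION_WINDOW]
        fused.append({**hyp, "gestures": aligned})
    return fused  # list of possible fused inputs passed to the DM
```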
The DM implements the extended HIS framework described in Sec. 3. For the reinforcement learning setup, the sample-efficient KTD-SARSA RL algorithm [4], in combination with the Bonus Greedy exploration scheme, enables online learning of the dialogue strategy from scratch, as in [6]. The reward function gives the DM -1 for each dialogue turn, plus +20 if the right command is performed at the end of the interaction (0 otherwise). To convey the DM action back
to the user, a rule-based fission module is employed that splits the high-level DM decision into verbal and non-verbal actions. The robot's speech outputs are generated by chaining a template-based Natural Language Generation (NLG) module, which converts the sequence of concepts into text, with a Text-To-Speech (TTS) component based on the commercial Acapela TTS system (http://www.acapela-group.com/index.html). A Non-verbal Behaviour Planning and Motor Control (NVBP/MC) module produces robot postures and gestures by translating the non-verbal actions into a sequence of abstract actions such as grasp, moveTo, and release, which are then executed in the simulated environment.
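The reward scheme described in the previous paragraph is simple enough to write down directly; the following sketch (ours) is only meant to make the turn-level signal explicit.

```python
def turn_reward(dialogue_ended, command_correct):
    """-1 per system turn, +20 added at the end of the interaction if
    the right command was executed (0 otherwise), per the setup above."""
    return -1 + (20 if dialogue_ended and command_correct else 0)
```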
In this study we intend to assess the benefit of introducing divergent belief management into the multimodal situated dialogue management problem. Thereby, the scenarios of interest require situations of divergent beliefs between the user and the robot. In a real setup those scenarios often require long-term interaction context tracking. To bypass this time-consuming process in our evaluation setup, we directly give the user a corrupted goal at the beginning of her interaction: a false belief about the location value is automatically added, concerning an object not visible from the human's point of view. Although the situation is artificially generated, the same behaviour can be obtained with the spatial reasoner if the robot performs an action in self-decision mode, or if another human corrupts the scene. This setup was used to evaluate the robot's ability to deal with both classical (CLASSIC) and false belief (FB) object manipulation tasks. To do so, we compare the belief-aware learnt system's performance (noted BA-LEARNT hereafter) to a handcrafted one (noted BA-HDC), and to two other similar systems with no perspective taking ability (noted LEARNT and HDC respectively). The handcrafted policies make use of expert rules based on the information provided by the summary state to pick the next action to perform (deterministically). They are not claimed to be the best possible handcrafted policies, but they are robust enough to correctly manage an interaction with real users. The learnt policies were trained in an online learning setting with a small set of 2 expert users, who first performed 40 dialogues without FB tasks and 20 more as a method-specific adaptation (LEARNT with CLASSIC tasks vs BA-LEARNT with FB tasks). In former work we have shown that efficient policies can be learnt from a few tens of dialogue samples, owing to expert users' better tolerance of poor initial performance combined with their more consistent behaviour during interactions [7].
In the evaluation setup, 10 dialogues for each of the four proposed system configurations (the learnt policies were configured to act greedily according to the value function) were recorded from 6 distinct subjects (2 females and 4 males, around 25 years old on average) who interacted with all configurations (within-subjects study), i.e. 240 dialogues in total. 30% of the performed dialogues involved FB tasks. No user had knowledge of the current system configuration, and the configurations were presented in random order to avoid any ordering effect. At the end of each interaction, users evaluated the system in terms of task completion with an online questionnaire.
5 Results
TASK        HDC                    BA-HDC                 LEARNT                 BA-LEARNT
            Avg.R  Length  SuccR   Avg.R  Length  SuccR   Avg.R  Length  SuccR   Avg.R  Length  SuccR
CLASSIC     14.33   4.81   0.85    14.28   4.86   0.86    17.62   2.95   0.93    17.69   2.88   0.93
FB           9.78   6.67   0.72    13.05   5.61   0.83    12.72   5.94   0.83    13.89   4.78   0.83
ALL         12.97   5.36   0.82    13.92   5.08   0.85    16.15   3.85   0.90    16.55   3.45   0.90

Table 1: System performance on classic (CLASSIC), false belief (FB) and all (ALL) tasks, in terms of average cumulative discounted reward (Avg.R), average dialogue length in system turns (Length) and average success rate (SuccR).
Table 1 reports the performance obtained by the four system configurations discussed above on CLASSIC and FB tasks. These results are first given in terms of mean discounted cumulative reward (Avg.R). Given the reward function definition, this metric expresses in a single real value the two variables of improvement, namely the success rate (accuracy) and the number of turns until dialogue end (time efficiency); both metrics are nevertheless also presented for convenience. The results in Table 1 were gathered in a test condition where no exploration by the RL method is allowed; they thus consist of a simple average over the 60 dialogues performed for each method and metric.
The differences observed between LEARNT/BA-LEARNT and HDC/BA-HDC in overall performance (row ALL) show the interest of RL methods over handcrafted policies: only 60 training dialogues are enough to outperform both handcrafted solutions. On CLASSIC tasks, the performance of LEARNT vs BA-LEARNT, as well as of HDC vs BA-HDC, must be considered similar. Thus, the divergent belief resolution mechanism does not seem to impact dialogue management when divergent belief situations do not appear. For BA-HDC this was expected (in the absence of false beliefs, the rules are the same as HDC). However, the BA-LEARNT policy is learnt and its action assignment process is optimised with an additional degree of complexity (a larger state/action space than in LEARNT), so a loss could have been observed. On FB tasks, the comparisons between LEARNT and BA-LEARNT, and between HDC and BA-HDC, favour the BA systems (both show a higher success rate and a slightly more time-efficient dialogue management process, with an average gain of 1 turn). However, the quantitative comparison between the system configurations should be taken with caution due to the relatively wide confidence intervals on the considered metrics (e.g. the success rate confidence interval for row FB is around 0.2 for all system configurations). Two main reasons account for this: first, the limited number of observations for each system configuration (due to experimental cost); second, the expected marginal gain in terms of the considered metrics. Indeed, the current system is learnt against an overall task completion and efficiency criterion, and solving divergent belief situations in a pick-and-place scenario cannot be considered a critical factor influencing these criteria greatly, but rather a way to cope with an additional (non-dominant) degree of uncertainty and to improve the user experience and the naturalness of the interaction with the embodied agent.
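To make the caveat concrete: with roughly 18 FB dialogues per configuration (30% of the 60 test dialogues) and a success rate around 0.83, a normal-approximation 95% confidence interval has a half-width of about 0.17, consistent with the figure of "around 0.2" quoted above. A back-of-the-envelope sketch (ours):

```python
import math

def success_rate_ci(p_hat, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a success rate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(round(success_rate_ci(0.83, 18), 2))  # ~0.17
```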
R1: Can I help you?
U1: Bring me the book on my bedside table.

(a) R2: The brown book is not on the bedside table, it has been moved to the kitchen.
    U2: Ok, bring it to me.
    R3: I will give you the brown book that is on the kitchen table.

(b) R2: Do you want the red one?
    U2: No, the brown book.
    R3: There is no brown book in your bedroom but there is one in the kitchen.
    U3: Are you sure? Well, bring me that one.
    R4: I will give you the brown book that is on the kitchen table.

Table 2: Dialogue examples with (a) and without (b) divergent belief reasoning, in the case of an interchange, unknown from the user's point of view, between a red and a brown book.
To gain better insight into the main differences between the four dialogue strategies, we also performed a qualitative study, in which we precisely identify the behavioural differences due to introducing an FB handling mechanism in a learning setup. Overall, it is observed that confirmation acts (e.g. confirm, offer) are more accurate and less frequent for the two learnt methods. For instance, when the learnt systems are confident in the top object manipulation hypothesis, they predominantly perform the command directly rather than checking its validity further, as the handcrafted versions do. In Table 2, two dialogue samples extracted from the evaluation dataset illustrate the differences between non-BA and BA dialogue management on the same FB task (here a red book was interchanged with a brown one). If the belief divergence problem is not explicitly taken into account, as in (b), the DM can be forced to deal with an additional level of misunderstanding (see turns R2 to U3 in (b)). We can also see in (b) that the non-BA system was able to succeed at FB tasks (explaining the relatively high LEARNT performance on FB tasks). Indeed, if the object is clearly identified by the user (e.g. by colour and type), the system can release the constraint of the false position and is thus able to make an offer on (or execute) the "corrected" form of the command involving the true object position. Concerning the main differences between BA-LEARNT and BA-HDC, we observed a less systematic usage of the inform divergent belief act in the learnt case: BA-LEARNT first tries to reach a high confidence in the actual presence, in the user goal, of the object involved in the belief divergence. Furthermore, BA-LEARNT, like LEARNT, has learnt alternative mechanisms to fulfil FB tasks, such as direct execution of the user command (which also avoids misunderstanding) when the conveyed piece of information seems sufficient to identify the object.
6 Conclusion
In this paper, we described how a real-time user belief tracking framework can be used along with multimodal POMDP-based dialogue management. The evaluation of the proposed method with real users confirms that this additional information helps to achieve more efficient and natural task planning (and does not harm the handling of normal situations). Our next step will be to integrate the multimodal dialogue system on the robot and carry out evaluations in a real setting to uphold our claims in a fully realistic configuration.
Acknowledgements This work has been partly supported by the French National Research
Agency (ANR) under project reference ANR-12-CORD-0021 MaRDi.
References
1. S. Baron-Cohen, A. M. Leslie, and U. Frith. Does the autistic child have a 'theory of mind'? Cognition, 21(1):37–46, 1985.
2. C. Breazeal, M. Berlin, A. Brooks, J. Gray, and A. Thomaz. Using perspective taking to learn
from ambiguous demonstrations. Robotics and Autonomous Systems, 2006.
3. C. Breazeal, J. Gray, and M. Berlin. An embodied cognition approach to mindreading skills
for socially intelligent robots. I. J. Robotic Res., 2009.
4. L. Daubigney, M. Geist, S. Chandramohan, and O. Pietquin. A comprehensive reinforcement learning framework for dialogue management optimization. IEEE Journal of Selected Topics in Signal Processing, 6(8):891–902, 2012.
5. G. Echeverria, N. Lassabe, A. Degroote, and S. Lemaignan. Modular open robots simulation
engine: Morse. In ICRA, 2011.
6. E. Ferreira and F. Lefevre. Expert-based reward shaping and exploration scheme for boosting
policy learning of dialogue management. In ASRU, 2013.
7. E. Ferreira and F. Lefèvre. Social signal and user adaptation in reinforcement learning-based dialogue management. In Proceedings of the 2nd MLIS Workshop, pages 61–69. ACM, 2013.
8. H. Holzapfel, K. Nickel, and R. Stiefelhagen. Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures. In ICMI, 2004.
9. L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable
stochastic domains. Artificial Intelligence Journal, 101(1-2):99–134, May 1998.
10. E. Levin, R. Pieraccini, and W. Eckert. Learning dialogue strategies within the Markov decision process framework. In ASRU, 1997.
11. L. Lucignano, F. Cutugno, S. Rossi, and A. Finzi. A dialogue system for multimodal human-
robot interaction. In ICMI, 2013.
12. G. Milliez, E. Ferreira, M. Fiore, R. Alami, and F. Lefèvre. Simulating human robot interaction for dialogue learning. In SIMPAR, pages 62–73, 2014.
13. G. Milliez, M. Warnier, A. Clodic, and R. Alami. A framework for endowing interactive robot
with reasoning capabilities about perspective-taking and belief management. In ISRHIC, 2014.
14. F. Pinault and F. Lefèvre. Unsupervised clustering of probability distributions of semantic graphs for POMDP-based spoken dialogue systems with summary space. In KRPDS, 2011.
15. N. Roy, J. Pineau, and S. Thrun. Spoken dialogue management using probabilistic reasoning.
In ACL, 2000.
16. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
17. B. Thomson and S. Young. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech and Language, 24(4):562–588, 2010.
18. J. Trafton, N. Cassimatis, M. Bugajska, D. Brock, F. Mintz, and A. Schultz. Enabling effective
human-robot interaction using perspective-taking in robots. IEEE Transactions on Systems,
Man, and Cybernetics, 35(4):460–470, 2005.
19. B. Tversky, P. Lee, and S. Mainwaring. Why do speakers mix perspectives? Spatial Cognition
and Computation, 1(4):399–412, 1999.
20. H. Wimmer and J. Perner. Beliefs about beliefs: Representation and constraining function of
wrong beliefs in young children’s understanding of deception. Cognition, 13(1):103 – 128,
1983.
21. S. Young, M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson, and K. Yu. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech and Language, 24(2):150–174, 2010.
Reinforcement learning is now an acknowledged approach for optimising the interaction strategy of spoken dialogue systems. If the first considered algorithms were quite basic (like SARSA), recent works concentrated on more sophisticated methods. More attention has been paid to off-policy learning, dealing with the exploration-exploitation dilemma, sample efficiency or handling non-stationarity. New algorithms have been proposed to address these issues and have been applied to dialogue management. However, each algorithm often solves a single issue at a time, while dialogue systems exhibit all the problems at once. In this paper, we propose to apply the Kalman Temporal Differences (KTD) framework to the problem of dialogue strategy optimisation so as to address all these issues in a comprehensive manner with a single framework. Our claims are illustrated by experiments led on two real-world goal-oriented dialogue management frameworks, DIPPER and HIS.