SIP, © The Authors.
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
doi: 10.1017/ATSIP.2018.11
Engagement recognition by a latent character
model based on multimodal listener behaviors
in spoken dialogue
Engagement represents how much a user is interested in and willing to continue the current dialogue. Engagement recognition
will provide an important clue for dialogue systems to generate adaptive behaviors for the user. This paper addresses engagement
recognition based on multimodal listener behaviors of backchannels, laughing, head nodding, and eye gaze. In the annotation
of engagement, the ground-truth data often differs from one annotator to another due to the subjectivity of the perception of
engagement. To deal with this, we assume that each annotator has a latent character that affects his/her perception of engage-
ment. We propose a hierarchical Bayesian model that estimates both engagement and the character of each annotator as latent
variables. Furthermore, we integrate the engagement recognition model with automatic detection of the listener behaviors to
realize online engagement recognition. Experimental results show that the proposed model improves recognition accuracy com-
pared with other methods which do not consider the character such as majority voting. We also achieve online engagement
recognition without degrading accuracy.
Keywords: Engagement, Multimodal, Listener behaviors, Latent variable model, Dialogue
Received February ; Revised August
I. INTRODUCTION
Many spoken dialogue systems have been developed and
practically used in a variety of contexts such as user
assistants and conversational robots. The dialogue systems
effectively interact with users in specific tasks including
question answering [,], board games [], and medical
diagnoses []. However, human behaviors observed during
human-machine dialogues are much different from those
of human–human dialogues. Our ultimate goal is to realize
a dialogue system which behaves like a human being. It is
expected that these systems will permeate many aspects of
our daily lives in a symbiotic manner.
It is crucial for dialogue systems to recognize and under-
stand the conversational scene which contains a variety
of information such as dialogue states and users’ internal
states. The dialogue states can be objectively defined and
have been widely modeled by various kinds of machine
learning techniques [,]. On the other hand, the users’
internal states are difficult to define and measure objec-
tively. Many researchers have proposed recognition mod-
els for various kinds of internal states such as the level
of interest [,], understanding [], and emotion [–
Yoshida-honmachi, Sakyo-ku, Kyoto -, Japan
Corresponding author:
Koji Inoue
Email: inoue@sap.ist.i.kyoto-u.ac.jp
]. From the perspective of the relationship between
dialogue participants (i.e. between a system and a user),
other researchers have dealt with entrainment [], rapport
[–], and engagement.
In this paper, we address engagement, which represents the process by which individuals establish, maintain, and end their perceived connection to one another []. Engagement has been studied primarily in the field of human–robot interaction and is practically defined as how much a user is interested in the current dialogue.
Building and maintaining a high level of engagement
leads to natural and smooth interaction between the sys-
tem and the user. It is expected that the system can
dynamically adapt its behavior according to user engage-
ment, and increases the quality of the user experience
through the dialogue. In practice, some attempts have been
made to control turn-taking behaviors [] and dialogue
policies [,].
Engagement recognition has been widely studied from
the perspective of multimodal behavior analyses. In this
study, we propose engagement recognition based on the
scheme depicted in Fig. 1. At first, we automatically detect listener behaviors such as backchannels, laughing, head nodding, and eye gaze from signals of multimodal sensors. Recent machine learning techniques have been applied to this task and achieved sufficient accuracy []. According to
the observations of the behaviors, the level of engagement
Fig. 1. Scheme of engagement recognition.
is estimated. Although the user behaviors are objectively
dened, the perception of engagement is subjective and
may depend on each perceiver (annotator). In the annota-
tion of engagement, this subjectivity sometimes results in
inconsistencies of the ground-truth labels between annota-
tors. Previous studies integrated engagement labels among
annotators like majority voting [,]. However, the incon-
sistency among annotators suggests that each annotator
perceives engagement differently. This inconsistency can be associated with the difference of the character (e.g. per-
sonality) of annotators. To deal with this issue, we pro-
pose a hierarchical Bayesian model that takes into account
the difference of annotators, by assuming that each anno-
tator has a latent character that affects his/her percep-
tion of engagement. The proposed model estimates not
only engagement but also the character of each annota-
tor as latent variables. It is expected that the proposed
model more precisely estimates each annotator’s percep-
tion by considering the character. Finally, we integrate
the engagement recognition model with automatic detec-
tion of the multimodal listener behaviors to realize online
engagement recognition which is vital for practical spo-
ken dialogue systems. This study makes a contribution to
studies on recognition tasks containing subjectivity, in that
the proposed model takes into account the difference of
annotators.
The rest of this paper is organized as follows. We
overview related works in Section II. Section III intro-
duces the human–robot interaction corpus used in this
study and describes how to annotate user engagement. In
Section IV, the proposed model for engagement recognition
is explained based on the scheme of Fig. 1. We also demon-
strate an online processing of engagement recognition for
spoken dialogue systems in Section V. In Section VI,
experiments of engagement recognition are conducted and
analyzed. In Section VII, the paper concludes with sugges-
tions for future directions of human–robot interaction with
engagement recognition.
II. RELATED WORKS
In this section, we first summarize the definition of engage-
ment. Next, previous studies on engagement recognition
are described. Finally, several attempts to generate system
behaviors according to user engagement are introduced.
A) Definition of engagement
Engagement was originally defined in a sociology study []. This concept has been extended and variously defined in the context of dialogue research []. We categorize the definitions into two types as follows. The first type focuses on cues to start and end the dialogue. Example def-
initions are “the process by which two (or more) participants
establish, maintain, and end their perceived connection”[]
and “the process subsuming the joint, coordinated activities
by which participants initiate, maintain, join, abandon, sus-
pend, resume, or terminate an interaction”[]. This type
of engagement is related to attention and involvement [].
The focus of studies based on these definitions was when
and how the conversation starts and also ends. For exam-
ple, one of the tasks was to detect the engaged person who
wants to start the conversation with a situated robot [,]. The second type of definition is about the quality of the
connection between participants during the dialogue. For
example, engagement was defined as “how much a par-
ticipant is interested in and attentive to a conversation”[]
and “the value that a participant in an interaction attributes
to the goal of being together with the other participant(s) and
of continuing the interaction”[]. This type of engagement
is related to interest and rapport. The focus of studies based on these definitions was how the user state changes during the dialogue. Both types of definitions are important for dia-
logue systems to accomplish a high quality of dialogue. In
this study, we focus on the latter type of engagement, so that
the purpose of our recognition model is to understand the
user state during the dialogue.
B) Engagement recognition
A considerable number of studies have been made on
engagement recognition over the last decade. Engagement
recognition has been generally formulated as a binary classification problem: engaged or not (disengaged), or a category classification problem like no interest or following the conver-
sation or managing the conversation []. The useful features
for engagement recognition have been investigated and
identified from a multi-modal perspective. Non-linguistic
behaviors are commonly used as the clue to recognize
engagement because verbal information like linguistic fea-
tures is specific to the dialogue domain and content, and
speech recognition is error-prone. A series of studies on
human–agent interaction found that user engagement was
related to several features such as spatial information of the
human (e.g. location, trajectory, distance to the robot) [,
,], eye-gaze behaviors (e.g. looking at the agent, mutual
gaze) [,,–], facial information (e.g. facial move-
ment, expression, head pose) [,], conversational behav-
iors (e.g. voice activity, adjacency pair, backchannel, turn
length) [,,], laughing [], and posture []. Engagement recognition modules based on the multi-modal features were implemented in agent systems and empirically
tested with real users []. For human–human interaction,
it was also revealed that the effective features in dyadic
conversations were acoustic information [,,], facial
information [,], and low-level image features (e.g. local
pattern, RGB data) []. Furthermore, the investigation was
extended to multi-party conversations. These studies analyzed fea-
tures including audio and visual backchannels [], eye-gaze
behaviors [,], and upper body joints []. The recog-
nition models using the features mentioned above were
initially based on heuristic approaches [,,]. Recent
methods are based on machine learning techniques such
as support vector machines (SVM) [,,,], hid-
den Markov models [], and convolutional neural net-
works []. In this study, we focus on behaviors when the
user is listening to system speech, such as backchannels,
laughing, head nodding, and eye-gaze.
We also find a problem of subjectivity in the annotation
process of user engagement. Since the perception of engage-
ment is subjective, it is difficult to define the annotation
criteria objectively. Therefore, most of the previous stud-
ies conducted the annotation of engagement with multiple
annotators. One approach is to train a few annotators to
avoid disagreement among annotators [,,,]. How-
ever, when we consider natural interaction, the annotation
becomes more complicated and it is challenging to keep it consistent
among annotators. In this case, another approach is based
on ‘wisdom of crowds’ where many annotators are recruited.
Eventually, the annotations were integrated using meth-
ods such as majority labels, averaged scores, and agreed
labels [,,]. In our proposed method, the different views of the annotators are taken into account. It is expected that we can understand the differences among the annotators. Our work is novel in that the differences in annotation form the basis of our engagement recognition model.
C) Adaptive behavior generation according to
engagement
Some attempts were made to generate system behaviors
after recognizing user engagement. These works are essen-
tial to clarify the significance of engagement recognition. Although this is beyond the scope of this paper, our purpose of engage-
ment recognition is similar to those of the studies.
Turn-taking behaviors are fine-grained and could be
reective of user engagement. An interactive robot was
implemented to adjust its turn-taking behavior according
to user engagement []. For example, if a user was engaged,
the robot behaved to start a conversation with the user and
give the floor to the user. As a result, subjective evaluations of both the effectiveness of communication and user expe-
rience were improved by this behavior strategy. Besides, in
our preliminary study with a remote conversation, the anal-
ysis result implied that if a participant was engaged in the
conversation, the duration of the participant’s turn became
longer than the case of not engaged, and the frequency of
backchannels given by the counterpart was also higher [].
Dialogue strategy can also be adapted to user engage-
ment. Topic selection based on user engagement was
proposed []. The system was designed to predict user
engagement on each topic, and select the next topic which
maximizes both user engagement and the system’s pref-
erence. A chatbot system was implemented to select a
dialogue module according to user engagement []. For
example, when the user was not engaged in the conversa-
tion, the system switched the current dialogue module into
another one. Consequently, subjective evaluations such as
the appropriateness of the system utterance were improved.
Another system was designed to react to user disengage-
ment []. In an interview dialogue, when the user (intervie-
wee) was disengaged, the system gave positive feedback to
elicit more self-disclosure from the user. Another research
group investigated how to handle user disengagement in a
human–robot interaction []. They compared two kinds of
system feedback: explicit and implicit. The result of a subject
experiment suggested that the implicit strategy of inserting fillers was preferred by the users to the explicit one
where the system directly asks a question such as “Are you
listening?”.
III. DIALOGUE DATA AND
ANNOTATION OF ENGAGEMENT
In this section, we describe the dialogue data used in this
study. We conducted an annotation of user engagement
with multiple annotators. The annotation result is analyzed
to confirm inconsistencies among the annotators on the
perception of engagement.
A) Human–robot interaction corpus
We have collected a human–robot interaction corpus in
which the humanoid robot ERATO intelligent conversa-
tional android (ERICA) [,] interacted with human
subjects. ERICA was operated by another human subject,
called an operator, who was in a remote room. The dialogue
was one-on-one, and the subject and ERICA sat on chairs
facing each other. Figure 2 shows a snapshot of the dialogue.
The dialogue scenario was as follows. ERICA works in a lab-
oratory as a secretary, and the subject visited the professor.
Since the professor was absent for a while, the subject talked
with ERICA until the professor would come back.
The voice uttered by the operator was directly played with a speaker placed on ERICA in real time. When the operator spoke, the lip and head motions of ERICA were automatically generated from the prosodic information [,]. The operator also manually controlled the head
and eye-gaze motions of ERICA to express some behaviors
such as eye-contact and head nodding. We recorded the
dialogue with directed microphones, a -channel microphone array, RGB cameras, and a Kinect v sensor. After the recording, we manually annotated the conversation data including utterances, turn units, dialogue acts, backchannels, laughing, fillers, head nodding, and eye gaze (the object at which the participant is looking).
We use dialogue sessions for an annotation of subject
engagement in this paper. The subjects were females and
males, with ages ranging from teenagers to over years
Fig. 2. Setup for dialogue collection.
old. The operators were six amateur actresses in their
and s. Whereas each subject participated in only one ses-
sion, each operator was assigned several sessions. Each dia-
logue lasted about minutes. All participants were native
Japanese speakers.
B) Annotation of engagement
There are several choices to annotate ground-truth labels
of the subject engagement. The intuitive method is to ask
the subject to evaluate his/her own engagement right after
the dialogue session. However, in practice, we can obtain only one evaluation on the whole dialogue due to time constraints. It is difficult to obtain evaluations on fine-grained
phenomena such as every conversational turn. Further-
more, we sometimes observe a bias where subjects tend to
give positive evaluations of themselves. This kind of bias
was observed in other works [,]. Another method is
to ask the ERICA’s operators to evaluate the subject engage-
ment. However, it was difficult to let the actresses participate
in this annotation work due to time constraints. Similar
to the first method, we would obtain only one evaluation
on the whole dialogue, but this is not useful for the cur-
rent recognition task. This problem often happens in other
studies for building corpora because the dialogue recording
and the annotation work are done separately. Most of the
previous studies adopted a practical approach where they
asked third-party people (annotators) to evaluate engage-
ment. This approach is categorized into two types: training
a small number of annotators [,,,] and making use
of the wisdom of crowds [,]. The former type is valid
when the annotation criterion is objective. On the other
hand, the latter type is better when the criterion is subjec-
tive and when a large amount of data is needed. We took the
latter approach in this study.
We recruited females who had not participated in the dialogue data collection. Their gender was set to be the same as that of the ERICA's operators. We instructed the annotators to take the point of view of the operator. Each dialogue session was randomly assigned to five annotators. The definition of engagement was presented as “How much the subject is interested in and willing to continue the current dialogue with ERICA”. We asked the annotators to annotate the subject engagement based on the subject's behaviors while the subject was listening to ERICA's talk. Therefore, the sub-
ject engagement can be interpreted as listener engagement.
It also means that the annotators observe listener behaviors
expressed by the subject. We showed a list of listener behav-
iors that could be related to engagement, with example
descriptions. This list included facial expression, laughing,
eye gaze, backchannels, head nodding, body pose, moving
of shoulders, and moving of arms or hands. We instructed the annotators to watch the dialogue video from the viewpoint of the ERICA's operator, and to press a button when all the following conditions were met: (1) ERICA was holding the turn, (2) the subject was expressing any listener behaviors, and (3) the behavior indicated a high level of subject engagement. For condition (1), the annotators were notified of auxiliary information which showed
the timing of when the conversational turn was changed
between the subject and ERICA.
C) Analysis of annotation result
Across all annotators and sessions, the average number of button presses per session was . with a standard deviation of .. Since each annotator was randomly assigned some of the sessions, we conducted one-way ANOVA tests for both the inter-annotator and inter-session variation. As a result, we found significant differences in the average number of button presses among both the annotators (F(11, 88) = 4.64, p = 1.51 × 10^-5) and the sessions (F(19, 80) = 2.56, p = 1.92 × 10^-3). There was thus variation among not only sessions but also annotators.
In this study, we use ERICA’s conversational turn as a
unit for engagement recognition. The conversational turn
is useful for spoken dialogue systems to utilize the result of
engagement recognition because the systems typically make
https://www.cambridge.org/core/terms. https://doi.org/10.1017/ATSIP.2018.11
Downloaded from https://www.cambridge.org/core. IP address: 139.81.197.156, on 13 Sep 2018 at 00:48:48, subject to the Cambridge Core terms of use, available at
Fig. 3. Inter-annotator agreement score (Cohen’s kappa) on each pair of the annotators.
a decision on their behaviors on a turn basis. If an annotator
pressed the button more than once in a turn, we regarded
that the turn was annotated as engaged by the annotator.
We excluded short turns whose durations are smaller than
seconds, and also some turns corresponding to the greet-
ing. As a result, the total number of ERICA’s turns was
over dialogue sessions. The numbers of engaged and
not engaged turns from all annotators were and ,
respectively.
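To make this turn-level labeling concrete, the following is a minimal sketch (not the authors' code): the turn/press data layout and the minimum turn duration are assumptions, and a turn is marked as engaged when at least one button press by the annotator falls inside it.

```python
# Sketch: derive turn-level engagement labels from one annotator's button
# presses. The data layout (turns as (start, end) times in seconds, presses
# as timestamps) and the minimum turn duration are assumptions, not the
# paper's exact values.
def label_turns(turns, press_times, min_turn_dur=1.0):
    """Return 1 (engaged) for each turn containing at least one button press."""
    labels = []
    for start, end in turns:
        if end - start < min_turn_dur:
            labels.append(None)          # short turns are excluded
            continue
        pressed = any(start <= t <= end for t in press_times)
        labels.append(1 if pressed else 0)
    return labels

# Example: a press at 12.3 s falls inside the first turn only.
print(label_turns([(10.0, 15.0), (20.0, 24.0)], [12.3]))  # [1, 0]
```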
We investigated the agreement level among the annotators. The average value of Cohen's kappa coefficients over every pair of two annotators was . with a standard deviation of .. However, as Fig. 3 shows, some pairs showed coefficients higher than the moderate agreement level (larger than .). This result suggests that the annotators can be clustered into groups based on their tendencies to annotate engagement similarly. Each group regarded different behaviors as important and had different thresholds for accepting the behaviors as engaged events.
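For reference, pairwise agreement of this kind can be computed with scikit-learn's Cohen's kappa implementation. The sketch below assumes a hypothetical dict layout in which each annotator maps turn ids to binary labels, and compares only the turns shared by each pair (each session was assigned to a subset of annotators).

```python
# Sketch: pairwise Cohen's kappa between annotators, using scikit-learn.
# `labels_by_annotator` is a hypothetical dict: annotator id -> {turn_id: 0/1}.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels_by_annotator):
    scores = {}
    for a, b in combinations(labels_by_annotator, 2):
        common = sorted(set(labels_by_annotator[a]) & set(labels_by_annotator[b]))
        if not common:
            continue
        ya = [labels_by_annotator[a][t] for t in common]
        yb = [labels_by_annotator[b][t] for t in common]
        scores[(a, b)] = cohen_kappa_score(ya, yb)
    return scores
```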
We took a survey on which behaviors the annotators regarded as essential for judging the subject engagement. For every session, we asked the annotators to select all the essential behaviors used to judge the subject engagement. Table 1 lists the results of this survey. Note that we conducted this survey in total times (five annotators in sessions).
Table 1. The number of times selected by the annotators as meaningful behaviors
Listener behavior selected
Facial expression
Backchannels
Head nodding
Eye gaze
Laughing
Body pose
Moving of shoulders
Moving of arms or hands
Others
The result indicates that engagement could be related to some
listener behaviors such as facial expression, backchannels,
head nodding, eye gaze, laughing, and body pose.
In the following experiments, we use four behaviors:
backchannels, laughing, head nodding, and eye gaze. As
we have seen in the section on related works, these behaviors have been identified as indicators of engagement.
We manually annotated the occurrences of the four behav-
iors. The definition of backchannel in this annotation covered responsive interjections (such as ‘yeah’ in English and ‘un’ in Japanese) and expressive interjections (such as ‘oh’ in English and ‘he-’ in Japanese) []. Laughing was defined as vocal laughing, not including just smiling without any vocal utterance. We annotated the occurrence of head nodding based on the vertical movement of the head. The occurrence of an eye-gaze behavior is acknowledged when the subject was gazing at ERICA's face continuously for more than seconds. We decided this threshold by confirming a reasonable balance between the accuracy in Table 2 and the recall of the engaged turns. It
was challenging to annotate facial expression and body
pose due to their ambiguity. We will consider the other
behaviors which we do not use in this study as additional features in future work.
Table 2. Relationship between the occurrence of each behavior and the annotated engagement (1: occurred / engaged, 0: not occurred / not engaged)
Engagement
Behavior Accuracy
Backchannel .
Laughing .
Head nodding .
Eye gaze .
Table 3. Accuracy scores of each annotator's engagement labels when the reference labels of each behavior are used
Annotator index
Behavior
Backchannel . . . . . . . . . . . .
Laughing . . . . . . . . . . . .
Head nodding . . . . . . . . . . . .
Eye gaze . . . . . . . . . . . .
The relationship between the occurrences of the four behaviors and the annotated engagement is summarized in Table 2. Note that we used the engagement labels given by all individual annotators. The result suggests that these four behaviors are useful cues to recognize the subject engagement. We further analyzed the accuracy scores of each annotator in Table 3. The results show that each annotator has a different perspective on each behavior. For example, the engagement labels of the first annotator (index 1) are related to the labels of backchannels and head nodding. On the other hand, those of the second annotator (index 2) are related to those of laughing, head nodding, and eye gaze. This difference implies that we need to consider each annotator's different perspective on engagement.
IV. LATENT CHARACTER MODEL
In this section, we propose a hierarchical Bayesian model for
engagement recognition. As we have shown, the annotators
can be clustered into some groups based on their perception
manners. We assume that each annotator has a character which affects his/her perception of engagement. The character is a latent variable estimated from the annotation data. We call the proposed model a latent character model. This model is inspired by latent Dirichlet allocation [] and the
latent class model which estimates annotators’ abilities for a
decision task like diagnosis [].
A) Problem formulation
At first, we define the problem formulation of this engagement recognition task as follows. Engagement recognition is done for each turn of the robot (ERICA). The input is based on the listener behaviors of the user during the turn: laughing, backchannels, head nodding, and eye gaze. Each behavior is represented as binary: it occurs or not, as defined in the previous section. The input feature is a combination of the occurrences of the four behaviors and is referred to as a behavior pattern. In this study, since we use the four behaviors, the possible number of behavior patterns is 16 (= 2^4). Although the number of behavior patterns would be massive if we used many behaviors, the observed patterns are limited so that we can exclude the less-frequent patterns. The output is also binary: engaged or not, as annotated in the previous section. Note that this ground-truth label differs for each annotator.
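As a small illustration of this encoding, a behavior pattern can be represented as a 4-bit index; the bit ordering below is an arbitrary convention chosen for the sketch, not one prescribed by the paper.

```python
# Sketch: encode the four binary listener behaviors into a behavior pattern
# index. The bit ordering is an arbitrary convention chosen for illustration;
# any fixed order yields 2**4 = 16 possible patterns.
BEHAVIORS = ("backchannel", "laughing", "nodding", "gaze")

def behavior_pattern(occurrences):
    """Map a dict of binary occurrences to an integer pattern index in [0, 15]."""
    return sum(int(occurrences[b]) << m for m, b in enumerate(BEHAVIORS))

print(behavior_pattern({"backchannel": 1, "laughing": 0, "nodding": 1, "gaze": 0}))  # -> 5
```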
Fig. 4. Graphical model of latent character model.
B) Generative process
The latent character model is illustrated as the graphical model in Fig. 4. The generative process is as follows. For each annotator, the parameters of a character distribution are generated from the Dirichlet distribution as
\[
\theta_i = (\theta_{i1}, \ldots, \theta_{ik}, \ldots, \theta_{iK}) \sim \mathrm{Dirichlet}(\alpha), \quad 1 \le i \le I, \tag{1}
\]
where I, K, i, and k denote the number of annotators, the number of characters, the annotator index, and the character index, respectively, and α = (α_1, ..., α_k, ..., α_K) is a hyperparameter. The parameter θ_ik represents the probability that the i-th annotator has the k-th character. For each combination of the k-th character and the l-th behavior pattern, a parameter of an engagement distribution is generated from the beta distribution as
\[
\phi_{kl} \sim \mathrm{Beta}(\beta, \gamma), \quad 1 \le k \le K, \ 1 \le l \le L, \tag{2}
\]
where L denotes the number of behavior patterns, and β and γ are hyperparameters. The parameter φ_kl represents the probability that annotators with the k-th character interpret the l-th behavior pattern as an engaged signal.
The total number of dialogue sessions is represented as J. For the j-th session, the set of annotators assigned to this session is represented as I_j. The number of conversational turns of the robot in the j-th session is represented as N_j. For each turn, a character is generated from the categorical distribution corresponding to the i-th annotator as
\[
z_{ijn} \sim \mathrm{Categorical}(\theta_i), \quad 1 \le j \le J, \ 1 \le n \le N_j, \ i \in I_j, \tag{3}
\]
where n denotes the turn index. The input behavior pattern in this turn is represented as x_jn ∈ {1, ..., L}. Note that the behavior pattern is independent of the annotator index i. The binary engagement label is generated from the Bernoulli distribution based on the character and the input behavior pattern as
\[
y_{ijn} \sim \mathrm{Bernoulli}(\phi_{z_{ijn} x_{jn}}). \tag{4}
\]
Given the dataset of the above variables and parameters, the conditional distribution is represented as
\[
p(Y, Z, \Theta, \Phi \mid X) = p(Z \mid \Theta)\, p(Y \mid X, Z, \Phi)\, p(\Theta)\, p(\Phi), \tag{5}
\]
where the bold capital letters represent the datasets of the variables written by the corresponding small letters. Note that Θ and Φ are the model parameters, and the dataset of the behavior patterns X is given and regarded as constant.
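A compact numpy sketch of this generative story follows; the sizes and hyperparameter values are placeholders, and the code simply mirrors equations (1)-(4) rather than reproducing the authors' implementation.

```python
# Sketch of the generative process in eqs. (1)-(4) with toy sizes. The
# hyperparameter values and dataset sizes are placeholders, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
I, K, L = 12, 4, 16                  # annotators, characters, behavior patterns
alpha, beta, gamma = 1.0, 1.0, 1.0   # hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=I)   # eq. (1): character distribution per annotator
phi = rng.beta(beta, gamma, size=(K, L))            # eq. (2): engaged probability per (character, pattern)

def generate_label(i, x):
    """Generate one engagement label for annotator i and behavior pattern x."""
    z = rng.choice(K, p=theta[i])                   # eq. (3): draw a character for this turn
    return rng.binomial(1, phi[z, x])               # eq. (4): draw engaged (1) / not engaged (0)

print(generate_label(i=0, x=5))
```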
C) Training
In the training phase, the model parameters Θ and Φ are estimated. The training datasets of the behavior patterns X and the engagement labels Y are given. We use collapsed Gibbs sampling, which marginalizes out the model parameters and efficiently samples only the target variables. Here, we sample each character alternately and iteratively from its conditional probability distribution as
\[
z_{ijn} \sim p(z_{ijn} \mid X, Y, Z_{\backslash ijn}, \alpha, \beta, \gamma), \tag{6}
\]
where the model parameters Θ and Φ are marginalized out. Note that Z\ijn is the set of the characters without z_ijn. The conditional probability distribution is expanded in the same manner as in other work []. The distribution is proportional to the product of two terms as
\[
p(z_{ijn} = k \mid X, Y, Z_{\backslash ijn}, \alpha, \beta, \gamma) \propto p(z_{ijn} = k \mid Z_{\backslash ijn}, \alpha)\; p(y_{ijn} \mid X, Y_{\backslash ijn}, z_{ijn} = k, Z_{\backslash ijn}, \beta, \gamma), \tag{7}
\]
where Y\ijn is the dataset of the engagement labels without y_ijn. The first term is calculated as
\[
p(z_{ijn} = k \mid Z_{\backslash ijn}, \alpha) = \frac{p(z_{ijn} = k, Z_{\backslash ijn} \mid \alpha)}{p(Z_{\backslash ijn} \mid \alpha)} \tag{8}
\]
\[
= \frac{D_{ik \backslash ijn} + \alpha_k}{D_i - 1 + \sum_{k'=1}^{K} \alpha_{k'}}. \tag{9}
\]
Note that D_i and D_{ik\ijn} represent the number of turns to which the i-th annotator was assigned, and the number of turns where the i-th annotator had the k-th character without considering z_ijn, respectively. The expansion from equation (8) to (9) is explained in the appendix. The second term is calculated as
\[
p(y_{ijn} \mid X, Y_{\backslash ijn}, z_{ijn} = k, Z_{\backslash ijn}, \beta, \gamma) = \frac{p(Y \mid X, z_{ijn} = k, Z_{\backslash ijn}, \beta, \gamma)}{p(Y_{\backslash ijn} \mid X, Z_{\backslash ijn}, \beta, \gamma)} \tag{10}
\]
\[
= \prod_{l=1}^{L} \frac{\Gamma(N_{kl \backslash ijn} + \beta + \gamma)}{\Gamma(N_{kl \backslash ijn} + N_{ijnl} + \beta + \gamma)} \cdot \frac{\Gamma(N_{kl1 \backslash ijn} + N_{ijnl1} + \beta)}{\Gamma(N_{kl1 \backslash ijn} + \beta)} \cdot \frac{\Gamma(N_{kl0 \backslash ijn} + N_{ijnl0} + \gamma)}{\Gamma(N_{kl0 \backslash ijn} + \gamma)}, \tag{11}
\]
where Γ(·) is the gamma function. Note that N_{kl\ijn} represents the number of times the l-th behavior pattern was observed by annotators with the k-th character, excluding the current turn (i, j, n). Among them, N_{kl1\ijn} and N_{kl0\ijn} are the numbers of times the annotators gave the engaged and not engaged labels, respectively. Besides, N_{ijnl} is a binary variable indicating whether the i-th annotator observed the l-th behavior pattern in the n-th turn of the j-th session. Among them, N_{ijnl1} and N_{ijnl0} are binary variables indicating whether the annotator gave the engaged and not engaged labels, respectively. The expansion from equation (10) to (11) is also explained in the appendix.
After sampling, we select the sampling result that maximizes the joint probability of the variables as
\[
Z^{*} = \arg\max_{Z^{(r)}} \; p(Y, Z^{(r)} \mid X, \alpha, \beta, \gamma), \tag{12}
\]
where Z^(r) represents the r-th sampling result. The joint probability is expanded as
\[
p(Y, Z \mid X, \alpha, \beta, \gamma) = p(Z \mid \alpha)\, p(Y \mid X, Z, \beta, \gamma) \tag{13}
\]
\[
\propto \prod_{i=1}^{I} \frac{\prod_{k} \Gamma(D_{ik} + \alpha_k)}{\Gamma\!\left(D_i + \sum_{k} \alpha_k\right)} \times \prod_{k=1}^{K} \prod_{l=1}^{L} \frac{\Gamma(N_{kl1} + \beta)\, \Gamma(N_{kl0} + \gamma)}{\Gamma(N_{kl} + \beta + \gamma)}. \tag{14}
\]
Note that N_kl is the number of times annotators with the k-th character annotated the l-th behavior pattern. Among them, N_kl1 and N_kl0 represent the numbers of times the annotators gave the engaged and not-engaged labels, respectively. Besides, D_ik is the number of turns where the i-th annotator had the k-th character. The expansion from equation (13) to (14) is also explained in the appendix. Finally, the model parameters Θ and Φ are estimated based on the sampling result Z* as
\[
\theta_{ik} = \frac{D_{ik} + \alpha_k}{D_i + \sum_{k'=1}^{K} \alpha_{k'}}, \tag{15}
\]
\[
\phi_{kl} = \frac{N_{kl1} + \beta}{N_{kl} + \beta + \gamma}, \tag{16}
\]
where D_i, D_ik, N_kl, and N_kl1 are counted over the sampling result Z*.
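The following is a condensed sketch of one collapsed Gibbs sweep over the characters. It assumes the data are stored as flat records of (annotator, behavior pattern, label) with count arrays kept alongside, and it uses the fact that, when a single label is removed and re-inserted, the ratio of gamma functions in equation (11) reduces to a simple Beta-Bernoulli predictive term; the hyperparameters and data layout are illustrative, not the authors' implementation.

```python
# Condensed sketch of one collapsed Gibbs sweep over the latent characters
# (eqs. (6)-(11)). `records` is a list of (annotator i, behavior pattern l,
# engagement label y) tuples, `z` holds the current character assignment of
# each record, and the count arrays are D_ik (turns per annotator and
# character) and N_kl1 / N_kl0 (engaged / not engaged labels per character
# and pattern).
import numpy as np

def gibbs_sweep(records, z, D_ik, N_kl1, N_kl0, alpha, beta, gamma, rng):
    K = D_ik.shape[1]
    for idx, (i, l, y) in enumerate(records):
        k_old = z[idx]
        # Remove the current assignment from the counts (the "\ijn" statistics).
        D_ik[i, k_old] -= 1
        (N_kl1 if y == 1 else N_kl0)[k_old, l] -= 1

        # Eq. (9): prior term (its denominator is constant in k and can be dropped).
        prior = D_ik[i] + alpha
        # Eq. (11) with a single label re-inserted reduces to a Beta-Bernoulli
        # predictive probability of the observed label y under each character k.
        n1, n0 = N_kl1[:, l], N_kl0[:, l]
        if y == 1:
            likelihood = (n1 + beta) / (n1 + n0 + beta + gamma)
        else:
            likelihood = (n0 + gamma) / (n1 + n0 + beta + gamma)

        p = prior * likelihood
        k_new = rng.choice(K, p=p / p.sum())

        # Add the new assignment back to the counts.
        z[idx] = k_new
        D_ik[i, k_new] += 1
        (N_kl1 if y == 1 else N_kl0)[k_new, l] += 1
    return z
```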
D) Testing
In the testing phase, the unseen engagement label given by a target annotator is predicted based on the estimated model parameters Θ and Φ. The input data are the behavior pattern x_t ∈ {1, ..., L} and the target annotator index i ∈ {1, ..., I}. Note that t represents the turn index in the test data. The probability that the target annotator gives the engaged label on this turn is calculated by marginalizing out the character as
\[
p(y_{it} = 1 \mid x_t, i, \Theta, \Phi) = \sum_{k=1}^{K} \theta_{ik}\, \phi_{k x_t}. \tag{17}
\]
The t-th turn is recognized as engaged when this probability is higher than a threshold.
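The test-time rule in equation (17) is a single dot product per turn; a minimal sketch follows (the threshold value is a placeholder).

```python
# Sketch of the prediction rule in eq. (17): marginalize out the character.
import numpy as np

def engaged_probability(theta, phi, annotator_i, pattern_x):
    """p(y = 1 | x, i) = sum_k theta[i, k] * phi[k, x]."""
    return float(np.dot(theta[annotator_i], phi[:, pattern_x]))

def is_engaged(theta, phi, annotator_i, pattern_x, threshold=0.5):
    # The turn is recognized as engaged when the probability exceeds a
    # threshold; 0.5 here is a placeholder value.
    return engaged_probability(theta, phi, annotator_i, pattern_x) > threshold
```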
E) Related models
We summarize similar models that consider the difference of annotators. For a backchannel prediction task, a two-step conditional random field model was proposed [,]. The authors trained a prediction model per annotator. The final deci-
sion is based on voting by the individual models. Our
method trains the model based on the character, not for
each annotator, so that robust estimation is expected even
if the amount of data for each annotator is not large, which
is the case in many realistic applications. In a task of esti-
mation of empathetic states among dialogue participants, a
model classified annotators by considering both the annotators' estimation tendencies and their personalities []. The personalities correspond to the characters in our model. Their model was able to estimate the empathetic state based on a specific personality. It assumed that the personality and the input features such as the behavior patterns are independent. In our model, we assume that the character and the input features are dependent, meaning that how each behavior pattern is perceived differs for each character.
V. ONLINE PROCESSING
In order to use the engagement recognition model in spoken
dialogue systems, we have to detect the behavior patterns
automatically. In this section, we first explain automatic
detection methods of the behaviors. Then, we integrate the
detection methods with the engagement recognition model.
A) Automatic detection of behaviors
We detect backchannels and laughing from speech signals
recorded by a directed microphone. Note that we will inves-
tigate making use of a microphone array in future work.
This task has been widely studied in the context of social
signal detection []. We proposed using bi-directional long
short-term memory with connectionist temporal classification (BLSTM-CTC) []. The advantage of CTC is that
we do not need to annotate the time-wise alignment of the
social signal events. On each user utterance, we extracted
the log-Mel filterbank features as a -dimension vector
and also a delta and a delta-delta of them. The number of
dimensions of the input features was . We trained the
BLSTM-CTC model for backchannels and laughing inde-
pendently. The number of hidden layers was and the
number of units on each layer was . For training, we used
other dialogue sessions recorded in the same manner
as the dataset for engagement recognition. In the training
set, the total number of user utterances was ,. Among
them, the number of utterances containing backchannels
was , and the number of utterances containing laugh-
ing was . Then, we tested the sessions which were
used for engagement recognition. In the test dataset, the
total number of subject utterances was . Among them, the number of utterances containing backchannels was , and the number containing laughing was . Precision and
recall of backchannels were . and ., and the F1 score was .. For laughing, precision and recall were . and ., and the F1 score was .. The occurrence
probabilities of backchannel and laughing are computed for
every user utterance. We then take the maximum value
during the turn as the input for engagement recognition.
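The exact network configuration is not recoverable from the text above, so the following is only a generic BLSTM-CTC sketch in PyTorch under assumed feature and layer sizes; it illustrates how CTC avoids frame-level alignment of the social-signal events rather than reproducing the authors' actual model.

```python
# Generic BLSTM-CTC sketch in PyTorch for utterance-level social-signal
# detection, in the spirit of the approach described above. The feature
# dimension, layer sizes, and label inventory are placeholders.
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    def __init__(self, feat_dim=120, hidden=128, num_labels=2):  # labels: blank + event
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(dim=-1)  # (batch, time, num_labels)

model = BLSTMCTC()
ctc = nn.CTCLoss(blank=0)                       # CTC removes the need for frame-wise alignment
feats = torch.randn(2, 200, 120)                # two utterances of log-Mel (+delta) features
log_probs = model(feats).permute(1, 0, 2)       # CTCLoss expects (time, batch, labels)
targets = torch.tensor([1, 1])                  # one "event" label per utterance (toy example)
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([200, 200]),
           target_lengths=torch.tensor([1, 1]))
loss.backward()
```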
We detect head nodding from visual information cap-
tured by the Kinect v sensor. Detection of head nod-
ding has also been widely studied in the field of computer
vision [,]. We used LSTM for this task []. With the
Kinect v sensor, we can measure the head direction in the
3D space. We calculated a feature set containing the instan-
taneous speeds of the yaw, roll, and pitch of the head. Other
features were the average speed, average velocity, acceler-
ation and range of the head pitch over the previous
milliseconds. We trained an LSTM model with these fea-
tures whose number of dimensions was . We used a single
hidden layer with units. The dataset was the same
sessions as the engagement recognition task, and -fold
cross-validation was applied. The number of data frames
per second was about , and we made a prediction every
frames. There were prediction points in the whole
dataset, and of them were manually annotated as points
where the user was nodding. Note that we discarded the data
frames on the subject’s turn. For prediction-point-wise eval-
uation, precision and recall of head nodding frames were
. and ., and the F1 score was .. For event-wise
detection, we regarded a continuous sequence of detected
nodding as a head nodding event where the duration is
longer than milliseconds. If the sequence overlapped
with a ground-truth event, the event was correctly detected.
On an event basis, there were head nodding events.
Precision and recall of head nodding events were .
and ., and the F1 score was .. We compared the
LSTM performance with several other models such as SVM
and DNN and found that the LSTM model had the best
score []. The occurrence probability of head nodding is
estimated at every frame. We smoothed the output sequence
with a Gaussian filter where the standard deviation of the Gaussian kernel was .. We then also take the max-
imum value during the turn as the input for engagement
recognition.
The eye-gaze direction is approximated by the head ori-
entation given by the Kinect v sensor. We would be able to
detect the eye-gaze direction precisely if we used an eye tracker,
but non-contact sensors such as the Kinect v sensor are
preferable for spoken dialogue systems interacting with
users on a daily basis. Eye gaze towards the robot is detected
when the distance between the head-orientation vector and
the location of the robot's head is smaller than a threshold. We set the threshold at mm in our experiment. The
number of eye-gaze samples per second was about . In the
dataset of the sessions, there were eye-gaze samples in total, and samples were manually annotated
as looking at the robot. For frame-wise evaluation, precision
and recall of the eye-gaze towards the robot were . and
., and the F1 score was .. This result implies that the
ground-truth label of eye gaze based on the actual eye-gaze
direction is sometimes different from the head direction.
However, even if the frame-wise performance is low, it is
enough if we can detect the eye-gaze behavior, that is, con-
tinuous gaze longer than seconds. We also evaluated the
detection performance on a turn basis. There were turns
in the corpus, and the continuous eye-gaze behavior was
observed in turns of them. For this turn-wise evalua-
tion, precision and recall of the eye-gaze behavior were .
and ., and the F1 score was .. This result means
that this method is sufficient to detect the eye-gaze behav-
ior for engagement recognition. Note that we ignored some
not looking states if the duration is smaller than mil-
liseconds. To convert the estimated binary states (looking or
not) to the occurrence probability of the eye-gaze behavior,
we use a logistic function with a threshold of seconds.
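As a rough illustration of this last step, the sketch below derives a turn-level eye-gaze occurrence probability from frame-wise looking/not-looking decisions with a logistic function over the longest continuous gaze; the duration threshold and steepness are placeholders, not the paper's values.

```python
# Sketch: turn-level eye-gaze occurrence probability from frame-wise
# looking/not-looking decisions. The duration threshold and logistic
# steepness are placeholders.
import math

def gaze_probability(looking_frames, frame_rate, duration_threshold=1.0, steepness=4.0):
    """looking_frames: 0/1 per frame (1 = head oriented towards the robot)."""
    longest, run = 0, 0
    for f in looking_frames:
        run = run + 1 if f else 0
        longest = max(longest, run)
    longest_sec = longest / frame_rate
    # Logistic function centered on the continuous-gaze duration threshold.
    return 1.0 / (1.0 + math.exp(-steepness * (longest_sec - duration_threshold)))
```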
B) Integration with engagement recognition
We use the behavior detection models in the test phase of
engagement recognition. At first, the occurrence probability of the l-th behavior pattern in the t-th turn of the test dataset is calculated as
\[
p_t(l) = \prod_{m=1}^{M} p_t(m)^{b_m} \left(1 - p_t(m)\right)^{1 - b_m}, \tag{18}
\]
where M, m, p_t(m), and b_m ∈ {0, 1} denote the number of behaviors, the behavior index, the output probability of behavior m, and the occurrence of behavior m, respectively. Note that p_t(m) corresponds to the output of each behavior detection model, and the binary value b_m is based on the given behavior pattern l. For example, when the given behavior pattern l represents the case where both laughter (m = 2) and eye gaze (m = 4) occur, the binary values are represented as (b_1, b_2, b_3, b_4) = (0, 1, 0, 1). The behavior pattern l is represented by the combination of the occurrences, $l = \sum_{m=1}^{M} b_m \cdot 2^{m-1}$. The probability of the engaged label (equation (17)) is reformulated by marginalizing not only the character but also the behavior pattern with its occurrence probability as
\[
p(y_{it} = 1 \mid P_{tl}, i, \Theta, \Phi) = \sum_{k=1}^{K} \sum_{l=1}^{L} \theta_{ik}\, \phi_{kl}\, p_t(l), \tag{19}
\]
where P_tl denotes the set of the occurrence probabilities of all possible behavior patterns calculated by equation (18).
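A short sketch of this online formulation follows, assuming the four detectors each output one occurrence probability per turn; the bit ordering of patterns follows the convention used in the earlier sketch.

```python
# Sketch of the online formulation in eqs. (18)-(19): per-behavior detector
# probabilities are combined into a distribution over the 2**M behavior
# patterns and marginalized together with the character.
import numpy as np
from itertools import product

def pattern_distribution(behavior_probs):
    """Eq. (18): p_t(l) for every behavior pattern l."""
    M = len(behavior_probs)
    dist = np.empty(2 ** M)
    for bits in product((0, 1), repeat=M):
        l = sum(b << m for m, b in enumerate(bits))
        dist[l] = np.prod([p if b else 1.0 - p for p, b in zip(behavior_probs, bits)])
    return dist                                   # sums to 1

def engaged_probability_online(theta, phi, annotator_i, behavior_probs):
    """Eq. (19): sum_k sum_l theta[i, k] * phi[k, l] * p_t(l)."""
    p_l = pattern_distribution(behavior_probs)
    return float(theta[annotator_i] @ phi @ p_l)

# Example with made-up detector outputs for (backchannel, laughing, nodding, gaze):
# engaged_probability_online(theta, phi, 0, [0.8, 0.1, 0.6, 0.9])
```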
VI. EXPERIMENTAL EVALUATIONS
In this section, the latent character model is compared with
other models that do not consider the difference of the
annotators. Besides, we evaluate the accuracy of the online
implementation. Furthermore, we investigate the effective-
ness of each behavior to identify important behaviors in this
recognition task. In this experiment, the task is to recognize
each annotator’s labels. Since we observed a low agreement
among the annotators in Section III C, it does not make
sense to recognize a single overall label such as majority vot-
ing. In real-life applications, we can select an annotator or
a character appropriate for the target system. At the end of
this section, we suggest a method to select a character dis-
tribution for engagement recognition based on a personality
trait expected for a system such as a humanoid robot.
A) Experimental setup
We conducted cross-validation with the dialogue sessions: for training and the rest for testing. In the proposed model, the number of sampling iterations was , and all prior distributions were uniform. The evaluation was done for each annotator one by one, where five annotators individually annotated each dialogue session. Given the annotator index i, the probability of the engaged label (equation (17) or (19)) was calculated for each turn. Setting the threshold at ., we obtained the accuracy score, which is the ratio of the number of correctly recognized turns to the total number of turns. The final evaluation was made by averaging the accuracy scores over all annotators and the cross-validation folds. The chance level was . (= , / ,).
B) Effectiveness of character
At first, we compared the proposed model with two other methods to see the effectiveness of the character. In this experiment, we used the input behavior patterns which were manually annotated. For the proposed model, we explored an appropriate number of characters (K) by changing it from to on a trial basis. The first compared model was the same as the proposed model except for using a unique character (K = 1). The second compared models were based on other machine learning methods. We used logistic regression, SVM, and a multilayer perceptron (neural network). For each model, two types of training are considered: majority and individual. In the majority type, we integrated the training labels of the five annotators by majority voting and trained a unique model which was independent of the
annotators. In the individual type, we trained an individ-
ual model for each annotator with his/her data only and
used each model according to the input annotator index i
in the test phase. Although the individual type can learn the different tendencies of each annotator, the amount of train-
ing data decreases. Furthermore, we divided the training
data into training and validation datasets on a session basis
so that it corresponds to :. We trained each model with
the training dataset, and then tuned the parameter of each
model with the validation dataset. For logistic regression,
we tuned the weight parameter of the l-norm regulariza-
tion. For SVM, we used the radial basis function kernel and
tuned the penalty parameter of the error term. For the multilayer perceptron, we tuned the weight parameter of the
l-norm regularization for each unit. We also needed to
decide on other settings for the multilayer perceptron such
as the number of hidden layers, the number of hidden units,
the activation function, the optimization method, and the
batch size. We tried many settings and report the best result
among them.
Table 4 summarizes the accuracy scores. Among the conventional machine learning methods, the multilayer perceptron showed the highest score for both the majority and individual types. For the majority type, the best setting of the multilayer perceptron was hidden layers and hidden units. For the individual type, the best setting was hidden layer and hidden units. The difference in the number of hidden layers can be explained by the amount of available training data.
Considering the character (K ≥ 2), we improved the accuracy compared with the w/o-character models including the multilayer perceptron. The highest accuracy was . with four characters (K = 4). We conducted a paired t-test between the cases of the unique character (K = 1) and the four characters (K = 4) and found a significant difference between them (t(99) = 2.55, p = 1.24 × 10^-2). We also performed paired t-tests between the proposed model with the four characters and the multilayer perceptron models. There was a significant difference between the proposed model (K = 4) and the majority type of the multilayer perceptron (t(99) = 2.34, p = 2.15 × 10^-2).
Table 4. Engagement recognition accuracy (K is the number of characters)
Method                                              Accuracy
Chance level                                        .
Logistic regression           Majority              .
                              Individual            .
SVM                           Majority              .
                              Individual            .
Multilayer perceptron         Majority              .
                              Individual            .
Latent character (proposed)   K = 1 (no character)  .
                              K = 2                 .
                              K = 3                 .
                              K = 4                 .
                              K = 5                 .
We also found a significant difference between the proposed model (K = 4) and the individual type of the multilayer perceptron (t(99) = 2.55, p = 1.24 × 10^-2). These results indicate that the proposed model simulates each annotator's perception of engagement more accurately than the others by considering the character. Apparently, majority voting is not enough for this kind of recognition task that contains subjectivity. Although the individual model has the potential to simulate each annotator's perception, it fails to address the data sparseness problem in model training. This means that there was not enough training data for each annotator. We often face this problem when we use data collected by the wisdom-of-crowds approach, where a large number of annotators are available but the amount of data per annotator is small.
C) Evaluation with automatic behavior
detection
We evaluated the online processing described in Section V.
We compared two types of the input features in the test
phase: manually annotated and automatically detected.
Note that we used the manually annotated features for train-
ing in both cases. We also tested the number of characters (K) at only one and four. Table 5 shows the difference between the manual and automatic features. The accuracy is not degraded much even when we use the automatic detection. We performed a paired t-test on the proposed model with the four characters (K = 4), and there was no significant difference between the cases of the manual and automatic features (t(99) = 1.45, p = 1.51 × 10^-1). This
result indicates that we can apply our proposed model to
live spoken dialogue systems. Note that all detection models
can run in real time with short processing time which does
not affect the decision-making process in spoken dialogue
systems.
D) Identifying important behaviors
We examined the effectiveness of each behavior by eliminating one of the four behaviors from the feature set. We again tested the number of characters (K) at only one and four. Table 6 reports the results for both the manual and automatic features.
Table 5. Engagement recognition accuracy of the online processing
                                        Behavior
Method                                  Manual    Automatic
Logistic regression          Majority   .         .
                             Individual .         .
SVM                          Majority   .         .
                             Individual .         .
Multilayer perceptron        Majority   .         .
                             Individual .         .
Latent character (proposed)  K = 1      .         .
                             K = 4      .         .
Table 6. Recognition accuracy without each behavior of the proposed method
                    Manual          Automatic
Used behavior       K = 1   K = 4   K = 1   K = 4
All                 .       .       .       .
w/o backchannels    .       .       .       .
w/o laughing        .       .       .       .
w/o head nodding    .       .       .       .
w/o eye gaze        .       .       .       .
From this table, laughing and eye-gaze behaviors are more useful for this engagement recognition task. This result is partly consistent with the analysis of Table . It is assumed that backchannels and head nodding also indicate engagement, but they can be used more frequently than the other behaviors. While backchannel and head nodding behaviors play a role in acknowledging that the turn-taking floor will be held by the current speaker, laughing and eye-gaze behaviors express a reaction towards the spoken content. Therefore, laughing and eye-gaze behaviors are more related to the high level of engagement. However, it is thought that some backchannels, such as expressive interjections (‘oh’ in English and ‘he-’ in Japanese) [], are used to express a high level of engagement. From this perspective, there is room for further investigation to classify each behavior into categories which are correlated with the level of engagement.
E) Example of parameter training
We analyzed the result of the parameter training. The following parameters were trained using all sessions. In this example, the number of characters was four (K = 4).
The parameters of the character distribution (θ_ik) are shown in Fig. 5. The vertical axis represents the probability that each annotator has each character. It is observed that
some annotators have common patterns. We clustered the annotators based on this distribution using hierarchical clustering with the unweighted pair-group method with arithmetic mean (UPGMA). From the generated tree diagram, we extracted the three clusters reported in Table 7. Four annotators were independent (the annotators , , , ). The table also shows the averaged agreement scores of the engagement labels among the annotators inside the same cluster. All scores are over the moderate agreement level (larger than .) and also higher than the whole averaged score (.) reported in Section III C. We further analyzed the averaged agreement scores between the clusters. Table 8 reports these scores, which are lower than the in-cluster agreement scores.

Table 7. Clustered annotators based on character distribution and averaged in-cluster agreement scores
Cluster  Annotator index  Cohen's kappa
A        ,                .
B        ,,,              .
C        ,                .

Table 8. Averaged agreement scores between clusters
Cluster pair  Cohen's kappa
A - B         .
A - C         .
B - C         .
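The clustering step described above can be reproduced in outline with SciPy's average-linkage (UPGMA) routine; the distance metric and the way the tree is cut into clusters are assumptions for illustration.

```python
# Sketch: cluster annotators by their character distributions theta (I x K)
# with average linkage (UPGMA) in SciPy. The distance metric and the cut
# criterion are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

theta = np.random.default_rng(0).dirichlet(np.ones(4), size=12)  # placeholder parameters
Z = linkage(theta, method="average", metric="euclidean")         # UPGMA tree
clusters = fcluster(Z, t=3, criterion="maxclust")                # e.g. cut into three clusters
print(clusters)
```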
The parameters of the engagement distribution (φ_kl) are shown in Fig. 6. The vertical axis represents the probability that each behavior pattern is recognized as engaged by each character. Note that we excluded behavior patterns which appear less than five times in the corpus. We also show the number of times each behavior pattern is observed. The proposed model obtains a different distribution for each character. Although the first character (k = 1) seems to be reactive to some behavior patterns, it is also reactive to behaviors other than the four behaviors, because a high probability is estimated for the empty behavior pattern (nothing) where no behavior was observed. The second and third characters (k = 2, 3) show a similar tendency, but some patterns differ (e.g. BL, BG, and BNG). The fourth character (k = 4) is reactive to all behavior patterns except the empty pattern (nothing) and eye gaze only (G). Among all characters, the co-occurrence of multiple behaviors leads to a higher probability (the right side of the figure). In particular, when all behaviors are observed (BLNG), the probability becomes very high for all characters. This tendency indicates that the co-occurrence of multiple behaviors expresses a high level of engagement.
Fig. 5. Estimated parameter values of character distribution (Each value corresponds to the probability that each annotator has each character.).
Fig. 6. Estimated parameter values of engagement distribution (Each value corresponds to the probability that each behavior pattern is recognized as engaged by
each character. The number in parentheses next to the behavior pattern is the frequency of the behavior pattern in the corpus.).
F) How to determine character distribution
The advantage of the proposed model is that it can simulate various kinds of perspectives for engagement recognition by changing the character distribution. However, when we use the proposed model in a spoken dialogue system, we need to determine one character distribution to be simulated. We suggest a method based on the personality given to the dialogue system. Specifically, once we set the personality for the dialogue system, the character distribution for engagement recognition is determined. Here, as a proxy for the personality, we use the Big Five traits: extroversion, neuroticism, openness to experience, conscientiousness, and agreeableness []. For example, given the social role of a laboratory guide, the dialogue system is expected to be extroverted.
In the annotation work, we also measured the Big Five scores of each annotator []. We then trained a softmax single-layer linear regression model which maps the Big Five scores to the character distribution as shown in Fig. . Note that the weight parameters of the regression are constrained by l-norm regularization and to be non-negative. No bias term was added. Table 9 shows the regression weights and indicates that some characters are related to some personality traits. For example, extroversion is related to the first and fourth characters, followed by the second character.
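A sketch of such a constrained regression is shown below; since the text does not specify the training objective, the cross-entropy loss, the penalty weight, and the optimizer are assumptions, and only the stated constraints (non-negative weights, no bias) are taken from the paper.

```python
# Sketch: fit a non-negative, bias-free softmax regression from Big Five
# scores (I x 5) to character distributions theta (I x K), with an L2
# penalty. Objective, penalty weight, and optimizer are assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def fit_mapping(big5, theta, lam=0.1):
    I, F = big5.shape
    K = theta.shape[1]

    def loss(w_flat):
        W = w_flat.reshape(F, K)
        pred = softmax(big5 @ W, axis=1)                # no bias term
        ce = -np.sum(theta * np.log(pred + 1e-12)) / I  # cross-entropy to target distributions
        return ce + lam * np.sum(W ** 2)                # L2 regularization

    w0 = np.full(F * K, 0.1)
    res = minimize(loss, w0, bounds=[(0.0, None)] * (F * K))  # non-negative weights
    return res.x.reshape(F, K)

def predict_character(W, big5_scores):
    """Character distribution for a system with the given Big Five scores."""
    return softmax(np.asarray(big5_scores) @ W)
```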
We also tested the regression with some social roles. When a dialogue system plays the role of a laboratory guide, we set the input score on extroversion at the maximum value among the annotators, and the other scores are set to the average values. The output of the regression was θ = (0.147, 0.226, 0.038, 0.589), which means the fourth character is weighted in this social role. For another social role such as a counselor, we set the input scores on conscientiousness and agreeableness at the maximum values among the annotators, and the other scores are set to the average values. The output was θ = (0.068, 0.464, 0.109, 0.359), which means the second and fourth characters are weighted in this social role. Further investigation is required on the effectiveness of this personality control.

Table 9. Regression weights for mapping from Big Five scores to character distribution
                         Character index (k)
Big Five factor          1    2    3    4
Extroversion             .    .    .    .
Neuroticism              .    .    .    .
Openness to experience   .    .    .    .
Conscientiousness        .    .    .    .
Agreeableness            .    .    .    .
VII. CONCLUSION
We have addressed engagement recognition using listener
behaviors. Since the perception of engagement is subjec-
tive, the ground-truth labels depend on each annotator. We
assumed that each annotator has a character that affects
his/her perception of engagement. The proposed model
estimates not only user engagement but also the charac-
ter of each annotator as latent variables. The model can
simulate each annotator’s perception by considering the
character. To use the proposed model in spoken dialogue
systems, we integrated the engagement recognition model
with the automatic detection of the listener behaviors. In
the experiment, the proposed model outperforms the other
methods that use either the majority voting for label gener-
ation in training or individual training for each annotator.
Then, we evaluated the online processing with the auto-
matic detection of the listener behaviors. As a result, we
achieved online engagement recognition without degrading
accuracy. The proposed model that takes into account the
difference of annotators will contribute to other recognition tasks that contain subjectivity, such as emotion recognition. We also confirmed that the proposed model can cluster the annotators based on their character distributions and that each character has a different perspective on behaviors for
engagement recognition. From the analysis result, we can
learn the traits of each annotator or character. Therefore,
we can choose an annotator or a character that a live spoken
dialogue system wants to imitate. We also presented another
method to select a character distribution based on the Big
Five personality traits expected for the system.
A further study on adaptive behavior generation of spo-
ken dialogue systems should be conducted. Dialogue sys-
tems should consider the result of engagement recognition
and appropriately change their dialogue policy for the users. As we have seen in Section II C, little work has been done on this issue.
Fig. 7. Real-time engagement visualization tool.
We are also studying methods for utilizing
the result of engagement recognition. One possible way is
to change the dialogue policy according to user engage-
ment adaptively. For example, when a system is given a
social role of information navigation such as a laboratory
guide, the system mostly takes the dialogue initiative. In
this case, it is expected that, when the system recognizes
low-level user engagement, the system gives a feedback
response that attracts the user’s attention. Moreover, it is
also possible that the system adaptively changes the expla-
nation content according to the level of user engagement.
The explanation content can be elaborated for users with
high-level engagement. On the other hand, for users with
low-level engagement, the system should make the content
more understandable. Another way of utilizing engagement
is to use it as an evaluation metric for dialogue. Studies
on non-task oriented dialogue such as casual chatting have
tried to establish evaluation metrics including the length of
dialogue, linguistic appropriateness, and human judgment.
However, there is still no clear metric to evaluate dialogue.
We will be able to use engagement as an evaluation metric
or reference labels for training models.