Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 121 (2017) 920–926
www.elsevier.com/locate/procedia
1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the CENTERIS - International Conference on ENTERprise Information Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care Information Systems and Technologies.
10.1016/j.procs.2017.11.119

CENTERIS - International Conference on ENTERprise Information Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care Information Systems and Technologies, CENTERIS / ProjMAN / HCist 2017, 8-10 November 2017, Barcelona, Spain
Design of a Knowledge-Based Agent as a Social Companion
Leo Wannera,b,*, Elisabeth Andréc, Josep Blatb, Stamatia Dasiopouloub, Mireia Farrúsb, Thiago Fragad, Eleni Kamaterie, Florian Lingenfelserc, Gerard Llorachb, Oriol Martínezb, Georgios Meditskose, Simon Milleb, Wolfgang Minkerf, Louisa Pragstf, Dominik Schillerc, Andries Stamg, Ludo Stellingwerffg, Federico Suknob, Bianca Vieru, and Stefanos Vrochidise
aCatalan Institute for Research and Advanced Studies (ICREA); bPompeu Fabra University, C/Roc Boronat, 138, 08018 Barcelona, Spain; cUniversity of Augsburg, Universitätsstr. 6, 86159 Augsburg, Germany; dVocapia Research, 28 rue Jean Rostand, Parc Orsay Université, 91400 Orsay, France; eCenter for Research and Technologies Hellas (CERTH), 6th km Charilaou-Thermi Rd, P.O. Box 60361, 57001 Thermi, Thessaloniki, Greece; fUniversity of Ulm, Albert Einstein Allee 43, 89081 Ulm, Germany; gAlmende BV, Stationsplein 45, 3013 AK Rotterdam, The Netherlands
Abstract
We present work in progress on an intelligent embodied conversation agent that is intended to act as a social companion with linguistic and emotional competence in the context of basic care and healthcare. The core of the agent is an ontology-based knowledge model that supports flexible reasoning-driven conversation planning strategies. A dedicated search engine provides the background information from the web that is necessary for conducting a conversation on a specific topic. Multimodal communication analysis and generation modules analyze and generate, respectively, facial expressions, gestures and multilingual speech. The assessment of the prototypical implementation of the agent shows that users accept it as a natural and trustworthy conversation counterpart. For the final release, all involved technologies will be further improved and matured.
* Corresponding author. Tel.: +34 93 542 2241; fax: +34 93 542 2517.
E-mail address: leo.wanner@upf.edu
Keywords: conversation agent, dialogue, ontological knowledge model, multimodality, multilinguality, basic care, health care
1. Introduction
Social companionship in the geriatric context is crucial4,20,25,26. As long as the elderly were embedded in stable family structures in which several generations lived together, this companionship was ensured, at least to a certain degree. In modern society, however, older people increasingly live alone or in residences. As a result, family bonds risk weakening, such that the elderly no longer feel sufficiently attended to, lack affection and attention, and thus become increasingly socially isolated. The problem is particularly significant for elderly migrants, who are often not integrated into the society of the host country either. Intelligent conversation agents could help in that they could converse, entertain, coach, etc. However, for this task, they must be versatile, eloquent and knowledgeable, and possess a certain cultural, social and emotional competence. Not all of these characteristics are displayed by state-of-the-art conversation agents. Many of them focus on emotional (and, to a certain extent, social) competence5,18,19,30. Hardly any are versatile, eloquent and knowledgeable. Versatility presupposes flexibility in the conduct of a dialogue, but most agents follow a prescripted dialogue strategy and do not take into account that the course and content of a dialogue are also influenced by the culture of the human conversation counterpart. Eloquence presupposes full-fledged (multilingual) text generation, while most agents use predefined sentence templates for linguistic realization1,29. Knowledgeability presupposes a theoretically sound knowledge model over which the agent can reason, as well as the possibility to acquire further knowledge if required by a conversation; yet only very few agents are based on an ontological, expandable knowledge model and integrate an information search engine.
In our work in progress on a flexible knowledge-based conversation agent, we attempt to address some of the most significant limitations of the current state of the art. The agent is designed as an embodied companion for elderly migrants who face language and cultural barriers in the host country, and as a trusted information provider and mediator in questions related to basic care and healthcare.
2. Design of the Knowledge-Based Conversation Agent
The targeted agent (A) is expected to be able to conduct dialogues with users (U) such as the following one:
A: Why do you look so sad? What is wrong?
U: I feel sad; nobody has come to see me for two weeks.
A: Cheer up! What about going for a walk in the park?
U: How is the weather today?
A: The weather will be nice in general, but some showers cannot be excluded.
So don't forget to take an umbrella with you.
contexts of the users; (ii) be able to search for information in the web to either enrich its own knowledge repertoire or
to offer to the user requested content; (iii) understand and interpret the multimodal (facial, gestural and multilingual
verbal) communication signals of the user; (iv) plan the dialogue using ontology-based reasoning techniques
according to the prior interpretation of the user signals; (v) communicate with the user with multimodal
communication signals. The architecture of the agent in Fig. 1 reflects these objectives.
Fig. 1: Architecture of the knowledge-based conversation agent
In this architecture, the communication analysis modules are controlled by the Social Signal Interpretation (SSI)
framework27. The output produced by the communication analysis modules is projected onto genuine ontological
(OWL) graphs, fused and stored in the knowledge base (KB), which also receives input from a search engine that
retrieves background information from curated web sources. The dialogue manager (DM) is embedded into the Visual
Scene Maker (VSM) framework10, which, in addition, controls the agent’s turn-taking and its non-verbal idle
behavior17. The turn-taking strategy is based on a policy that determines whether the agent is allowed to interrupt the
user’s utterance and how it reacts when the user interrupts it. The agent displays idle behavior when it listens to the
user, for example, mimicking the user’s (positive) affective state or displaying different eye gazes17.
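To make the role of the turn-taking policy and the idle behavior more concrete, the following Python sketch shows one way such a rule set could look. It is purely illustrative: the module names, thresholds and rules (e.g., that the agent never barges in on the user) are our own assumptions and are not taken from VSM or SSI.

from dataclasses import dataclass
from enum import Enum, auto

class Speaker(Enum):
    USER = auto()
    AGENT = auto()
    NOBODY = auto()

@dataclass
class TurnState:
    current_speaker: Speaker
    user_affect_valence: float  # -1.0 (negative) .. 1.0 (positive)
    silence_ms: int             # time since the last detected speech

# Illustrative policy: never barge in on the user, yield when interrupted,
# and fill long silences with idle behavior that mirrors positive user affect.
def next_agent_action(state: TurnState) -> str:
    if state.current_speaker is Speaker.USER:
        return "listen"                      # no barge-in on the user
    if state.current_speaker is Speaker.AGENT and state.silence_ms == 0:
        return "continue_utterance"
    if state.silence_ms > 4000:
        return "take_turn"                   # long pause: the agent may speak
    if state.user_affect_valence > 0.3:
        return "idle_mimic_positive_affect"  # smile back, nod
    return "idle_neutral_gaze"

if __name__ == "__main__":
    print(next_agent_action(TurnState(Speaker.NOBODY, 0.5, 5000)))  # take_turn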
The DM requests from the KB possible reasoned reactions to the communication move of the user and chooses the
best in accordance with the analyzed move, the user’s emotion and culture and the recent dialogue history. The OWL
structures of the chosen reaction are passed by the DM to the fission and discourse planning module, which assigns
to the content elements of these structures the realization modalities (voice, face, and/or body gesture) and plans their
coherent and coordinated presentation. The three modality generation modules instantiate the form of the content elements assigned to them; their output is synchronized and realized so as to ensure coherent multimodal communication.
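The division of labor between the fission step and the three mode generators can be pictured with the following toy sketch; the element structure and the modality assignment heuristics are invented for illustration and do not reproduce the project's actual interfaces.

from typing import Dict, List

# A content element taken from the chosen OWL reaction graph
# (the keys are hypothetical; real elements are ontology individuals).
ContentElement = Dict[str, str]

def assign_modalities(elements: List[ContentElement]) -> Dict[str, List[ContentElement]]:
    """Toy fission step: route each content element to voice, face and/or gesture."""
    plan: Dict[str, List[ContentElement]] = {"voice": [], "face": [], "gesture": []}
    for el in elements:
        # Propositional content is always verbalized.
        plan["voice"].append(el)
        # Affective content additionally drives the facial channel.
        if el.get("type") == "emotion":
            plan["face"].append(el)
        # Activity- or location-related content may trigger a supporting gesture.
        if el.get("type") in ("activity", "location"):
            plan["gesture"].append(el)
    return plan

if __name__ == "__main__":
    reaction = [
        {"id": "suggestWalk1", "type": "activity", "text": "go for a walk in the park"},
        {"id": "cheerUp1", "type": "emotion", "text": "encouragement"},
    ]
    for modality, items in assign_modalities(reaction).items():
        print(modality, [e["id"] for e in items])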
Let us now briefly introduce the core modules. Verbal communication analysis implies speech recognition and
language analysis. For speech recognition, VoxSigma (http://www.vocapia.com/) is used. Language analysis is
realized as a sequence of processing stages that involves deep dependency parsing2, rule-based graph transduction6,
and ontology design patterns9. Cf. Fig. 2 for the intermediate structures of the ASR transcript I feel sad. Its knowledge
graph representation is a declarative statement that contains an instantiation of the ‘dul:Situation’ class, which
interprets the instances of the ':CareRecipient' and ':Sad' classes as the experiencer and the experienced emotion, respectively, of the instance of the event class ':Feel' (cf. Fig. 3).
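To give an impression of how such a knowledge-graph projection can be assembled programmatically, the following sketch builds a simplified graph analogous to the one in Fig. 3 below using the rdflib library; the namespace URIs are placeholders and the code is not the project's actual implementation.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

# Namespaces mirroring the prefixes used in Fig. 3 (URIs are placeholders).
DA  = Namespace("http://example.org/dialogue-acts#")
DUL = Namespace("http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#")
EX  = Namespace("http://example.org/kb#")

g = Graph()
g.bind("da", DA); g.bind("dul", DUL); g.bind("", EX)

# "I feel sad" as a declarative dialogue act containing a dul:Situation.
g.add((EX.declare1, RDF.type, DA.Declare))
g.add((EX.declare1, DA.containsSemantics, EX.feelCtx1))

g.add((EX.feelCtx1, RDF.type, DUL.Situation))
g.add((EX.feelCtx1, DUL.includesEvent, EX.feel1))
g.add((EX.feelCtx1, DUL.includes, EX.user1))
g.add((EX.feelCtx1, DUL.includes, EX.sad1))

g.add((EX.feel1, RDF.type, EX.Feel))            # the experiencing event
g.add((EX.user1, RDF.type, EX.CareRecipient))   # classified as Experiencer
g.add((EX.user1, DUL.classifiedBy, EX.Experiencer))
g.add((EX.sad1, RDF.type, EX.Sad))              # classified as Theme
g.add((EX.sad1, DUL.classifiedBy, EX.Theme))

print(g.serialize(format="turtle"))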
Fig. 2: Intermediate structures of language analysis
Fig. 3: OWL structure of the statement 'I feel sad':

:declare a da:Declare ;
  da:containsSemantics :feelCtx1 .
:feelCtx1 a dul:Situation ;
  dul:includes :user1 ; dul:includes :sad1 ; dul:includesEvent :feel1 ;
  dul:satisfies :feelDesc1 [a dul:Description] .
:feel1 a :Feel [rdfs:subClassOf dul:Event] ;
  dul:classifiedBy :Context [rdfs:subClassOf dul:Concept] .
:sad1 a :Sad [rdfs:subClassOf :Emotion] ;
  dul:classifiedBy :Theme [a dul:Concept] .
:user1 a :CareRecipient [rdfs:subClassOf dul:Person] ;
  dul:classifiedBy :Experiencer [rdfs:subClassOf dul:Concept] .

Facial expression analysis, gesture recognition and paralinguistic cue analysis have so far focused on the identification of the affective state of the user. For facial expression analysis, Action Units (AUs) from the Facial Action Coding
System8,23 are used. To determine the AUs, first SIFT-based features are extracted from automatically detected facial
landmarks, and subsequently linear classifiers are applied to assign probabilities to the targeted AUs. During gesture
analysis, the agitation of the user in terms of hand movements is detected using video frame filter masks. To obtain
paralinguistic affective cues, we use Wagner et al.’s model27. Extracted facial, gestural and paralinguistic cues are
combined using event fusion strategies15 and projected onto values in the valence/arousal space, which are then
represented in the KB.
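A much-simplified, illustrative version of such a projection onto the valence/arousal space is sketched below; the confidence-weighted averaging and the example values are our own assumptions and do not reproduce the event-driven fusion strategy of [15].

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AffectCue:
    source: str        # "face", "gesture" or "voice"
    valence: float     # -1.0 .. 1.0
    arousal: float     # -1.0 .. 1.0
    confidence: float  # 0.0 .. 1.0, e.g. a classifier probability

def fuse_cues(cues: List[AffectCue]) -> Tuple[float, float]:
    """Confidence-weighted average of per-modality valence/arousal estimates."""
    total = sum(c.confidence for c in cues)
    if total == 0.0:
        return 0.0, 0.0  # nothing observed: neutral point of the space
    valence = sum(c.valence * c.confidence for c in cues) / total
    arousal = sum(c.arousal * c.confidence for c in cues) / total
    return valence, arousal

if __name__ == "__main__":
    cues = [
        AffectCue("face", valence=-0.6, arousal=-0.2, confidence=0.8),   # e.g. AU-based "sad"
        AffectCue("voice", valence=-0.4, arousal=-0.3, confidence=0.5),  # paralinguistic cue
        AffectCue("gesture", valence=0.0, arousal=0.1, confidence=0.2),  # low agitation
    ]
    print(fuse_cues(cues))  # a point in the valence/arousal space, stored in the KB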
The agent’s KB contains ontologies that cover: (i) models for the representation, integration and interpretation of
the content of the user’s multimodal communication; (ii) background knowledge and user profile and behavior pattern
ontologies; and (iii) healthcare and medical ontologies22. The knowledge integration and interpretation models define
how the structures can be combined to derive high-level interpretations. For this, a lightweight ontology schema
models the types of relevant structures and their interpretation strategies by the reasoner. Fig. 4 depicts the vocabulary
for the interpretation of the user’s statement I feel sad and the complementary information from the visual input ‘low
mood’ encoded in terms of valence/arousal values21. As can be observed, the ontology extends the ‘leo:Event’ concept
of LODE24, taking advantage of existing vocabularies for description of events and observations. The relation between
observation types and context is modeled in terms of the ‘Context’ class. This allows for the introduction of one or
more ‘contains’ property assertions that refer to observations. The fact that the user is sad constitutes contextual
information that is modeled as an instance of ‘Context’, which is further associated with an instance of ‘Sad’.
Fig. 4: Observation and Context models
Fig. 5: Excerpt of the dialogue acts ontology
Fig. 4 furthermore contains an excerpt of the domain ontology used to infer feedback and suggestions based on the
emotional state of the user. For each context, one or more ‘suggestion’ property assertions can be defined and
associated with feedback instances that can improve the user's mood. In our example, 'Sadness' is a subclass of 'Context', defined in terms of the equivalence axiom Sadness ≡ Context ⊓ ∃contains.Sad. It also defines the property restriction Sadness ⊑ ∃suggestion.ImproveMood, which specifies the type of feedback needed when this emotional
context is detected. The ‘ctx1’ instance is classified in the ‘Sadness’ context class, which further inherits the restriction
about the potential feedback that could be given to improve the mood of the user. All three subclasses of the
‘ImproveMood’ concept are retrieved and sent back to the DM as possible system actions. The DM chooses the one
that is to be communicated to the user. The choice is grounded in the dialogue act typology shown in Fig. 5 as well as
in the user’s emotion and culture and the recent dialogue history. To avoid the predefinition of all user and system
actions and be able to handle arbitrary input from both the language analysis and the KB, the choice is defined over general features such as the respective dialogue act and the topics, which are constituted by the classes associated with the possible system actions. For instance, the three possible system actions from above share the dialogue act 'Statement'. However, the topics differ: the first action has the topics 'newspaper' and 'read', the second 'socialmedia' and 'read', and the third 'activity'. Individuals from a collectivistic culture tend to be more tightly integrated in their respective social groups, while individuals from an individualistic culture are less so12. Therefore, the DM would propose to a user with a collectivistic cultural background to read news from social media aloud, and would select one of the other options if the user's culture is individualistic.
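The selection among the ImproveMood suggestions can be thought of along the lines of the following sketch; the feature names, the numeric collectivism score and the scoring heuristic are illustrative assumptions rather than the actual DM implementation.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class SystemAction:
    name: str
    dialogue_act: str
    topics: Set[str] = field(default_factory=set)

def choose_action(actions: List[SystemAction],
                  collectivism: float,          # 0.0 (individualistic) .. 1.0 (collectivistic)
                  recent_topics: Set[str]) -> SystemAction:
    """Toy choice rule over general features (dialogue act, topics, culture, history)."""
    def score(a: SystemAction) -> float:
        s = 0.0
        # Social, shared content fits a collectivistic background better.
        if "socialmedia" in a.topics:
            s += collectivism
        if "activity" in a.topics:
            s += 1.0 - collectivism
        # Avoid repeating topics from the recent dialogue history.
        s -= 0.5 * len(a.topics & recent_topics)
        return s
    return max(actions, key=score)

if __name__ == "__main__":
    candidates = [
        SystemAction("ReadNewspaper", "Statement", {"newspaper", "read"}),
        SystemAction("ReadSocialMedia", "Statement", {"socialmedia", "read"}),
        SystemAction("SuggestActivity", "Statement", {"activity"}),
    ]
    # A collectivistic user profile favors reading news from social media aloud.
    print(choose_action(candidates, collectivism=0.9, recent_topics=set()).name)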
Once the appropriate system action has been determined by the DM, the fission module assigns to the individual
mode generation modules the content elements from the OWL graph that are to be expressed by the respective mode.
Language generation follows the inverse cascade of processing stages depicted for analysis; see Fig. 2 above. The
language generator consists of multilingual rule-based28 and statistical3 graph transduction components. As speech
generator, which takes as input the output provided by the language generator, we use CereProc’s TTS
(https://www.cereproc.com/). In parallel to the cascaded proposition realization model, a hierarchical prosodic model
is deployed, which captures prosody as a complex interaction of acoustic features at different phonological levels in
the utterance (i.e., prosodic phrases, prosodic words and syllables) and the information structure7.
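How a hierarchical prosodic model can combine information structure with phrase- and word-level features may be pictured with the following toy sketch; the pseudo-markup, the prominence values and the theme/rheme heuristic are invented for illustration and are unrelated to the actual CereProc interface or the model of [7].

from dataclasses import dataclass
from typing import List

@dataclass
class ProsodicWord:
    text: str
    thematicity: str   # "theme" (given information) or "rheme" (new information)

def prominence(word: ProsodicWord, is_phrase_final: bool) -> float:
    """Toy prominence score combining information structure and phrase position."""
    score = 0.4 if word.thematicity == "theme" else 0.8  # new information is stressed
    if is_phrase_final:
        score += 0.1  # the nuclear accent tends to fall late in the phrase
    return min(score, 1.0)

def annotate_phrase(words: List[ProsodicWord]) -> List[str]:
    """Produce a pseudo-markup with per-word prominence for the speech generator."""
    out = []
    for i, w in enumerate(words):
        p = prominence(w, is_phrase_final=(i == len(words) - 1))
        out.append(f'<w prom="{p:.1f}">{w.text}</w>')
    return out

if __name__ == "__main__":
    phrase = [ProsodicWord("the weather", "theme"),
              ProsodicWord("will be", "theme"),
              ProsodicWord("nice", "rheme")]
    print(" ".join(annotate_phrase(phrase)))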
For its non-verbal appearance, the agent is realized as an embodied virtual character. Cultural gestures and facial
expressions are generated according to the semantics of the message that is to be communicated. To facilitate the
required variety of facial expressions and avoid manual design of all possible expressions for each character, the
valence-arousal representation of emotions in the continuous 2D and 3D space is used11,13,14. Because of its parametric
nature, the valence-arousal space can be easily applied to a variety of faces.
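A minimal sketch of why the parametric nature of the space transfers easily across faces: a point (valence, arousal) is mapped to a handful of continuous facial controls that any character can expose. The control names and coefficients below are invented for the example and are not taken from the project.

from typing import Dict

def face_controls(valence: float, arousal: float) -> Dict[str, float]:
    """Map a point in the valence/arousal plane to generic facial control weights."""
    clamp = lambda x: max(0.0, min(1.0, x))
    return {
        # Positive valence pulls the mouth corners up, negative pulls them down.
        "smile":        clamp(valence),
        "mouth_down":   clamp(-valence),
        # Arousal opens the eyes and raises the brows; low arousal relaxes the face.
        "eye_openness": clamp(0.5 + 0.5 * arousal),
        "brow_raise":   clamp(arousal),
    }

if __name__ == "__main__":
    # "Low mood": negative valence, slightly negative arousal.
    print(face_controls(valence=-0.6, arousal=-0.2))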
3. Evaluation
The quality of the first prototypical implementation of the agent has been assessed in the context of three different use cases. In the first, the agent acts as a social companion for elderly people with a German or Turkish background; in the second, as an assistant for Polish caregivers who take care of German elderly; and in the third, as a healthcare adviser for migrants with a North African background. Qualitative evaluation trials have been run with respect to the agent's appearance, trustworthiness, competence, naturalness, friendliness, speech and language understanding and production quality, etc. The agent's appearance was rated as still too rigid and unnatural, which was to be expected given the preliminary design of the virtual character that embodies the agent. When asked whether the agent was proactive in addressing the user and whether the communicative goal of the agent was in general clear, better marks were achieved (3.23 and 3.25, respectively, on a five-point Likert scale, with '1' being the worst and '5' being the best). Also, most of the evaluators agreed that the agent provides the right amount of information when asked (3.27 on the Likert scale). In general, it can thus be assumed that even in its preliminary appearance the agent is considered to be a competent conversation partner.
For further illustration of the performance of the first prototype of the agent, Table 1 displays an excerpt of the questionnaire on the quality of language production by the agent.
Table 1: Evaluation of the language production competence of the first prototype of the agent: '1' = "disagree"; '5' = "completely agree"

Evaluation statement | Likert scale value (± SD)
The voice of the agent sounds natural | 2.81 (± 1.33)
The voice of the agent is expressive | 2.24 (± 1.12)
The statements uttered by the agent are perfectly understandable | 3.19 (± 0.98)
The language as used by the agent was perfectly grammatical | 2.77 (± 1.12)
The agent expresses itself accurately | 3.45 (± 1.03)
The agent talks coherently (in the case of a multi-sentential statement) | 3.43 (± 0.97)
As can be observed, the voice of the agent still needs to be improved. In particular, the prosody of the agent is perceived as monotonous in the case of multi-sentential discourse or when reading a newspaper. We are about to experiment with novel techniques for prosody enrichment. Grammaticality has also been rated as deficient, while the content of the discourse is already considered rather accurate. Obviously, all technologies still need to be improved considerably.
4. Conclusions
We presented the design and a preliminary evaluation of the first prototype of an intelligent embodied conversation agent that is intended to conduct socially competent and culture-sensitive multilingual conversations in the context of basic care and healthcare. In its current version, the agent is able to understand and communicate in German, Polish, and Spanish; in its next version, its language competence will be extended to also cover Arabic and Turkish. The results of the first round of evaluation are encouraging, although a clear need for further improvement of the individual technologies has been identified. The identified shortcomings are currently being addressed.
Acknowledgments
The presented work is funded by the European Commission as part of the H2020 Program under the contract
number 645012. We would like to thank our colleagues Chiara Baudracco, Jutta Mohr, Eylem Ög, Valérie Sarholz,
and Benjamin Schäfer for organizing and carrying out the evaluation trials and for their patience with the numerous
inadequacies of KRISTINA.
References
1. Anderson, K., André, E., Baur, T., Bernardini, S., Chollet, M., Chryssadou, E., Damian, I., Ennis, C., Egges, A., Gebhard, P., Jones, H., Ochs, M., Pelachaud, C., Porayska-Pomsta, K., Rizzo, P., Sabouret, N.: The TARDIS framework: Intelligent virtual agents for social coaching in job interviews. In: Reidsma, D., Katayose, H., Nijholt, A. (eds.) ACE, LNCS, vol. 8253. 2013; p. 476–491. Springer, Heidelberg.
2. Ballesteros, M., Bohnet, B., Mille, S., Wanner, L.: Data-driven deep-syntactic dependency parsing. Natural Language Engineering. 2016; 22(6):939–974.
3. Ballesteros, M., Bohnet, B., Mille, S., Wanner, L.: Data-driven sentence generation with non-isomorphic trees. In: Proceedings of the Conference of the NAACL: Human Language Technologies; 2015. p. 387–397.
4. Baldassare, M., Rosenfield, S., Rook, K.: The types of social relations predicting elderly well-being. Research on Aging. 1984; 6(4):549–559.
5. Baur, T., Mehlmann, G., Damian, I., Gebhard, P., Lingenfelser, F., Wagner, J., Lugrin, B., André, E.: Context-Aware Automated Analysis and Annotation of Social Human-Agent Interactions. ACM Transactions on Interactive Intelligent Systems. 2015; 5(2).
6. Bohnet, B., Wanner, L.: Open source graph transducer interpreter and grammar development environment. In: Proceedings of the International Conference on Language Resources and Evaluation; 2010.
7. Domínguez, M., Farrús, M., Burga, A., Wanner, L.: Using hierarchical information structure for prosody prediction in content-to-speech application. In: Proceedings of the 8th International Conference on Speech Prosody; 2016.
8. Ekman, P., Rosenberg, E.L.: What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA; 1997.
9. Gangemi, A.: The Semantic Web. In: Proceedings of the 4th International Semantic Web Conference; 2005, p. 262 27.
10. Gebhard, P., Mehlmann, G.U., Kipp, M.: Visual SceneMaker: A Tool for Authoring Interactive Virtual Characters. Journal of Multimodal User Interfaces: Interacting with Embodied Conversational Agents. Springer-Verlag. 2012; 6(1-2):3–11.
11. Gunes, H., Schuller, B.: Categorical and dimensional affect analysis in continuous input: Current trends and future directions. Image and Vision Computing. 2013; 31(2):120–136.
12. Hofstede, G.H., Hofstede, G.: Culture's consequences: Comparing values, behaviors, institutions and organizations across nations. Sage. 2001.
13. Hyde, J., Carter, E.J., Kiesler, S., Hodgins, J.K.: Assessing naturalness and emotional intensity: a perceptual study of animated facial motion. In: Proceedings of the ACM Symposium on Applied Perception. 2014; p. 15–22. ACM.
14. Hyde, J., Carter, E.J., Kiesler, S., Hodgins, J.K.: Using an interactive avatar's facial expressiveness to increase persuasiveness and socialness. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 2015; p. 1719–1728. ACM.
15. Lingenfelser, F., Wagner, J., André, E., McKeown, G., Curran, W.: An event driven fusion approach for enjoyment recognition in real-time. In: Proceedings of the Multimedia Conference. 2014; p. 377–386.
16. Mehlmann, G., Janowski, K., André, E.: Modeling Grounding for Interactive Social Companions. Journal of Artificial Intelligence: Social Companion Technologies. 2016; 30(1):45–52.
17. Mehlmann, G., Janowski, K., Baur, T., Häring, M., André, E., Gebhard, P.: Exploring a Model of Gaze for Grounding in HRI. In: Proceedings of the 16th International Conference on Multimodal Interaction. 2014; p. 247–254. ACM.
18. Ochs, M., Pelachaud, C.: Socially Aware Virtual Characters: The Social Signal of Smiles. IEEE Signal Processing Magazine. 2013; 30(2):128–132.
19. Pfeifer Vardoulakis, L., Ring, L., Barry, B., Sidner, C., Bickmore, T.: Designing relational agents as long term social companions for older adults. In: Proceedings of the 12th International Conference on Intelligent Virtual Agents. 2012.
20. Pickett, Y., Raue, P.J., Bruce, M.L.: Late-life depression in home healthcare. J Aging Health. 2012; 8(3):273–284.
21. Posner, J., Russell, J., Peterson, B.: The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development and psychopathology. Development and Psychopathology. 2005; 17(3).
22. Riaño, D., Real, F., Campana, F., Ercolani, S., Annicchiarico, R.: An ontology for the care of the elder at home. In: Proceedings of the 12th Conference on Artificial Intelligence in Medicine: Artificial Intelligence in Medicine. 2009; p. 235–239. AIME '09, Springer-Verlag, Berlin.
23. Savran, A., Sankur, B., Bilge, M.T.: Regression-based intensity estimation of facial action units. Image and Vision Computing. 2012; 30(10):774–784.
24. Shaw, R., Troncy, R., Hardman, L.: LODE: Linking open descriptions of events. In: Proceedings of the 4th Asian Conference on the Semantic Web. 2009; p. 153–167. Shanghai, China.
25. Sorkin, D., Rook, K.S., Lu, J.L.: Loneliness, lack of emotional support, lack of companionship, and the likelihood of having a heart condition in an elderly sample. Ann. Behav. Med. 2002; 24:290–298.
26. Vlachantoni, A., Shaw, R., Willis, R., Evandrou, M., Falkingham, J., Luf, R.: Measuring unmet need for social care amongst older people. Population Trends. 2011; 145:117.
27. Wagner, J., Lingenfelser, F., André, E.: Building a robust system for multimodal emotion recognition. In: Emotion Recognition: A Pattern Analysis Approach. 2015; p. 379–419. John Wiley & Sons, Hoboken, NJ.
28. Wanner, L., Bohnet, B., Bouayad-Agha, N., Lareau, F., Nicklass, D.: MARQUIS: Generation of user-tailored multilingual air quality bulletins. Applied Artificial Intelligence. 2010; 24(10):914–952.
29. Yasavur, U., Lisetti, C., Rishe, N.: Let's talk! Speaking virtual counselor offers you a brief intervention. Journal of Multimodal User Interfaces. 2014; 8(4):381–398.
30. Zeng, Z., Pantic, M., Roisman, G., Huang, T.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009; 31(1):39–58.
We developed a virtual counseling system which can deliver brief alcohol health interventions via a 3D anthropomorphic speech-enabled interface—a new field for spoken dialog interactions with intelligent virtual agents in the health domain. We present our spoken dialog system design and its evaluation. We developed our dialog system based on Markov decision processes framework and optimized it by using reinforcement learning algorithms with data we collected from real user interactions. The system begins to learn optimal dialog strategies for initiative selection and for the type of confirmations that it uses during the interaction. We compared the unoptimized system with the optimized system in terms of objective measures (e.g. task completion) and subjective measures (e.g. ease of use, future intention to use the system) and obtained positive results.