THERADIA: Digital Therapies Augmented by Artificial Intelligence
Franck Tarpin-Bernard1, Joan Fruitet1, Jean-Philippe Vigne2, Patrick Constant3, Hanna Chainay4, Olivier Koenig4, Fabien Ringeval5, Béatrice Bouchot2, Gérard Bailly6, François Portet5, Sina Alisamir5, Yongxin Zhou5, Jean Serre2, Vincent Delerue2, Hippolyte Fournier4, Kévin Berenger4, Isabella Zsoldos4, Olivier Perrotin6, Frédéric Elisei6, Martin Lenglet6, Charles Puaux3, Léo Pacheco3, Mélodie Fouillen1, Didier Ghenassia1
1 SBT, Lyon, France {firstname.name}@sbt-human.com
2 Atos, Echirolles, France {firstname.name}@atos.net
3 Pertimm, Paris, France {firstname.name}@pertimm.com
4 EMC, Lyon 2 Univ., Bron, France {firstname.name}@univ-lyon2.fr
5 LIG, Grenoble-Alps Univ., Grenoble, France {firstname.name}@imag.fr
6 GIPSA-Lab, Grenoble-Alps Univ., Grenoble, France {firstname.name}@gipsa-lab.fr
Abstract. Digital technology plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. For instance, interactive stimulation exercises can be offered to people suffering from cognitive disorders resulting from developmental disorders, neurodegenerative diseases, stroke, or trauma. The effectiveness of these new treatments, which are still primarily delivered face-to-face by therapists, can be greatly improved if patients can pursue them at home. However, patients are then left to their own devices, which can be problematic. We introduce THERADIA, a 5-year project that aims to develop an empathic virtual agent that accompanies patients while they receive digital therapies at home, and that provides feedback to therapists and caregivers. We detail the architecture of our agent as well as the framework of our Wizard-of-Oz protocol, designed to collect a large corpus of interactions between people and our virtual assistant in order to train our models and improve our dialogues.
Keywords: Healthcare, Cognitive Disorders, Digital Therapies, Artificial Intelligence
Introduction
Today's medicine is increasingly based on a Preventive, Personalised, Participatory and Predictive approach (a.k.a. 4Ps medicine), and digital technology plays a key role in this transformation. Beyond the simple computerisation of healthcare systems, many non-drug treatments such as digital therapies are now possible thanks to the progress made in both digital technology and artificial intelligence (AI). One such example is cognitive remediation, a digital therapy based on interactive and configurable stimulation exercises offered to people suffering from cognitive disorders [1][2]. The effectiveness of these new treatments, which are still mainly delivered face-to-face by therapists, can be greatly improved if patients are able to pursue them at home, as this increases clinical efficiency without requiring visits to the hospital [3][4]. However, conducting some sessions autonomously, where patients are left to their own devices, can make adherence to treatment difficult, which is a major issue [5].
One possible way to increase adherence to treatment is to accompany patients through-
out the course of therapy. As cognitive remediation exercises are presented to patients
through an application running on digital devices, it is possible to design and integrate
a virtual agent that acts as an empathic assistant supporting progress throughout the
digital therapy.
This is the purpose of THERADIA, a 5-year project structured in three main phases: (i) initialisation: architecture design and first implementations of the components, (ii) collection of real data and optimisation: training of AI models, improvements of technical components, and enrichment of dialogues, and (iii) clinical study: validation of the efficiency of the assisted digital therapy on patients with medical conditions. After one year of work, the initialisation phase led to the design of a Wizard-of-Oz experiment. In the remainder of this article, we present related work on virtual agents as personal healthcare assistants, then introduce the THERADIA project and our virtual agent, as well as the Wizard-of-Oz framework designed to collect pertinent interaction data, before giving concluding remarks.
Related Work: Virtual Agents as Personal Healthcare Assistants
There is a plethora of virtual agents that act as personal healthcare assistants on digital devices, even for virtual worlds' residents [6]. A recent systematic literature survey on the use of conversational agents or chatbots in psychiatry for the screening, diagnosis, and treatment of mental illness has shown the benefits of virtual assistants for psychoeducation and self-adherence, with a high level of satisfaction reported by users, suggesting that they would be an effective and enjoyable tool in psychiatric treatment [7].
Virtual assistants have been able to perform speech recognition and synthesis for producing and understanding multimodal forms of expression for at least two decades [8]. Recent advances in artificial intelligence have made it possible to identify information related to health and affect [9][10], and to generate emotional expressions and attitudes [11][12]. In order for a virtual assistant to be truly effective in supporting patients throughout the therapy, it must be able to adapt to different interlocutors and their expressiveness, and therefore be able to detect and synthesise different conversation styles.
THERADIA: an Empathic Virtual Assistant to Accompany Digital Therapies
The THERADIA project brings together a consortium of academic and industrial researchers that aims to develop a virtual therapeutic assistant acting as the relay and interface between the patient, the therapist, and the caregivers. Following the path of affective computing, we believe that such virtual assistants should not only monitor the emotions and general well-being of patients throughout the digital therapy, but also respond to them in an appropriate manner [13]. In addition, we believe that feedback about the progress made or the issues faced by the patient during the digital therapy is needed, with specific information given either to caregivers or therapists. We have designed the architecture of such an empathic virtual agent, which encompasses several modules that are sketched in Fig. 1 and detailed below. The system acts as a videoconference, with specific data-driven modules to analyse the expressions of the patient and to generate appropriate responses from the virtual agent.
The first level of modules extracts information from the patient's audiovisual stream; the dialogue management then selects an appropriate interaction among multiple scenarios, which is finally played by the 3D female avatar and streamed to the patient's browser as an audiovisual stream. A standard session is structured in five main phases (a sketch of this flow, in code, follows the list):
- welcoming dialogue: during the first session, this dialogue helps the assistant collect information about the patient. In subsequent sessions, the dialogue checks the patient's motivation and mood before starting the training.
- exercise introduction: depending on whether the patient is being introduced to a new exercise for the first time or not, this dialogue either presents the gameplay and the goal of the task, or recalls feedback from previous performance.
- during exercise: the assistant is not visible, remains passive, and lets the patient run the task. It only monitors the patient's attention or detects special events that would require stopping the task.
- result analysis: at the end of each exercise, the assistant positively comments on the patient's performance and rewards the effort. Then, a new exercise is launched until the training session, which usually lasts 30 to 45 minutes, is over.
- debriefing dialogue: at the end of the session, the assistant rewards the patient for completing the session and sets an appointment for the next session.
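For illustration, the following Python sketch outlines this five-phase session flow; the Phase enumeration, the assistant interface, and the time budget are hypothetical names introduced here, not components of the actual THERADIA implementation.

```python
# Minimal sketch of the five-phase session flow described above.
# Class and method names (Phase, assistant.dialogue, assistant.monitor)
# are illustrative, not part of the THERADIA codebase.
from enum import Enum, auto


class Phase(Enum):
    WELCOME = auto()
    EXERCISE_INTRO = auto()
    EXERCISE = auto()
    RESULT_ANALYSIS = auto()
    DEBRIEFING = auto()


def run_session(exercises, assistant, max_duration_min=45):
    """Welcome the patient, then loop over intro/exercise/result phases
    until the exercise list or the time budget is exhausted, and debrief."""
    assistant.dialogue(Phase.WELCOME)
    elapsed = 0.0
    for exercise in exercises:
        assistant.dialogue(Phase.EXERCISE_INTRO, exercise=exercise)
        # The assistant stays passive during the exercise and only monitors it.
        elapsed += assistant.monitor(Phase.EXERCISE, exercise=exercise)
        assistant.dialogue(Phase.RESULT_ANALYSIS, exercise=exercise)
        if elapsed >= max_duration_min:
            break
    assistant.dialogue(Phase.DEBRIEFING)
```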
Audio, Visual, and Textual Descriptors
Continuous audiovisual descriptors are extracted from state-of-the-art representations such as Mel-Frequency Cepstral Coefficients (MFCCs) for audio, and Facial Action Units (FAUs) for video, but also from self-supervised representations [14][15]. Once the speech stream is segmented by the content process layer, a speech recognition module performs its transcription, which is then processed to extract linguistic descriptors.
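As an illustration of the audio branch, the following sketch extracts MFCCs with the librosa library; the sampling rate and number of coefficients are arbitrary choices, and the FAU and self-supervised features would come from separate toolkits.

```python
# Illustrative extraction of frame-level MFCC descriptors with librosa.
# The 16 kHz sampling rate and 13 coefficients are arbitrary placeholder values.
import librosa


def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    """Return an (n_mfcc, n_frames) matrix of Mel-Frequency Cepstral Coefficients."""
    signal, sr = librosa.load(wav_path, sr=16000)  # load audio as mono, 16 kHz
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
```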
Fig. 1. Flowchart of the THERADIA system designed for accompanying patients suffering from cognitive disorders when completing digital therapies; CPL: content process layer, STT: speech-to-text, AVTTS: audiovisual text-to-speech synthesis.
Content Process Layer
The aim of the content process layer is to incrementally detect turn-taking opportunities [16] using both linguistic and prosodic features [17], so that the agent can lead the conversation and stick to the dialogic objectives, especially when it has to give back the floor after initiating an open-ended question. This component is coordinated with the "active listening" component, which is responsible for providing incentives and feedback at key instants of the interlocutor's speaking time.
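The following toy example illustrates how a prosodic cue (silence duration) and a simple linguistic cue could be combined to detect a turn-taking opportunity; the thresholds and the completeness test are placeholders, not the rules used in THERADIA.

```python
# Toy end-of-turn detector combining a prosodic cue (duration of the current
# silence) with a crude linguistic cue (does the partial transcript look
# syntactically complete?). Thresholds are illustrative placeholders.
def is_turn_yield(silence_ms: float, partial_transcript: str) -> bool:
    looks_complete = partial_transcript.rstrip().endswith((".", "?", "!"))
    if silence_ms > 700 and looks_complete:
        return True   # clear opportunity for the agent to take the floor
    if silence_ms > 1500:
        return True   # very long pause: take the floor even mid-sentence
    return False
```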
Dialog Management
The Dialog Management is composed of a Natural Language Understanding (NLU) module that interprets the patient's speech and emotions and passes them to the dialogue engine, which runs encoded dialogues through the viky.ai platform. Interaction scenarios are defined by a group of speech therapists experienced in running cognitive remediation sessions, using a dialogue editor.
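For illustration, the sketch below shows the kind of structured result an NLU module could pass to a dialogue engine; the field and scenario names are assumptions and do not reflect the actual viky.ai interface.

```python
# Illustrative structure of what an NLU module could pass to the dialogue
# engine; field names are assumptions, not the actual viky.ai interface.
from dataclasses import dataclass, field


@dataclass
class NLUResult:
    text: str                                     # transcribed patient utterance
    intent: str                                   # e.g. "ask_repeat", "report_fatigue"
    entities: dict = field(default_factory=dict)  # slots extracted from the utterance
    emotion: dict = field(default_factory=dict)   # e.g. {"valence": -0.3, "arousal": 0.6}


def route(result: NLUResult, scenarios: dict):
    """Select the scripted dialogue branch matching the detected intent,
    falling back to a generic clarification branch."""
    return scenarios.get(result.intent, scenarios["clarify"])
```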
Detection of Emotions, Mood, and other Mental States
Emotional AI is mainly driven by either categorical labels of emotion [18] or dimensional representations of core affect [19]. Appraisal theories suggest, however, that these approaches are too reductive to conceptualise the complexity of the range of human emotions. According to the Component Process Model (CPM) [20], emotions can be distinguished by sequences of cognitive appraisals, where the apparent motor expressions produced by individuals (facial expressions, voice prosody) contain key markers of the appraisal sequences [21]. Our emotion recognition system is based on Recurrent Neural Networks that model the contextual dependencies between signals and labels as defined in the CPM.
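A minimal sketch of such a recurrent model is given below (in PyTorch); the feature dimension, the number of layers, and the number of predicted appraisal-related dimensions are placeholders.

```python
# Minimal sketch of a recurrent emotion model: a GRU over frame-level
# audiovisual descriptors with a frame-wise linear head, e.g. one output per
# appraisal-related dimension. All dimensions are placeholder values.
import torch
import torch.nn as nn


class RecurrentEmotionModel(nn.Module):
    def __init__(self, n_features: int = 88, hidden: int = 64, n_dims: int = 4):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_dims)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) sequence of descriptors
        h, _ = self.rnn(x)        # contextualised representation per frame
        return self.head(h)       # (batch, time, n_dims) continuous predictions
```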
Expressive Audiovisual Text-To-Speech Synthesis
Typical text-to-speech synthesisers are based on end-to-end approaches where the synthetic speech signal is produced in two steps: a sequence-to-sequence model (SSM) maps a sequence of characters to a sequence of mel-spectrogram frames (e.g., the Tacotron2 encoder-attention-decoder framework [22]), which a neural vocoder further maps to an acoustic signal. Our system is based on a state-of-the-art TTS trained on 100 hours of audiobooks, which is being extended to include: (i) joint prediction of spectrograms and animation parameters using multi-head attention, (ii) expressive embeddings [23], and (iii) behavioral alignment by biasing part of the expressive embeddings with features of the conversational partner [24].
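The two-step data flow can be sketched as follows, assuming pre-trained acoustic_model and vocoder modules; this illustrates the character-to-mel-to-waveform pipeline described above, not the actual THERADIA synthesiser.

```python
# Sketch of the two-step synthesis pipeline: characters -> mel-spectrogram
# (sequence-to-sequence acoustic model) -> waveform (neural vocoder).
# `acoustic_model`, `vocoder` and `char_to_id` stand for trained/prepared
# components and are not the actual THERADIA modules.
import torch


def synthesise(text: str, char_to_id: dict, acoustic_model, vocoder) -> torch.Tensor:
    ids = torch.tensor([[char_to_id[c] for c in text.lower() if c in char_to_id]])
    mel = acoustic_model(ids)   # (1, n_mels, n_frames) mel-spectrogram frames
    return vocoder(mel)         # (1, n_samples) synthetic acoustic signal
```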
Automatic Feedback Generation
Based on examples provided by domain experts, we aim to automatically generate reports summarising the cognitive remediation carried out at home by patients. This task can be decomposed into the following steps: (i) identify, aggregate, select and structure the relevant information to communicate, (ii) transform this structured information into a coherent multimedia document, and (iii) adapt the production to the generation criteria, such as the type of recipient (therapist or caregiver), the period to be summarised, and the length of the summary.
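The sketch below illustrates these three steps with a purely template-based renderer; the session fields and the wording of the summaries are placeholders, not the reports produced by the actual system.

```python
# Illustrative implementation of the three feedback-generation steps with a
# purely template-based renderer; session fields and wording are placeholders.
def generate_report(sessions, recipient="therapist", period_days=7):
    # (i) identify, aggregate and select the relevant information
    recent = [s for s in sessions if s["days_ago"] <= period_days]
    n_exercises = sum(s["exercises_done"] for s in recent)
    mean_score = sum(s["score"] for s in recent) / max(len(recent), 1)

    # (ii) structure it into a simple document model
    facts = {"sessions": len(recent), "exercises": n_exercises,
             "score": round(mean_score, 1)}

    # (iii) adapt the wording to the recipient
    if recipient == "therapist":
        return (f"{facts['sessions']} sessions over {period_days} days, "
                f"{facts['exercises']} exercises completed, "
                f"mean normalised score {facts['score']}.")
    return (f"Your relative completed {facts['exercises']} exercises in "
            f"{facts['sessions']} training sessions. Keep encouraging them!")
```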
Experimental Data Collection Using a Wizard-of-Oz
The collection of in situ data from the population of interest is essential for the development of the THERADIA virtual assistant: it provides ground-truth behaviors of patients interacting with a simulated embodied conversational agent, together with a near-optimal control policy of the agent, provided by the cognitive skills of the human pilot. Such data are essential for feeding the multiple trainable components of the autonomous system (emotion detection, STT, TTS, etc.). We are particularly interested in coadaptation mechanisms, both short-term (using sets of pre-defined interaction profiles) and long-term (using continuous training).
The virtual assistant is driven in real time by a human pilot, who is filmed by a high-quality camera, cf. Fig. 2. Head movements, gaze, speech, and articulation are captured to drive the 3D avatar, whose rendering is cast to the patient's screen. As introduced earlier, the virtual assistant does not interfere with the exercising: it intervenes before, between and after the exercises to provide the patient with instructions, rewards and feedback. The human pilot's interventions are scripted by the dialogue system, so that the recorded conversations stick to what the automatic system is capable of, thanks to a teleprompter: dialogue acts and expected responses of the subject are overlaid onto the pilot's screen, where the subject's face is displayed. Alternatively, the pilot's screen displays the patient's screen while he/she is exercising, in order to ground further interventions. The system logs timestamped dialogue states and continuously monitors the behaviors of both conversational partners: we use Dynamixyz® technology to monitor the facial movements of the human pilot and the patient. A gaze correction is automatically applied to the pilot's eye movements to enable eye contact between the avatar and the patient. Each exercising session delivers three videos, coming from the human pilot's camera, the patient's camera and the patient's screen, plus tracked audiovisual features of both partners, and timestamped logs of the exercising along with conversation switches and dialogues.
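As an illustration, a timestamped log entry could be serialised as one JSON line per event, as sketched below; the schema is an assumption introduced here for clarity, not the actual THERADIA log format.

```python
# Illustrative serialisation of a timestamped Wizard-of-Oz event (dialogue
# state change, exercise start/stop, improvisation, ...) as one JSON line.
# The schema is an assumption introduced here for clarity.
import json
import time


def log_event(log_file, session_id: str, event_type: str, payload: dict) -> None:
    entry = {
        "t": time.time(),       # UNIX timestamp of the event
        "session": session_id,
        "type": event_type,     # e.g. "dialogue_state", "exercise_start"
        "payload": payload,
    }
    log_file.write(json.dumps(entry) + "\n")
```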
Fig. 2. A human pilot (left) interacts with a patient (right) through a virtual assistant (on screen in the right image) that is automatically animated based on the facial expressions of the pilot¹.
Conclusion
We have introduced the THERADIA project, whose aim is to endow a system for autonomous cognitive remediation with a conversational agent capable of providing social presence, coaching and support when necessary. The challenges facing the implementation and the long-term acceptability of such a technology are numerous. The Wizard-of-Oz system fulfills a twofold purpose: iteratively assess the scripted dialogues, and provide the trainable components (turn management, dialogue, active listening, TTS, STT, etc.) with ground-truth behaviors. "What to do" (saying being the main but not the only action the system can perform) and "how to do it" are dual problems for the agent. The human pilot continuously identifies failures of the dialogue management, and her improvisations are systematically analysed in order to improve the ability of the virtual assistant to appropriately act and interact with the patient. The final system has been designed as a videoconference: we will soon be able to monitor in-the-wild interactions, with home environments and on-demand exercising. The system also has to be extended to handle all stakeholders (therapists and caregivers) and to incorporate all experiences into its episodic memory. A final challenge is to explore the impact of this assisted autonomous exercising on the health ecosystem.
¹ See https://www.theradia.fr/AHFE2021 for more pictures and videos.
Acknowledgments
This research has received funding from the Banque Publique d'Investissement (BPI) under grant agreement THERADIA, and from the Association Nationale de la Recherche et de la Technologie (ANRT) under grant agreement No. 2019/0729, and has been partially supported by MIAI@Grenoble-Alpes (ANR-19-P3IA-0003).
References
1. Joubert, C., & Chainay, H. (2018). Aging brain: the effect of combined cognitive and physical training on cognition as compared to cognitive and physical training alone - a systematic review. Clinical Interventions in Aging, 13, 1267-1301.
2. Klimová, B., & Vališ, M. (2018). Smartphone applications can serve as effective cognitive training tools in healthy aging. Frontiers in Aging Neuroscience, 9.
3. van der Linden, S., Sitskoorn, M.M, Rutten, G-J.M., & Gehring, K. (2018). Feasibility of the
evidence-based cognitive telerehabilitation program Remind for patients with primary brain tumors.
Journal of Neuro-Oncology, 137, 523-532.
4. Wilms, I.L. (2020). The computerized cognitive training alliance - A proposal for an alliance model for home-based computerized cognitive training. Heliyon, 6, e03254.
5. Turunen, M., Hokkanen, L., Bäckman, L., Stigsdotter-Neely, A., Hänninen, T., Paajanen, T., Soininen, H., Kivipelto, M., & Ngandu, T. (2019). Computer-based cognitive training for older adults: determinants of adherence. PLoS ONE, 14(7), e0219541.
6. Kethuneni, S., August, S.E., Ian Vales, J. (2009). Personal health care assistant/companion in virtual
world. Association for the Advancement of Artificial Intelligence (AAAI), Fall Symposium Series.
7. Vaidyam, A.N., Wisniewski, H., Halamka, J.D., Kashavan, M.S., & Torous, J.B. (2019). Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Canadian Journal of Psychiatry, 64(7), 456-464.
8. Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied conversational agents. MIT press.
9. Cummins, N., Baird, A., & Schuller, B. W. (2018). Speech analysis for health: Current state-of-the-
art and the increasing impact of deep learning. Methods, 41-54.
10. Ringeval F. et al. (2019). AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting
Depression with AI, and Cross-Cultural Affect Recognition. International Workshop on
Audio/Visual Emotion Challenge, AVEC'19, Nice, France.
11. Swerts, M., & Krahmer, E. (2005). Audiovisual prosody and feeling of knowing. Journal of Memory
and Language, 81-94.
12. Barbulescu, A., Ronfard, R., & Bailly, G. (2017). A generative audio-visual prosodic model for
virtual actors. IEEE computer graphics and applications, 37(6), 40-51.
13. Picard, R. W. (2000). Affective computing. MIT press.
14. Khare, A., Parthasarathy, S., & Sundaram, S. (2020) Self-Supervised learning with cross-modal
transformers for emotion recognition. arXiv preprint arXiv:2011.10652.
15. Siriwardhana, S., Reis, A., Weerasekera, R., & Nanayakkara, S. (2020). Jointly Fine-Tuning "BERT-like" Self-Supervised Models to Improve Multimodal Speech Emotion Recognition. arXiv preprint arXiv:2008.06682.
16. Thórisson, K. R. (2002). Natural turn-taking needs no manual: Computational theory and model,
from perception to action. In Multimodality in language and speech systems (pp. 173-207). Springer,
Dordrecht.
17. Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language, 67, 101178.
18. Ekman, P. (1992). Facial expressions of emotion: New findings, new questions.
19. Russell, J. A. (1997). Reading emotions from and into faces: Resurrecting a dimensional-contextual perspective. In J. A. Russell & J. M. Fernández-Dols (Eds.), Studies in emotion and social interaction. The psychology of facial expression (pp. 295-320). CUP.
20. Scherer, K. R. (2009). The dynamic architecture of emotion: Evidence for the component process
model. Cognition and emotion, 23(7), 1307-1351.
21. Scherer, K. R., Dieckmann, A., Unfried, M., Ellgring, H., & Mortillaro, M. (2019). Investigating
appraisal-driven facial expression and inference in emotion communication. Emotion, 21(1), 73.
22. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783.
23. Tits, N., Wang, F., El Haddad, K., Pagel, V., & Dutoit, T. (2019). Visualization and Interpretation
of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis. Interspeech,
4475-4479.
24. Stanton, D., Wang, Y., & Skerry-Ryan, R. J. (2018). Predicting expressive speaking style from text
in end-to-end speech synthesis. IEEE Spoken Language Technology Workshop (SLT), 595-602.