THERADIA: Digital Therapies augmented by Artificial
Intelligence
Franck Tarpin-Bernard1, Joan Fruitet1, Jean-Philippe Vigne2, Patrick Constant3, Hanna
Chainay4, Olivier Koenig4, Fabien Ringeval5, Béatrice Bouchot2,
Gérard Bailly6, François Portet5, Sina Alisamir5, Yongxin Zhou5, Jean Serre2,
Vincent Delerue2, Hippolyte Fournier4, Kévin Berenger4, Isabella Zsoldos4,
Olivier Perrotin6, Frédéric Elisei6, Martin Lenglet6, Charles Puaux3, Léo Pacheco3, Mélodie
Fouillen1, Didier Ghenassia1
1 SBT, Lyon, France {firstname.name}@sbt-human.com
2 Atos, Echirolles, France {firstname.name}@atos.net
3 Pertimm, Paris, France {firstname.name}@pertimm.com
4 EMC, Lyon 2 Univ., Bron, France {firstname.name}@univ-lyon2.fr
5 LIG, Grenoble-Alps Univ., Grenoble, France {firstname.name}@imag.fr
6 GIPSA-Lab, Grenoble-Alps Univ., Grenoble, France {firstname.name}@gipsa-lab.fr
Abstract. Digital technology plays a key role in the transformation of medicine. Beyond the
simple computerisation of healthcare systems, many non-drug treatments are now
possible thanks to digital technology. Thus, interactive stimulation exercises can
be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases, stroke, or trauma. The efficiency of these
new treatments, which are still primarily offered face-to-face by therapists, can
be greatly improved if patients can pursue them at home. However, patients are then left to their own devices, which can be problematic. We introduce THERADIA, a 5-year project that aims to develop an empathic virtual agent that accompanies patients while they receive digital therapies at home, and that provides feedback to
therapists and caregivers. We detail the architecture of our agent as well as the
framework of our Wizard-of-Oz protocol, designed to collect a large corpus of
interactions between people and our virtual assistant in order to train our models
and improve our dialogues.
Keywords: Healthcare, Cognitive Disorders, Digital Therapies, Artificial Intelligence
Introduction
Today’s medicine is increasingly based on a Preventive, Personalised, Participatory and Predictive approach (a.k.a. 4Ps medicine), a transformation in which digital technology plays a key role. Beyond the simple computerisation of healthcare systems, many non-drug treatments such as digital therapies are now possible thanks to the progress made in both digital technology and artificial intelligence (AI). One such example is cognitive remediation, a digital therapy based on interactive and configurable stimulation exercises offered to people suffering from cognitive disorders [1][2]. The efficiency of these new treatments, which are still mainly offered face-to-face by therapists, can be greatly improved if patients are able to pursue them at home, as it increases clinical efficiency without requiring hospital visits [3][4]. However, conducting some sessions autonomously, with patients left to their own devices, can make adherence to treatment difficult, which is a major issue [5].
One possible way to increase adherence to treatment is to accompany patients through-
out the course of therapy. As cognitive remediation exercises are presented to patients
through an application running on digital devices, it is possible to design and integrate
a virtual agent that acts as an empathic assistant supporting progress throughout the
digital therapy.
This is the purpose of THERADIA, a 5-year project structured in three main phases: (i) initialisation — architecture design and first implementations of components; (ii) collection of real data and optimisation — training of AI models, improvements of technical components, and enrichment of dialogues; and (iii) clinical study — validation of the efficiency of the assisted digital therapy on patients with medical conditions. After one year of work, the initialisation phase led to the design of a Wizard-of-Oz experiment.
In the remainder of this article, we present related work on virtual agents as personal
healthcare assistants, then introduce the THERADIA project and our virtual agent, as
well as the Wizard-of-Oz framework designed to collect pertinent interaction data, be-
fore giving concluding remarks.
Related Work: Virtual Agents as Personal Healthcare Assistants
There is a plethora of virtual agents that act as personal healthcare assistants on digital
devices, even for virtual worlds’ residents [6]. A recent systematic literature survey on
the use of conversational agents or chatbots in the field of psychiatry for the screening, diagnosis, and treatment of mental illness has shown the benefits of virtual assistants in psychoeducation and self-adherence, with a high level of satisfaction reported by users, suggesting that they would be an effective and enjoyable tool in psychiatric treatment [7].
Virtual assistants have been able to perform speech recognition and synthesis — for producing and understanding multimodal forms of expression — for at least two decades [8]. Recent advances in artificial intelligence have made it possible to identify
information related to health and affect [9][10], and to generate emotional expressions
and attitudes [11][12]. In order for a virtual assistant to be truly effective in supporting
patients throughout the therapy, it must be able to adapt to the different interlocutors
and their expressiveness, and therefore be able to detect and synthesise different con-
versation styles.
THERADIA: an Empathic Virtual Assistant to Accompany Digital
Therapies
The THERADIA project brings together a consortium of academic and industrial researchers that aims to develop a virtual therapeutic assistant acting as a relay and interface between the patient, the therapist, and the caregivers. Following
the path of affective computing, we believe that such virtual assistants should be able
to not only monitor the emotions and general well-being of patients throughout the dig-
ital therapy, but also respond to them in an appropriate manner [13]. In addition, we
believe that feedback about the progress or issues faced by the patient during the digital
therapy is needed, with specific information given either to caregivers or therapists. We
have designed the architecture of such an empathic virtual agent, which encompasses
several modules that are sketched in Fig. 1 and detailed below. The system operates as a videoconference, with specific data-driven modules to analyse the expressions of the
patient and generate appropriate responses from the virtual agent.
The first level of modules extracts information from the patient's audiovisual stream; the dialogue management then selects an appropriate interaction among multiple scenarios, which is finally played by the 3D female avatar and streamed to the patient's browser as an audiovisual stream. A standard session is structured in five main phases (a minimal state-machine sketch is given after the list):
- welcoming dialogue: during the first session, this dialogue helps the assistant collect information about the patient. In subsequent sessions, the dialogue checks the patient's motivation and mood before starting the training.
- exercise introduction: depending on whether the patient encounters a given exercise for the first time, this dialogue either introduces the gameplay and the goal of the task or recalls feedback from previous performance.
- during exercise: the assistant is not visible, remains passive, and lets the patient run the task. It only monitors the patient's attention or detects special events that would require stopping the task.
- result analysis: at the end of each exercise, the assistant positively comments on the patient's performance and rewards the effort. Then a new exercise is launched until the training session, which usually lasts 30 to 45 minutes, is over.
- debriefing dialogue: at the end of the session, the assistant rewards the patient for completing the session and schedules the next one.
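For illustration, this flow can be viewed as a small state machine over the five phases. The sketch below is a minimal Python illustration under assumed names (SessionPhase, next_phase); it is not the THERADIA implementation.

```python
from enum import Enum, auto

class SessionPhase(Enum):
    """Hypothetical states mirroring the five phases of a session."""
    WELCOME = auto()          # collect information / check motivation and mood
    EXERCISE_INTRO = auto()   # present gameplay or recall previous feedback
    EXERCISE = auto()         # assistant hidden, passive monitoring only
    RESULT_ANALYSIS = auto()  # positive comments and rewards
    DEBRIEFING = auto()       # final reward and next appointment

def next_phase(phase: SessionPhase, session_done: bool) -> SessionPhase:
    """Advance through a session; chain exercises until the session is over."""
    if phase is SessionPhase.WELCOME:
        return SessionPhase.EXERCISE_INTRO
    if phase is SessionPhase.EXERCISE_INTRO:
        return SessionPhase.EXERCISE
    if phase is SessionPhase.EXERCISE:
        return SessionPhase.RESULT_ANALYSIS
    if phase is SessionPhase.RESULT_ANALYSIS:
        # a 30-45 minute session chains several exercises before debriefing
        return SessionPhase.DEBRIEFING if session_done else SessionPhase.EXERCISE_INTRO
    return SessionPhase.DEBRIEFING
```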
Audio, Visual, and Textual Descriptors
Continuous audiovisual descriptors are extracted from state-of-the-art representations
such as Mel-Frequency Cepstral Coefficients (MFCCs) for audio, and Facial Action
Units (FAUs) for video, but also from self-supervised representations [14][15]. Once
the speech stream is segmented by the content process layer, a speech recognition mod-
ule performs its transcription, which is then processed to extract linguistic descriptors.
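As an illustration of the low-level audio descriptors mentioned above, the snippet below extracts frame-level MFCCs with the librosa library. It is a generic sketch, not the project's actual feature pipeline; the file name and the configuration (13 coefficients, 25 ms windows, 10 ms hop) are assumptions.

```python
import librosa
import numpy as np

# Load a mono audio segment at 16 kHz (path is a placeholder).
signal, sr = librosa.load("patient_turn.wav", sr=16000, mono=True)

# Frame-level MFCCs over 25 ms windows with a 10 ms hop, a common
# configuration for paralinguistic analysis.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# First and second temporal derivatives are often stacked with the
# static coefficients before feeding a sequence model.
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
print(features.shape)  # (39, n_frames)
```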
Fig. 1. Flowchart of the THERADIA system designed for accompanying patients suffering from cognitive disorders when completing digital therapies; CPL: content process layer, STT: speech-to-text, AVTTS: audiovisual text-to-speech synthesis.
Content Process Layer
The aim of the content process layer is to incrementally detect turn taking opportunities
[16] — using both linguistic and prosodic features [17] — so that the agent can lead the
conversation and stick to the dialogic objectives, especially when it has to give back the
floor after initiating an open-ended question. This component is coordinated with the
“active listening” component responsible for providing incentives and feedback at key
instants of the interlocutor’s speaking time.
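A minimal sketch of how prosodic and linguistic cues could be fused into a turn-yielding decision is given below; the thresholds, weights, and function names are illustrative assumptions, not the component's actual logic.

```python
def turn_yield_probability(pause_ms: float,
                           final_pitch_slope: float,
                           is_syntactically_complete: bool) -> float:
    """Toy fusion of prosodic and linguistic turn-taking cues.

    pause_ms: duration of the current silence after the user's speech.
    final_pitch_slope: pitch slope (Hz/s) over the last voiced region;
        falling pitch often signals the end of a turn.
    is_syntactically_complete: whether the partial transcript reads as
        a complete utterance (e.g., from an incremental NLU module).
    """
    score = 0.0
    score += min(pause_ms / 700.0, 1.0) * 0.5            # longer pause -> more likely done
    score += 0.25 if final_pitch_slope < 0 else 0.0      # falling intonation
    score += 0.25 if is_syntactically_complete else 0.0  # linguistic completeness
    return score

# The agent takes the floor only when the fused score crosses a threshold,
# e.g. to recover the turn after asking an open-ended question.
if turn_yield_probability(pause_ms=800, final_pitch_slope=-12.0,
                          is_syntactically_complete=True) > 0.7:
    print("agent takes the turn")
```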
Dialog Management
The Dialog Management is composed of a Natural Language Understanding (NLU)
module that interprets the patient’s speech and emotions and passes them to the dia-
logue engine, which runs encoded dialogues through the viky.ai platform. Interaction scenarios are defined, using a dialogue editor, by a group of speech therapists who are experienced in running cognitive remediation sessions.
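The information flow from NLU to dialogue engine can be summarised as below. This is a hypothetical sketch: the data structure and routing function are assumptions for illustration, and the actual scenario selection is delegated to the dialogues encoded in viky.ai, whose API is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NLUResult:
    """Hypothetical container for what the NLU passes to the dialogue engine."""
    intent: str                                               # e.g. "ask_for_help"
    entities: Dict[str, str] = field(default_factory=dict)
    emotion: Dict[str, float] = field(default_factory=dict)   # affect/appraisal scores

def route_dialogue(result: NLUResult) -> str:
    """Toy routing: pick the next scripted dialogue branch from the NLU output."""
    if result.emotion.get("distress", 0.0) > 0.7:
        return "reassurance_dialogue"
    if result.intent == "ask_for_help":
        return "exercise_hint_dialogue"
    return "default_continuation"

print(route_dialogue(NLUResult(intent="ask_for_help", emotion={"distress": 0.2})))
```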
Detection of Emotions, Mood, and other Mental States
Emotional AI is mainly driven by either categorical labels of emotion [18], or dimen-
sional representations of core affect [19]. Appraisal theories, however, suggest that these approaches are too reductive to capture the complexity and range of human emotions. According to the Component Process Model (CPM) [20], emotions
can be distinguished by sequences of cognitive appraisals, where apparent motor
expressions produced by individuals (facial expressions, voice prosody) contain key
markers of appraisal sequences [21]. Our emotion recognition system is based on Re-
current Neural Networks that model contextual dependencies between signals and la-
bels as defined in the CPM.
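The snippet below is a minimal PyTorch sketch of such a recurrent model, mapping frame-level descriptors to continuous appraisal labels. The layer sizes, the number of appraisal dimensions, and the class name are assumptions, not the trained THERADIA model.

```python
import torch
import torch.nn as nn

class AppraisalRNN(nn.Module):
    """Minimal recurrent regressor from frame-level descriptors to appraisal labels.

    Illustrative only: input size (39 = MFCCs + deltas), hidden size, and the
    four output dimensions (e.g. novelty, pleasantness, goal conduciveness,
    coping) are assumptions.
    """
    def __init__(self, n_features: int = 39, n_hidden: int = 64, n_appraisals: int = 4):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(n_hidden, n_appraisals)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, n_features) -> (batch, n_frames, n_appraisals)
        hidden_states, _ = self.rnn(x)
        return self.head(hidden_states)

model = AppraisalRNN()
frames = torch.randn(2, 500, 39)   # 2 sequences of 500 feature frames
print(model(frames).shape)         # torch.Size([2, 500, 4])
```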
Expressive AudioVisual Text-To-Speech synthesis
Typical text-to-speech synthesisers are based on end-to-end approaches where the syn-
thetic speech signal is produced in two steps: a sequence-to-sequence model (SSM) maps a sequence of characters to a sequence of mel-spectrogram frames — e.g., the Tacotron2 encoder-attention-decoder framework [22] — which a neural vocoder then converts into an acoustic signal. Our system is based on a state-of-the-art TTS trained on 100
hours of audiobooks that is being extended to include: (i) a joint prediction of spectro-
grams and animation parameters using multi-head attention, (ii) expressive embeddings
[23], and (iii) behavioral alignment by biasing part of the expressive embeddings with
features of the conversational partner [24].
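The two-stage structure (text to mel-spectrogram, then mel-spectrogram to waveform) can be summarised as in the sketch below. The acoustic_model and vocoder arguments stand for pre-trained networks (e.g. a Tacotron2-style SSM and a neural vocoder); the function and the toy stand-ins are illustrative assumptions, not the project's models or their interfaces.

```python
import numpy as np

def synthesise(text: str, acoustic_model, vocoder, style_embedding=None) -> np.ndarray:
    """Conceptual two-stage TTS pipeline (placeholders, not a real API).

    acoustic_model: sequence-to-sequence network mapping characters, plus an
        optional expressive/style embedding, to mel-spectrogram frames (and, in
        THERADIA, jointly to facial animation parameters).
    vocoder: neural vocoder mapping mel-spectrogram frames to a waveform.
    """
    mel_frames = acoustic_model(text, style=style_embedding)  # (n_frames, n_mels)
    waveform = vocoder(mel_frames)                            # (n_samples,)
    return waveform

# Toy stand-ins so the sketch runs end to end.
toy_acoustic = lambda text, style=None: np.zeros((len(text) * 5, 80))  # 80 mel bins
toy_vocoder = lambda mel: np.zeros(mel.shape[0] * 256)                 # 256-sample hop
print(synthesise("Bonjour", toy_acoustic, toy_vocoder).shape)
```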
Automatic Feedback Generation
Based on examples provided by domain experts, we aim to automatically generate re-
ports summarising the cognitive remediation carried out at home by patients. This task can be decomposed into the following steps: (i) identify, aggregate, select, and structure
the relevant information to communicate, (ii) transform this structured information into
a coherent multimedia document, and (iii) adapt the production to the generation criteria
such as type of recipients (therapist or caregiver), period to be summarised, and length
of the summary.
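Step (i) can be illustrated by the sketch below, which aggregates per-exercise logs into a structured summary whose level of detail depends on the recipient. Field names, the record structure, and the therapist/caregiver split are assumptions for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class ExerciseRecord:
    """Hypothetical per-exercise log entry collected during a home session."""
    name: str
    score: float          # normalised performance, 0-1
    duration_min: float
    interrupted: bool

def aggregate_for_report(records: List[ExerciseRecord], recipient: str) -> dict:
    """Select and structure the relevant facts before text generation.

    A therapist-oriented summary keeps per-exercise detail; a caregiver-oriented
    one keeps only global, non-technical indicators (illustrative choice).
    """
    summary = {
        "n_exercises": len(records),
        "total_minutes": round(sum(r.duration_min for r in records), 1),
        "mean_score": round(mean(r.score for r in records), 2),
        "interruptions": sum(r.interrupted for r in records),
    }
    if recipient == "therapist":
        summary["per_exercise"] = [
            {"name": r.name, "score": r.score, "interrupted": r.interrupted}
            for r in records
        ]
    return summary

records = [ExerciseRecord("memory_span", 0.72, 12.0, False),
           ExerciseRecord("attention_grid", 0.55, 15.5, True)]
print(aggregate_for_report(records, recipient="caregiver"))
```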
Experimental Data Collection Using a Wizard-of-Oz
The collection of data in situ from the population of interest is essential for the development of the THERADIA virtual assistant: it provides ground truth behaviors of patients interacting with a simulated embodied conversational agent, together with a near-optimal control policy of the agent, thanks to the cognitive skills of the human pilot. Such data are essential for feeding the multiple trainable components of the autonomous system (emotion detection, STT, TTS, etc.). We are particularly interested in co-adaptation mechanisms, both
short-term (using sets of pre-defined interaction profiles) and long-term (using contin-
uous training).
The virtual assistant is driven in real time by a human pilot, who is filmed by a high-quality camera, cf. Fig. 2. Head movements, gaze, speech, and articulation are captured to drive the 3D avatar, whose rendering is cast to the patient's screen. As introduced
earlier, the virtual assistant does not interfere with the exercising: it intervenes before,
between and after the exercises to provide the patient with instructions, rewards and
feedback. The human pilot's interventions are scripted by the dialogue system, so that
the recorded conversations stick to what the automatic system is capable of, thanks to
a teleprompter: dialogue acts and expected responses of the subject are overlaid onto the pilot's screen, where the subject's face is displayed. Alternatively, the pilot's screen
displays the patient’s screen when he/she is exercising, in order to ground further
interventions. The system logs timestamped dialogue states and continuously monitors
behaviors of conversational partners: we use Dynamixyz® technology to monitor the
facial movements of the human pilot and the patient. A gaze correction is automatically
applied to the pilot’s eye movements for enabling eye contact between the avatar and
the patient. Each exercising session delivers three videos, coming from the human pilot's camera, the patient's camera, and the patient's screen, plus tracked audiovisual features of both partners, and timestamped logs of exercising along with conversation switches and dialogues.
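The artefacts produced by one session (three video streams, tracked features, timestamped dialogue states) can be organised as in the hypothetical manifest below; all field and class names are assumptions used for illustration, not the project's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueEvent:
    """One timestamped dialogue state change logged during a session."""
    timestamp_s: float
    state: str            # e.g. "welcome", "exercise_intro", "debriefing"
    speaker: str          # "pilot" or "patient"
    text: str = ""

@dataclass
class WizardOfOzSession:
    """Hypothetical manifest for the data collected in one exercising session."""
    session_id: str
    pilot_video: str            # path to the pilot camera recording
    patient_video: str          # path to the patient camera recording
    screen_capture: str         # path to the patient screen recording
    pilot_face_track: str       # tracked facial features of the pilot
    patient_face_track: str     # tracked facial features of the patient
    dialogue_log: List[DialogueEvent] = field(default_factory=list)
```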
Fig. 2. A human pilot (left) interacts with a patient (right) through a virtual assistant (on screen in the right image) that is automatically animated based on the facial expressions of the pilot¹.
Conclusion
We have introduced the THERADIA project, whose aim is to endow a system for au-
tonomous cognitive remediation with a conversational agent capable of providing so-
cial presence, coaching, and support when necessary. The challenges facing the implementation and long-term acceptability of such a technology are numerous. The Wizard-of-Oz system fulfills a twofold purpose: iteratively assessing scripted dialogues, and providing trainable components (turn management, dialogue, active listening, TTS, STT, etc.) with ground truth behaviors. “What to do” (saying is the main but not exclusive
action the system can perform) and “How to do it” are dual problems for the agent. The
human pilot continuously identifies failures of the dialogue management, and their improvisations are systematically analysed in order to improve the ability of the virtual assistant to act and interact appropriately with the patient. The final system has been designed as a videoconference: we will soon be able to monitor in-the-wild interactions, with home environments and on-demand exercising. It also has to be extended to handle all stakeholders (therapists and caregivers) and integrate all experiences into its episodic memory. A final challenge is to explore the impact of this as-
sisted autonomous exercising on the health ecosystem.
¹ See https://www.theradia.fr/AHFE2021 for more pictures and videos.
Acknowledgments
This research has received funding from the Banque Publique d’Investissement (BPI)
under grant agreement THERADIA, the Association Nationale de la Recherche et de la
Technologie (ANRT) under grant agreement No. 2019/0729, and has been partially supported by MIAI@Grenoble-Alpes (ANR-19-P3IA-0003).
References
1. Joubert, C., & Chainay, H. (2018). Aging brain: the effect of combined cognitive and physical
training on cognition as compared to cognitive and physical training alone - a systematic review.
Clinical Interventions in Aging, 13, 1267-1301.
2. Klimová, B., & Vališ, M. (2018). Smartphone applications can serve as effective cognitive training
tools in healthy aging. Frontiers in Aging Neuroscience, 9.
3. van der Linden, S., Sitskoorn, M.M., Rutten, G-J.M., & Gehring, K. (2018). Feasibility of the
evidence-based cognitive telerehabilitation program Remind for patients with primary brain tumors.
Journal of Neuro-Oncology, 137, 523-532.
4. Wilms, I.L. (2020). The computerized cognitive training alliance – A proposal for alliance model for
home-based computerized cognitive training. Heliyon, 6, e03254.
5. Turunen, M., Hokkanen, L., Bäckman, L., Stigsdotter-Neely, A., Hänninen, T., Paajanen, T.,
Soininen, H., Kivipelto, M., & Ngandu, T. (2019). Computer-based cognitive training for older adults: determinants of adherence. PLoS ONE, 14(7), e0219541.
6. Kethuneni, S., August, S.E., Ian Vales, J. (2009). Personal health care assistant/companion in virtual
world. Association for the Advancement of Artificial Intelligence (AAAI), Fall Symposium Series.
7. Vaidyam, A.N., Wisniewski, H., Halamka, J.D., Kashavan, M.S., & Torous, J.B. (2019). Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Canadian Journal of Psychiatry, 64(7), 456-464.
8. Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied conversational agents. MIT
press.
9. Cummins, N., Baird, A., & Schuller, B. W. (2018). Speech analysis for health: Current state-of-the-
art and the increasing impact of deep learning. Methods, 41-54.
10. Ringeval F. et al. (2019). AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting
Depression with AI, and Cross-Cultural Affect Recognition. International Workshop on
Audio/Visual Emotion Challenge, AVEC'19, Nice, France.
11. Swerts, M., & Krahmer, E. (2005). Audiovisual prosody and feeling of knowing. Journal of Memory
and Language, 81-94.
12. Barbulescu, A., Ronfard, R., & Bailly, G. (2017). A generative audio-visual prosodic model for
virtual actors. IEEE computer graphics and applications, 37(6), 40-51.
13. Picard, R. W. (2000). Affective computing. MIT press.
14. Khare, A., Parthasarathy, S., & Sundaram, S. (2020) Self-Supervised learning with cross-modal
transformers for emotion recognition. arXiv preprint arXiv:2011.10652.
15. Siriwardhana, S., Reis, A., Weerasekera, R., & Nanayakkara, S. (2020). Jointly fine-tuning "BERT-like" self-supervised models to improve multimodal speech emotion recognition. arXiv preprint arXiv:2008.06682.
16. Thórisson, K. R. (2002). Natural turn-taking needs no manual: Computational theory and model,
from perception to action. In Multimodality in language and speech systems (pp. 173-207). Springer,
Dordrecht.
17. Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language, 67, 101178.
18. Ekman, P. (1992). Facial expressions of emotion: New findings, new questions. Psychological Science, 3(1), 34-38.
19. Russell, J. A. (1997). Reading emotions from and into faces: Resurrecting a dimensional-contextual
perspective, In J. A. Russell & J. M. Fernández-Dols (Eds.), Studies in emotion and social
interaction. The psychology of facial expression (p. 295–320). CUP.
20. Scherer, K. R. (2009). The dynamic architecture of emotion: Evidence for the component process
model. Cognition and emotion, 23(7), 1307-1351.
21. Scherer, K. R., Dieckmann, A., Unfried, M., Ellgring, H., & Mortillaro, M. (2019). Investigating
appraisal-driven facial expression and inference in emotion communication. Emotion, 21(1), 73.
22. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783.
23. Tits, N., Wang, F., El Haddad, K., Pagel, V., & Dutoit, T. (2019). Visualization and Interpretation
of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis. Interspeech,
4475-4479.
24. Stanton, D., Wang, Y., & Skerry-Ryan, R. J. (2018). Predicting expressive speaking style from text
in end-to-end speech synthesis. IEEE Spoken Language Technology Workshop (SLT), 595-602.