Project

GOMINOLA: User-awareness and adaptation in affective conversational agents based on microservices

Goal: GOMINOLA-UPM will specifically address the objective of developing user-aware, affective and adaptive conversational agents. All developments of the proposed technologies and capabilities will be handled and exposed as microservices through standardized interfaces between potential applications (e.g. a demonstrator) and our microservices architecture (a minimal interface sketch is given below).
Conversational assistants should be user-aware to the extent that they recognize, at least, the user's identity and the emotional states that are relevant in a given interaction domain. Our efforts in this regard will focus on researching signal processing and new algorithms for the development of: (1) advanced speech recognition techniques for rapidly adapting to the dialogue context with a multilevel language model that deals with out-of-vocabulary (OOV) words; (2) advanced multimodal speaker diarization and attribution technology for speaker identification and subsequent characterization; and (3) new algorithms for activity recognition based on inertial sensors for better detection of the type of movement and user characteristics.
Enhancing conversational agents (CAs) with social and emotional capabilities is essential for understanding and improving CA users' affective experience. Our specific research aims in this regard are the development of: (1) novel multimodal emotion recognition models for generating emotion-aware dialogues that require an understanding of the user's emotions; (2) novel multimodal trustworthiness computational models for the development of more emotionally intelligent and engaging conversational interfaces; (3) sentiment analysis models as alternative and novel proxies for analyzing the opinions that users have about the interaction and for estimating its quality; and (4) novel models exploring the relation between multimedia content and different affect-defining variables, such as aesthetics, memorability and attention, enabling smarter responses with proactive content suggestions.
Conversational assistants should also be user-adaptive to the extent that they dynamically adapt their dialogue behavior and responses according to the user and their emotional state. Our research efforts will focus on: (1) technologies for creating open-domain and open-task CAs that can include emotional and persona-based characteristics to provide a more engaging user experience; and (2) the development of novel multi-level, multi-emotion and multi-style neural speech synthesisers to strengthen the capacity of the system to carry out a truly emotional adaptation.
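As an illustration of how one capability could be exposed through a standardized interface, the following minimal sketch wraps a hypothetical emotion-recognition service in a small HTTP microservice. FastAPI, the endpoint path and the request/response fields are assumptions made for illustration, not the project's actual implementation.

```python
# Minimal sketch of exposing one capability as a microservice.
# Assumptions: FastAPI, the endpoint path and the field names are illustrative.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="gominola-emotion-service")  # hypothetical service name

class EmotionRequest(BaseModel):
    utterance_text: str              # transcript of the current user turn
    audio_url: Optional[str] = None  # optional pointer to the audio segment

class EmotionResponse(BaseModel):
    label: str                       # e.g. "neutral", "happy", "angry", ...
    confidence: float

@app.post("/v1/emotion", response_model=EmotionResponse)
def recognize_emotion(req: EmotionRequest) -> EmotionResponse:
    # Placeholder: a real service would call the trained multimodal recognizer here.
    return EmotionResponse(label="neutral", confidence=0.0)
```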

Date: 1 September 2021 - 1 October 2024


Project log

Fernando Fernández-Martínez
added a research item
Intent recognition is a key component of any task-oriented conversational system. The intent recognizer can be used first to classify the user's utterance into one of several predefined classes (intents) that help to understand the user's current goal. Then, the most adequate response can be provided accordingly. Intent recognizers also often appear as joint models that perform the natural language understanding and dialog management tasks together as a single process, thus simplifying the set of problems that a conversational system must solve. This is especially true for frequently asked question (FAQ) conversational systems. In this work, we first present an exploratory analysis in which different deep learning (DL) models for intent detection and classification were evaluated. In particular, we experimentally compare and analyze conventional recurrent neural networks (RNN) and state-of-the-art transformer models. Our experiments confirmed that the best performance is achieved with transformers; specifically, by fine-tuning the so-called BETO model (a Spanish pretrained bidirectional encoder representations from transformers (BERT) model from the Universidad de Chile) on our intent detection task. Then, as the main contribution of the paper, we analyze the effect of inserting unseen domain words to extend the vocabulary of the model as part of the fine-tuning or domain-adaptation process. In particular, a very simple word-frequency cut-off strategy is experimentally shown to be a suitable method for driving the vocabulary learning decisions over unseen words. The results of our analysis show that the proposed method helps to effectively extend the original vocabulary of the pretrained models. We validated our approach with a selection of the corpus acquired with the Hispabot-Covid19 system, obtaining satisfactory results.
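As a hedged illustration of the vocabulary-extension step described in the abstract above, the sketch below adds frequent unseen domain words to the pretrained BETO tokenizer and resizes the model's embedding matrix before fine-tuning. The checkpoint name is the public BETO model on the Hugging Face Hub; the placeholder corpus, frequency cut-off value and number of intent labels are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of vocabulary extension before fine-tuning BETO for intent
# classification. The cut-off, label count and corpus below are illustrative.
from collections import Counter

from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # public BETO checkpoint
FREQ_CUTOFF = 5  # keep unseen domain words occurring at least this often (assumed value)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=20)

# In-domain utterances (placeholder for the actual FAQ corpus)
domain_corpus = ["ejemplo de consulta del dominio", "otra consulta del dominio"]
counts = Counter(word for utt in domain_corpus for word in utt.lower().split())

# "Unseen" words: frequent domain words that are not whole tokens in the
# pretrained vocabulary (the tokenizer would otherwise split them into subwords).
new_words = [
    w for w, c in counts.items()
    if c >= FREQ_CUTOFF and tokenizer.convert_tokens_to_ids(w) == tokenizer.unk_token_id
]

num_added = tokenizer.add_tokens(new_words)    # extend the original vocabulary
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are learned during fine-tuning
print(f"Added {num_added} domain words to the vocabulary")
```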
Fernando Fernández-Martínez
added a research item
Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we propose an automatic emotion recognition system that consists of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that training is more robust when it does not start from scratch and the network's prior knowledge is similar to the target task. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis reported that the visual systems could improve with a detector of high-emotional-load frames, which opens a new line of research to discover new ways to learn from videos. Finally, combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. Results demonstrated that these modalities carry relevant information to detect users' emotional state and that their combination allowed us to improve the final system performance.
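The sketch below illustrates, under stated assumptions, the speech branch described in the abstract above: a pretrained xlsr-Wav2Vec2.0 encoder with a multilayer perceptron appended on top for eight-way emotion classification. The pooling strategy, hidden sizes and dropout are assumptions; the paper's exact head may differ.

```python
# Hedged sketch of a SER built on xlsr-Wav2Vec2.0 with an appended MLP head.
# Pooling, hidden sizes and dropout are assumptions, not the paper's settings.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, num_emotions: int = 8):
        super().__init__()
        # Multilingual xlsr checkpoint; fine-tuned end-to-end (not frozen).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
        hidden = self.encoder.config.hidden_size  # 1024 for the large xlsr model
        self.head = nn.Sequential(                # MLP classification head
            nn.Linear(hidden, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_emotions),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw 16 kHz audio
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                        # simple mean pooling over time
        return self.head(pooled)                           # (batch, num_emotions) logits
```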
Fernando Fernández-Martínez
added a research item
Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as healthcare or road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that training is more robust when it does not start from scratch and the tasks are similar. Regarding the facial emotion recognizer, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. The error analysis reported that frame-based systems can present some problems when used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users' emotional state and that their combination enables improvement of the system performance.
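As a minimal sketch of the late fusion strategy mentioned in the abstract above, the example below combines the per-class probabilities of the speech and facial recognizers at decision level with a simple convex combination. The weighting scheme and the probability values are illustrative assumptions; the paper's exact fusion may differ.

```python
# Minimal sketch of decision-level (late) fusion of two emotion recognizers.
# The convex-combination weighting and the example probabilities are assumptions.
import numpy as np

def late_fusion(p_speech: np.ndarray, p_face: np.ndarray, w_speech: float = 0.5) -> int:
    """Combine per-class probabilities from the two modalities and pick a label.

    p_speech, p_face: arrays of shape (num_emotions,) that each sum to 1.
    w_speech: weight given to the speech modality (assumed value).
    """
    fused = w_speech * p_speech + (1.0 - w_speech) * p_face
    return int(np.argmax(fused))

# Example with 8 emotion classes (as in RAVDESS): the fused decision can differ
# from either single-modality decision when the modalities disagree.
p_s = np.array([0.05, 0.40, 0.10, 0.10, 0.10, 0.10, 0.10, 0.05])  # speech favors class 1
p_f = np.array([0.05, 0.10, 0.50, 0.10, 0.10, 0.05, 0.05, 0.05])  # face favors class 2
print(late_fusion(p_s, p_f, w_speech=0.5))  # index of the highest fused score (2 here)
```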
Fernando Fernández-Martínez
added a project goal