About
Publications: 210
Reads: 46,680
Citations: 10,436 (since 2017)
Introduction
Additional affiliations
September 2015 - present
September 2009 - August 2015
Publications (210)
The expression and perception of human emotions are not uniformly distributed over time. Therefore, tracking local changes of emotion within a segment can lead to better models for speech emotion recognition (SER), even when the task is to provide a sentence-level prediction of the emotional content. A challenge to exploring local emotional chang...
In autonomous, as well as manually operated vehicles, monitoring the driver’s visual attention provides useful information about the behavior, intent, and vigilance level of the driver. The gaze of the driver can be formulated in terms of a probabilistic visual map representing the region around which the driver’s attention is focused. The area of the...
Emotional voice conversion (EVC) aims to convert the emotional state of an utterance from one emotion to another while preserving the linguistic content and speaker identity. Current studies mostly focus on modelling the conversion between several specific emotion types. Synthesizing mixed effects of emotions could help us to better imitate human e...
Deep learning approaches for medical image analysis are limited by small data set size due to multiple factors such as patient privacy and difficulties in obtaining expert labelling for each image. In medical imaging system development pipelines, phases for system development and classification algorithms often overlap with data collection, creatin...
Advancing speech emotion recognition (SER) depends highly on the source used to train the model, i.e., the emotional speech corpora. By permuting different design parameters, researchers have released versions of corpora that attempt to provide a "better-quality" source for training SER. In this work, we focus on studying communication modes of col...
The prediction of valence from speech is an important, but challenging problem. The expression of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from s...
Emotion recognition using audiovisual features is a challenging task for human-machine interaction systems. Under ideal conditions (perfect illumination, clean speech signals, and non-occluded visual data) many systems are able to achieve reliable results. However, few studies have considered developing multimodal systems and training strategies to...
Previous studies on speech emotion recognition (SER) with categorical emotions have often formulated the task as a single-label classification problem, where the emotions are considered orthogonal to each other. However, previous studies have indicated that emotions can co-occur, especially for more ambiguous emotional sentences (e.g., a mixture of...
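A minimal sketch of the distinction this abstract draws between a single-label and a multi-label formulation of categorical SER, assuming a fixed-size sentence embedding; the layer sizes, emotion count, and use of soft co-occurrence targets are illustrative assumptions, not details from the publication.

```python
# Sketch: single-label vs. multi-label emotion classification heads (PyTorch).
# Embedding size, number of emotions, and loss choices are illustrative assumptions.
import torch
import torch.nn as nn

EMB_DIM, NUM_EMOTIONS = 256, 8

# Single-label view: one softmax over mutually exclusive emotion classes.
single_label_head = nn.Linear(EMB_DIM, NUM_EMOTIONS)
single_label_loss = nn.CrossEntropyLoss()

# Multi-label view: independent sigmoid per emotion, so emotions may co-occur.
multi_label_head = nn.Linear(EMB_DIM, NUM_EMOTIONS)
multi_label_loss = nn.BCEWithLogitsLoss()

emb = torch.randn(4, EMB_DIM)                        # batch of sentence embeddings
hard_targets = torch.randint(0, NUM_EMOTIONS, (4,))  # one class per sentence
soft_targets = torch.rand(4, NUM_EMOTIONS)           # e.g., fraction of annotators per emotion

loss_single = single_label_loss(single_label_head(emb), hard_targets)
loss_multi = multi_label_loss(multi_label_head(emb), soft_targets)
print(float(loss_single), float(loss_multi))
```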
Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pretext tasks for SSL to improve SER. While our target application is SER, the propo...
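To illustrate what a self-supervised pretext task on unlabeled speech can look like, here is a sketch of a generic one: deciding whether two feature chunks are temporally adjacent in the same utterance. This is not the pretext task proposed in the publication; window sizes and feature shapes are arbitrary assumptions.

```python
# Generic SSL pretext task sketch: label chunk pairs as adjacent (1) or not (0),
# so an encoder can be pretrained on unlabeled speech before SER fine-tuning.
import numpy as np

def make_pretext_pairs(features, win=100, n_pairs=32, rng=None):
    """features: (T, D) frame-level features from one unlabeled utterance."""
    rng = rng or np.random.default_rng(0)
    pairs, labels = [], []
    T = features.shape[0]
    for _ in range(n_pairs):
        t = rng.integers(0, T - 2 * win)
        a = features[t:t + win]
        if rng.random() < 0.5:                     # positive: the chunk that follows
            b, y = features[t + win:t + 2 * win], 1
        else:                                      # negative: chunk from a random position
            s = rng.integers(0, T - win)
            b, y = features[s:s + win], 0
        pairs.append((a, b))
        labels.append(y)
    return pairs, np.array(labels)

pairs, labels = make_pretext_pairs(np.random.randn(1000, 40))
print(len(pairs), labels[:5])
```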
A smart vehicle should be able to monitor the actions and behaviors of the human driver to provide critical warnings or intervene when necessary. Recent advancements in deep learning and computer vision have shown great promise in monitoring human behavior and activities. While these algorithms work well in a controlled environment, naturalistic dr...
Head pose estimation is an important problem as it facilitates tasks such as gaze estimation and attention modeling. In the automotive context, head pose provides crucial information about the driver’s mental state, including drowsiness, distraction and attention. It can also be used for interaction with in-vehicle infotainment systems. While compu...
A challenging task in audiovisual emotion recognition is to implement neural network architectures that can leverage and fuse multimodal information while temporally aligning modalities, handling missing modalities, and capturing information from all modalities without losing information during training. These requirements are important to achieve...
The decision of ground truth for speech emotion recognition (SER) is still a critical issue in affective computing tasks. Previous studies on emotion recognition often rely on consensus labels after aggregating the classes selected by multiple annotators. It is common for a perceptual evaluation conducted to annotate emotional corpora to include th...
Studying the facial expressions of humans has been one of the major applications of computer vision. An open question is whether common machine learning techniques can also be used to track behaviors of animals, which is a less explored research problem. Since animals are not capable of verbal communication, computer vision solutions can provide va...
Anomaly driving detection is an important problem in advanced driver assistance systems (ADAS). It is important to identify potential hazard scenarios as early as possible to avoid potential accidents. This study proposes an unsupervised method to quantify driving anomalies using a conditional generative adversarial network (GAN). The approach pred...
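A compact sketch of the general idea of unsupervised anomaly scoring with a conditional generator: predict the next window of driving signals from the previous window and use the mismatch with the window actually observed as the anomaly score. Network sizes, the choice of signals, the omission of adversarial training, and the scoring rule are illustrative assumptions, not the publication's design.

```python
# Sketch: conditional generator predicts future driving signals; prediction error
# on the observed window serves as an unsupervised anomaly score.
import torch
import torch.nn as nn

WIN, N_SIG = 50, 6   # window length and number of vehicle signals (assumed)

generator = nn.Sequential(            # maps a flattened past window to a future window
    nn.Linear(WIN * N_SIG, 128), nn.ReLU(),
    nn.Linear(128, WIN * N_SIG),
)

def anomaly_score(past, future):
    """past, future: tensors of shape (WIN, N_SIG). Higher score = more anomalous."""
    with torch.no_grad():
        pred = generator(past.reshape(1, -1)).reshape(WIN, N_SIG)
    return torch.mean((pred - future) ** 2).item()

past = torch.randn(WIN, N_SIG)
future = torch.randn(WIN, N_SIG)
print(anomaly_score(past, future))
```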
The prediction of valence from speech is an important, but challenging problem. The externalization of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction f...
Visual attention is one of the most important aspects related to driver distraction. Predicting the driver's visual attention can help a vehicle understand the awareness state of the driver, providing important contextual information. While estimating the exact gaze direction is difficult in the car environment, a coarse estimation of the visual at...
Driving anomaly detection aims to identify objects, events or actions that can increase the risk of accidents, reducing road safety. While supervised approaches can effectively identify aspects related to driving anomalies, it is unfeasible to tabulate and address all potential driving anomalies. Instead, it is appealing to design unsupervised appr...
This study proposes the novel formulation of measuring emotional similarity between speech recordings. This formulation explores the ordinal nature of emotions by comparing emotional similarities instead of predicting an emotional attribute, or recognizing an emotional category. The proposed task determines which of two alternative samples has the...
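A sketch of how an ordinal, similarity-based formulation can be trained: given an anchor recording, the sample judged more emotionally similar should lie closer in an embedding space than the less similar one. The encoder, feature dimensionality, and margin are illustrative assumptions, not the architecture from the publication.

```python
# Sketch: triplet-style training for emotional similarity between speech recordings.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(88, 64), nn.ReLU(), nn.Linear(64, 32))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(16, 88)      # acoustic features of the reference recordings
similar = torch.randn(16, 88)     # alternative judged more similar to the anchor
dissimilar = torch.randn(16, 88)  # alternative judged less similar

loss = triplet_loss(encoder(anchor), encoder(similar), encoder(dissimilar))
loss.backward()
```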
Speech emotion recognition (SER) is an important research area, with direct impacts in applications of our daily lives, spanning education, health care, security and defense, entertainment, and human–computer interaction. The advances in many other speech signal modeling tasks, such as automatic speech recognition, text-to-speech synthesis, and spe...
In contrast to previous studies that focused on classical machine learning algorithms and hand-crafted features, we present an end-to-end neural network classification method able to accommodate lesion heterogeneity for improved oral cancer diagnosis using multispectral autofluorescence lifetime imaging (maFLIM) endoscopy. Our method uses an autoen...
Multispectral autofluorescence lifetime imaging (maFLIM) can be used to clinically image a plurality of metabolic and biochemical autofluorescence biomarkers of oral epithelial dysplasia and cancer. This study tested the hypothesis that maFLIM-derived autofluorescence biomarkers can be used in machine-learning (ML) models to discriminate dysplastic...
The generation power of generative adversarial networks (GANs) has shown great promise for learning representations from unlabelled data while guided by a small amount of labelled data. We aim to utilise the generation power of GANs to learn audio representations. Most existing studies are, however, focused on images. Some studies use GANs for s...
A critical issue of current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling for speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences with varied length. Traditional methods rely on...
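A sketch of one standard sequence-to-one reduction for this setting: attention-weighted pooling that maps a variable-length acoustic feature sequence to a fixed-size sentence vector, with a mask so padded frames do not contribute. This shows the general technique only; dimensions and the scoring layer are assumptions, not necessarily the method proposed in the publication.

```python
# Sketch: masked attention pooling over variable-length frame sequences (PyTorch).
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, feat_dim=40):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # one scalar relevance score per frame

    def forward(self, x, lengths):
        # x: (batch, max_len, feat_dim); lengths: (batch,) true number of frames
        mask = torch.arange(x.size(1))[None, :] < lengths[:, None]
        scores = self.score(x).squeeze(-1).masked_fill(~mask, float('-inf'))
        weights = torch.softmax(scores, dim=1)               # (batch, max_len)
        return torch.sum(weights.unsqueeze(-1) * x, dim=1)   # (batch, feat_dim)

pool = AttentivePooling()
x = torch.randn(3, 120, 40)                 # padded batch of frame-level features
lengths = torch.tensor([120, 80, 45])
print(pool(x, lengths).shape)               # torch.Size([3, 40])
```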
A smart vehicle should be able to understand human behavior and predict their actions to avoid hazardous situations. Specific traits in human behavior can be automatically predicted, which can help the vehicle make decisions, increasing safety. One of the most important aspects pertaining to the driving task is the driver's visual attention. Predic...
A smart vehicle should be able to monitor the actions and behaviors of the human driver to provide critical warnings or intervene when necessary. Recent advancements in deep learning and computer vision have shown great promise in monitoring human behaviors and activities. While these algorithms work well in a controlled environment, naturalistic d...
Social impairment is a cardinal feature of schizophrenia spectrum disorders (SZ). Smaller social network size, diminished social skills, and loneliness are highly prevalent. Existing, gold-standard assessments of social impairment in SZ often rely on self-reported information that depends on retrospective recall and detailed accounts of complex soc...
Social impairment is a cardinal feature of schizophrenia spectrum disorders (SZ). Smaller social network size, diminished social skills, and loneliness are highly prevalent. Existing, gold-standard assessments of social impairment in SZ often rely on self-reported information that depends on retrospective recall and detailed accounts of complex soc...
Face analysis is an important area in affective computing. While studies have reported important progress in detecting emotions from still images, an open challenge is to determine emotions from videos, leveraging the dynamic nature in the externalization of emotions. A common approach in earlier studies is to individually process each frame of a v...
An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, the surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to address this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper pr...
Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. For example, systems that show superior performance on certain databases show poor performance when tested on other corpo...
New developments in advanced driver assistance systems (ADAS) can help drivers deal with risky driving maneuvers, preventing potential hazard scenarios. A key challenge in these systems is to determine when to intervene. While there are situations where the need for intervention or feedback is clear (e.g., lane departure), it is often difficult to...
Wavelet Packets (WPs) bases are explored seeking new discriminative features for texture indexing. The task of WP feature design is formulated as a learning decision problem by selecting the filter-bank structure of a basis (within a WPs family) that offers an optimal balance between estimation and approximation errors. To address this problem, a c...
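A sketch of the basic building block behind this line of work: extracting wavelet packet (WP) subband energies from an image as simple texture features, using PyWavelets. The wavelet, decomposition depth, and energy statistic are assumptions, and the basis-selection criterion discussed in the publication is not reproduced here.

```python
# Sketch: wavelet packet subband energies as texture features (PyWavelets).
import numpy as np
import pywt

def wp_texture_features(image, wavelet="db2", level=2):
    """image: 2-D grayscale array. Returns one energy value per WP subband."""
    wp = pywt.WaveletPacket2D(data=image, wavelet=wavelet,
                              mode="symmetric", maxlevel=level)
    features = {}
    for node in wp.get_level(level, order="natural"):
        coeffs = node.data
        features[node.path] = float(np.mean(coeffs ** 2))   # subband energy
    return features

texture = np.random.rand(64, 64)
feats = wp_texture_features(texture)
print(len(feats), list(feats.items())[:2])
```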
Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of activities, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that is not true in affective computing: the ground truth information that is inferred...
With the transformative technologies and the rapidly changing global R&D landscape, the multimedia and multimodal community is now faced with many new opportunities and uncertainties. With the open source dissemination platform and pervasive computing resources, new research results are being discovered at an unprecedented pace. In addition, the ra...
Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advanta...
Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG)...
Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive...
A common step in the area of speech emotion recognition is to obtain ground-truth labels describing the emotional content of a sentence. The underlying emotion of a given recording is usually unknown, so perceptual evaluations are conducted to annotate its perceived emotion. Each sentence is often annotated by multiple raters, which are aggregated...
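A sketch of the aggregation step this abstract refers to: keeping the majority emotion across annotators and discarding sentences without a clear majority. The tie-handling rule is an assumption; individual corpora apply different consensus criteria.

```python
# Sketch: majority-vote aggregation of annotator labels for one sentence.
from collections import Counter

def majority_label(annotations):
    """annotations: list of emotion labels from different raters for one sentence."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    if votes <= len(annotations) / 2:      # no absolute majority -> ambiguous sentence
        return None
    return label

print(majority_label(["happy", "happy", "neutral"]))   # 'happy'
print(majority_label(["happy", "angry", "neutral"]))   # None (no consensus)
```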
This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually pre...
This study aims to create neutral reference models from synthetic speech to contrast the emotional content of a speech signal. Modeling emotional behaviors is a challenging task due to the variability in perceiving and describing emotions. Previous studies have indicated that relative assessments are more reliable than absolute assessments. These s...
Representing computationally everyday emotional states is a challenging task and, arguably, one of the most fundamental for affective computing. Standard practice in emotion annotation is to ask humans to assign a value of intensity or a class value to each emotional behavior they observe. Psychological theories and evidence from multiple disciplin...
Automatic emotion recognition plays a crucial role in various fields such as healthcare, human-computer interaction (HCI), and security and defense. While most previous studies have focused on the recognition of emotion in isolated utterances, a more natural approach is to continuously track emotions during human interaction, identifying regions...
Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advanta...
Visual information can improve the performance of automatic speech recognition (ASR), especially in the presence of background noise or different speech modes. A key problem is how to fuse the acoustic and visual features leveraging their complementary information and overcoming the alignment differences between modalities. Current audiovisual ASR...
Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), wh...
This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually prese...
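A sketch of the curriculum idea: present training samples from easy to hard in progressively larger subsets. Inter-annotator disagreement is used here as one plausible difficulty proxy; the actual criterion, number of stages, and data layout in the study may differ.

```python
# Sketch: curriculum that grows the training set from easiest to hardest samples.
import numpy as np

def curriculum_batches(samples, difficulty, n_stages=4):
    """samples: list of items; difficulty: per-sample difficulty scores.
    Yields progressively larger training sets, easiest samples first."""
    order = np.argsort(difficulty)                # easy -> hard
    for stage in range(1, n_stages + 1):
        cutoff = int(len(samples) * stage / n_stages)
        yield [samples[i] for i in order[:cutoff]]

samples = [f"utt_{i}" for i in range(8)]
disagreement = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6])
for stage, subset in enumerate(curriculum_batches(samples, disagreement), 1):
    print(f"stage {stage}: {len(subset)} samples")
```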
Recognizing emotions using few attribute dimensions such as arousal, valence and dominance provides the flexibility to effectively represent complex range of emotional behaviors. Conventional methods to learn these emotional descriptors primarily focus on separate models to recognize each of these attributes. Recent work has shown that learning the...
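A sketch of the joint-learning idea mentioned here: one shared acoustic encoder with separate regression heads for arousal, valence, and dominance, trained with a summed loss. Layer sizes and the equal loss weighting are illustrative assumptions, not the architecture from the publication.

```python
# Sketch: multi-task regression of arousal, valence, and dominance (PyTorch).
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, feat_dim=88):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.heads = nn.ModuleDict({
            attr: nn.Linear(128, 1) for attr in ("arousal", "valence", "dominance")
        })

    def forward(self, x):
        h = self.shared(x)
        return {attr: head(h).squeeze(-1) for attr, head in self.heads.items()}

model = MultiTaskSER()
x = torch.randn(16, 88)
targets = {a: torch.rand(16) for a in ("arousal", "valence", "dominance")}
preds = model(x)
loss = sum(nn.functional.mse_loss(preds[a], targets[a]) for a in targets)
loss.backward()
```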
The performance of speech emotion recognition is affected by the differences in data distributions between train (source domain) and test (target domain) sets used to build and evaluate the models. This is a common problem, as multiple studies have shown that the performance of emotional classifiers drop when they are exposed to data that does not...
Human gaze directly signals visual attention; therefore, gaze estimation has been an important research topic in fields such as human attention modeling and human-computer interaction. Accurate gaze estimation requires user-, system-, and even session-dependent parameters, which can be obtained through a calibration process. However, this process has to b...
Audio-based automatic speech recognition (A-ASR) systems are affected by noisy conditions in real-world applications. Adding visual cues to the ASR system is an appealing alternative to improve the robustness of the system, replicating the audiovisual perception process used during human interactions. A common problem observed when using audiovisua...