Article

Deep Learning for Human Affect Recognition: Insights and New Developments

Authors:
Philipp V. Rouast, Marc T. P. Adam, Raymond Chiong

Abstract

Automatic human affect recognition is a key step towards more natural human-computer interaction. Recent trends include recognition in the wild using a fusion of audiovisual and physiological sensors, a challenging setting for conventional machine learning algorithms. Since 2010, novel deep learning algorithms have been applied increasingly in this field. In this paper, we review the literature on human affect recognition between 2010 and 2017, with a special focus on approaches using deep neural networks. By classifying a total of 950 studies according to their usage of shallow or deep architectures, we are able to show a trend towards deep learning. Reviewing a subset of 233 studies that employ deep neural networks, we comprehensively quantify their applications in this field. We find that deep learning is used for learning of (i) spatial feature representations, (ii) temporal feature representations, and (iii) joint feature representations for multimodal sensor data. Exemplary state-of-the-art architectures illustrate the progress. Our findings show the role deep architectures will play in human affect recognition, and can serve as a reference point for researchers working on related applications.
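To make the three representation-learning roles identified in the abstract concrete, here is a minimal PyTorch sketch combining (i) a CNN for spatial features, (ii) a recurrent layer for temporal features, and (iii) a late-fusion head for a joint audiovisual representation. All layer sizes, names, and the fusion strategy are illustrative assumptions, not architectures from the surveyed studies.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusionNet(nn.Module):
    """Illustrative affect model: CNN (spatial) -> GRU (temporal) -> fusion with audio features."""
    def __init__(self, n_classes=7, audio_dim=40):
        super().__init__()
        # (i) spatial feature representation from video frames
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                      # -> (batch*frames, 32)
        )
        # (ii) temporal feature representation over the frame sequence
        self.rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        # (iii) joint representation: concatenate video and audio features
        self.fusion = nn.Sequential(nn.Linear(64 + audio_dim, 64), nn.ReLU(),
                                    nn.Linear(64, n_classes))

    def forward(self, frames, audio):
        # frames: (batch, time, 3, H, W); audio: (batch, audio_dim)
        b, t = frames.shape[:2]
        spatial = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(spatial)               # last hidden state summarises the sequence
        return self.fusion(torch.cat([h[-1], audio], dim=1))

logits = SpatialTemporalFusionNet()(torch.randn(2, 8, 3, 48, 48), torch.randn(2, 40))
```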

... Finally, we have a discussion and explore significant research challenges and opportunities. Various surveys for emotion recognition have been published in recent years [18][19][20][21][22][23][24][25][26][27][28][29]. For FERs, Li et al. [18] investigated state-of-the-art methods for both static and dynamic FERs, and detailed the pipelines of FERs in terms of datasets, preprocessing, deep learning embeddings, and performance comparisons. Patel et al. [19] review machine and deep learning networks for FERs based on static images. ...
... Deng et al. [24] provide a thorough survey that systematically reviews deep learning-based methods for TERs in terms of word embedding, deep learning architectures, training-level approaches, and challenges. In addition, several surveys on multimodal emotion recognition have been proposed [25][26][27][28][29]. In 2017, Poria et al. [27] reviewed both emotion recognition and sentiment analysis from unimodality to multimodality. ...
... EEG, ECG) into consideration. Rouast et al. [26] provide an overview of deep learning-based emotion recognition approaches between 2010 and 2017 applied to visual, auditory, and physiological sensor data. More recently, [28] discusses the progress of research into multimodal emotion recognition based on brain-computer interfaces (BCI). ...
Article
Full-text available
With the advancement of multimedia and human-computer interaction, it has become increasingly crucial to perceive people’s emotional states in dynamic data (e.g., video, audio, text stream) in order to effectively serve them. Emotion recognition has emerged as a prominent research area over the past decades. Traditional methods for emotion recognition heavily rely on manually crafted features and primarily focus on uni-modality. However, these approaches encounter challenges in extracting sufficient discriminative information for complex emotion recognition tasks. To tackle this issue, deep neural model-based methods have gained significant popularity in emotion recognition tasks. These methods leverage deep neural models to automatically learn more discriminative emotional features, thereby addressing the problem of poor discriminability associated with manually designed features. Moreover, deep neural models are also employed to integrate information across multiple modalities, thereby enhancing the extraction of discriminative information. In this paper, we provide a comprehensive review of the relevant studies on deep neural model-based emotion recognition in dynamic data using facial, speech, and textual cues published within the past five years. Specifically, we first explain discretized and continuous representations of emotions by introducing widely accepted emotion models. Subsequently, we elucidate how advanced methods integrate different neural models by scoping these methods according to various popular deep neural models (e.g., the Transformer), along with corresponding preprocessing mechanisms. In addition, we present the development trend by surveying diverse datasets, metrics, and competitive performances. Finally, we have a discussion and explore significant research challenges and opportunities. Our survey bridges the gaps in the literature since existing surveys are narrow in focus, either exclusively covering single-modal methods, solely concentrating on multi-modal methods, overlooking certain aspects of face, speech, and text, or emphasizing outdated methodologies.
... Some studies focus on the sensors employed for data collection [30,31], while others examine the types of machine learning algorithms used for data analysis. For example, the 2021 study by Rouast P. et al., which focuses on deep learning approaches to human affect recognition, provides valuable insights and developments [32]. Similarly, the study by Ahmed N. et al., published in 2023, systematically reviews studies on multimodal emotion recognition employing machine learning algorithms [33]. ...
... This study was conducted following the essential guidelines for performing a systematic literature review (SLR) proposed by Kitchenham B. [32]. The key procedural steps involve well-structured planning, execution, and detailed reporting, as shown in Figure 1. ...
Article
Full-text available
This systematic literature review delves into the extensive landscape of emotion recognition, sentiment analysis, and affective computing, analyzing 609 articles. Exploring the intricate relationships among these research domains, and leveraging data from four well-established sources—IEEE, Science Direct, Springer, and MDPI—this systematic review classifies studies in four modalities based on the types of data analyzed. These modalities are unimodal, multi-physical, multi-physiological, and multi-physical–physiological. After the classification, key insights about applications, learning models, and data sources are extracted and analyzed. This review highlights the exponential growth in studies utilizing EEG signals for emotion recognition, and the potential of multimodal approaches combining physical and physiological signals to enhance the accuracy and practicality of emotion recognition systems. This comprehensive overview of research advances, emerging trends, and limitations from 2018 to 2023 underscores the importance of continued exploration and interdisciplinary collaboration in these rapidly evolving fields.
... In addition, several toolboxes are available for crafting these commonly used audio signal features [25,26]. (b) Deep learning methods recast SER as an end-to-end optimization procedure for learning spatial, temporal, or joint feature representation [27]. Some works hand-craft the short-timeframe-level acoustic features and aggregate these features into an utterance-level representation using recurrent neural networks (RNNs) and local attention modules [28]. ...
... Diverse approaches have been proposed for encoding emotional cues in audio signals, including audio feature extraction methods [25,26] and deep-learning-based spatial, temporal, and joint feature representation methods [27]. ...
Article
Full-text available
Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. Specifically, a cross-attention fusion (CAF) module is designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared to three other fusion modules on three databases. The SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network achieves a better representation than fully training a text processing network. In a future study, improved SER performance could be achieved through the development of a multi-stream representation of emotional cues and the incorporation of a multi-branch fusion mechanism for emotion recognition.
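As an illustration of the kind of cross-attention fusion described above, the following sketch fuses two stream embeddings with PyTorch's MultiheadAttention. It is a generic cross-attention fusion under assumed dimensions, not the paper's CAF module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of fusing two stream embeddings with cross-attention (not the paper's exact CAF)."""
    def __init__(self, dim=128, heads=4, n_classes=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio attends to text
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)  # text attends to audio
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (batch, Ta, dim); text_seq: (batch, Tt, dim)
        a2t, _ = self.attn_a(audio_seq, text_seq, text_seq)   # query=audio, key/value=text
        t2a, _ = self.attn_b(text_seq, audio_seq, audio_seq)  # query=text, key/value=audio
        pooled = torch.cat([a2t.mean(dim=1), t2a.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

out = CrossAttentionFusion()(torch.randn(2, 50, 128), torch.randn(2, 20, 128))
```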
... The progress in technologies such as the Internet of Things (IoT) and the availability of Big Data have strengthened this trend. The developments in deep learning algorithms have yielded significant results in many social signal processing problems [9], [10], [11], [12]. Rouast et al. [9] and Kumar [10] examined the automatic affect recognition of humans using deep neural networks. ...
... The developments in deep learning algorithms have yielded significant results in many social signal processing problems [9], [10], [11], [12]. Rouast et al. [9] and Kumar [10] examined the automatic affect recognition of humans using deep neural networks. They reviewed around 950 affect recognition studies conducted from 2010 to 2017, classified by their use of shallow or deep architectures. ...
Article
Full-text available
Learner engagement is a significant factor determining the success of implementing an intelligent educational network. Currently the use of Massive Open Online Courses has increased because of the flexibility offered by such online learning systems. The COVID period has encouraged practitioners to continue to engage in new ways of online and hybrid teaching. However, monitoring student engagement and keeping the right level of interaction in an online classroom is challenging for teachers. In this paper, we propose an engagement recognition model by combining the image traits obtained from a camera, such as facial emotions, gaze tracking with head pose estimation, and eye blinking rate. In the first step, a face recognition model was implemented. The next stage involved training the facial emotion recognition model using a deep learning convolutional neural network with the FER 2013 dataset. The classified emotions were assigned weights corresponding to the academic affective states. Subsequently, by using Dlib's face detector and shape prediction algorithm, the gaze direction with head pose estimation, eye blinking rate, and status of the eye (closed or open) were identified. Combining all these modalities obtained from the image traits, we propose an engagement recognition system. The experimental results of the proposed system were validated by the quiz score obtained at the end of each session. This model can be used for real-time video processing of students' affective states. The teacher can obtain detailed analytics of engagement statistics in a spreadsheet at the end of the session, thus facilitating the necessary follow-up actions.
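A hypothetical sketch of how the image-derived cues mentioned above (emotion class, gaze, eye state, blink rate) could be combined into a single engagement score; all weights and thresholds are invented for illustration and are not the values used in the paper.

```python
# Hypothetical engagement score combining image-derived cues; all weights are
# illustrative assumptions, not those learned or assigned in the paper.
ACADEMIC_AFFECT_WEIGHTS = {          # emotion -> assumed academic-affect weight
    "neutral": 0.6, "happy": 0.8, "surprise": 0.7,
    "sad": 0.3, "angry": 0.2, "fear": 0.3, "disgust": 0.2,
}

def engagement_score(emotion: str, gaze_on_screen: bool,
                     eyes_open_ratio: float, blink_rate_hz: float) -> float:
    """Combine modality cues into a single 0..1 engagement estimate."""
    affect = ACADEMIC_AFFECT_WEIGHTS.get(emotion, 0.5)
    gaze = 1.0 if gaze_on_screen else 0.2
    blink_penalty = min(blink_rate_hz / 0.5, 1.0)        # frequent blinking lowers the score
    score = 0.5 * affect + 0.3 * gaze + 0.2 * eyes_open_ratio * (1.0 - 0.5 * blink_penalty)
    return max(0.0, min(1.0, score))

print(engagement_score("happy", True, 0.9, 0.2))
```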
... The adoption of facial recognition technology in affective computing began in 2006, focusing on emotion recognition through frontal-view facial images (Pantic & Patras, 2006). With the advent of deep neural networks, there has been a significant enhancement in affect recognition, as these networks facilitate the learning of expressive features and capture various levels of abstraction, thereby improving the accuracy of emotion recognition (Rouast et al. 2019). ...
Conference Paper
Full-text available
Artificial emotional intelligence (AEI) systems, which sense, interpret, and respond to human emotions, are increasingly utilised across various sectors, enhancing interpersonal interactions while raising ethical concerns. This scoping review examines the evolving field of AEI, covering its historical development, current applications, and emerging research opportunities. Our analysis draws from 96 articles spanning multiple disciplines, revealing significant progress from initial scepticism to growing acceptance of AEI as an interdisciplinary study. We highlight AEI's applications in healthcare, where it improves patient care; in marketing, where it enhances customer interactions; and in the love and sex industries, where it facilitates new forms of romantic and erotic engagement. Each sector demonstrates AEI's potential to transform practices and provoke ethical debates. The review provides a framework to understand AEI's sociotechnical implications and identifies future research opportunities regarding trustworthy, privacy-preserving, context-aware, and culturally adaptive AEI systems, with a focus on their profound impact on human relationships.
... Preference learning as an alternative to classification and regression has been commonly applied to modeling player aspects. Numerous neural network architectures to predict the emotional and cognitive states of players have been explored by Rouast et al. (2019). ...
... The field of computer science, particularly within the domains of Social Signal Processing [26], Affective Computing [21], Educational Technology [13], and Human-Robot Interaction (HRI) [9], has increasingly recognized the substantial influence of nonverbal communication in educational contexts. Advanced computational models and machine learning techniques offer novel avenues that may hold distinct advantages over traditional observer or questionnaire assessments of nonverbal behaviors. ...
Preprint
Full-text available
This paper introduces a novel computational approach for analyzing nonverbal social behavior in educational settings. Integrating multimodal behavioral cues, including facial expressions, gesture intensity, and spatial dynamics, the model assesses the nonverbal immediacy (NVI) of teachers from RGB classroom videos. A dataset of 400 30-second video segments from German classrooms was constructed for model training and validation. The gesture intensity regressor achieved a correlation of 0.84, the perceived distance regressor 0.55, and the NVI model 0.44 with median human ratings. The model demonstrates the potential to provide valuable support in nonverbal behavior assessment, approximating the accuracy of individual human raters. Validated against both questionnaire data and trained observer ratings, our models show moderate to strong correlations with relevant educational outcomes, indicating their efficacy in reflecting effective teaching behaviors. This research advances the objective assessment of nonverbal communication behaviors, opening new pathways for educational research.
... Multimodal models, which adeptly fuse information from a diverse range of sources, have yielded promising results and found many applications such as language and vision [27,29,44,17,51], multimedia [2,35,12], affective computing [63,41,64,48,56], robotics [23,49,24], human-computer interaction [43,41], and healthcare diagnosis [40,60,36,14,61]. Multimodal machine learning presents distinctive computational and theoretical research challenges due to the diversity of data sources involved [25,26]. ...
Article
Full-text available
Recent years have witnessed a surge of interest in integrating high-dimensional data captured by multisource sensors, driven by the impressive success of neural networks in integrating multimodal data. However, the integration of heterogeneous multimodal data poses a significant challenge, as confounding effects and dependencies among such heterogeneous data sources introduce unwanted variability and bias, leading to suboptimal performance of multimodal models. Therefore, it becomes crucial to normalize the low- or high-level features extracted from data modalities before their fusion takes place. This paper introduces RegBN, a novel approach for multimodal Batch Normalization with REGularization. RegBN uses the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources. The proposed method generalizes well across multiple modalities and eliminates the need for learnable parameters, simplifying training and inference. We validate the effectiveness of RegBN on eight databases from five research areas, encompassing diverse modalities such as language, audio, image, video, depth, tabular, and 3D MRI. The proposed method demonstrates broad applicability across different architectures such as multilayer perceptrons, convolutional neural networks, and vision transformers, enabling effective normalization of both low- and high-level features in multimodal neural networks. RegBN is available at https://mogvision.github.io/RegBN.
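To illustrate the general idea of using a Frobenius norm to penalize dependency between modality features, here is a rough sketch; it is not the RegBN formulation (see the authors' released code at the link above for that).

```python
import torch

def frobenius_dependency_penalty(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    """Illustrative regulariser: Frobenius norm of the cross-covariance between
    two modality feature batches (a rough proxy for their linear dependency).
    Not the actual RegBN formulation."""
    f_a = f_a - f_a.mean(dim=0, keepdim=True)     # centre each feature dimension
    f_b = f_b - f_b.mean(dim=0, keepdim=True)
    cross_cov = f_a.T @ f_b / (f_a.shape[0] - 1)  # (d_a, d_b) cross-covariance
    return torch.linalg.norm(cross_cov, ord="fro")

# usage: add lambda * penalty to the task loss before fusing the two feature sets
penalty = frobenius_dependency_penalty(torch.randn(32, 64), torch.randn(32, 128))
```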
... This could be overcome by employing various parallel hardware architecture platforms, such as GPUs and FPGAs [16]. In addition, the inherent ambiguity of annotations in affective databases poses a further difficulty [17]. ...
Article
Full-text available
This paper proposes a model-based method for real-time automatic mood estimation in video sequences. The approach is customized by learning the person’s specific facial parameters, which are transformed into facial Action Units (AUs). A model mapping for mood representation is used to describe moods in terms of the PAD space: Pleasure, Arousal, and Dominance. From the intersection of these dimensions, eight octants represent fundamental mood categories. In the experimental evaluation, a stimulus video randomly selected from a set prepared to elicit different moods was played to participants, while the participants’ facial expressions were recorded. From the experiment, Dominance is the dimension least impacted by facial expression, and this dimension could be eliminated from mood categorization. Then, four categories corresponding to the quadrants of the Pleasure–Arousal (PA) plane, “Exalted”, “Calm”, “Anxious” and “Bored”, were defined, with two more categories for the “Positive” and “Negative” signs of the Pleasure (P) dimension. Results showed 73% agreement in the PA categorization and 94% in the P dimension, demonstrating that facial expressions can be used to estimate moods, within these defined categories, and provide cues for assessing users’ subjective states in real-world applications.
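The quadrant labels can be read off directly from the signs of the Pleasure and Arousal values; a small sketch, assuming values centred at zero and the usual quadrant semantics:

```python
def pa_category(pleasure: float, arousal: float) -> str:
    """Map a Pleasure-Arousal point to the four quadrant labels used in the abstract.
    Values are assumed to be centred at 0 (negative vs. positive)."""
    if pleasure >= 0:
        return "Exalted" if arousal >= 0 else "Calm"
    return "Anxious" if arousal >= 0 else "Bored"

def p_sign(pleasure: float) -> str:
    return "Positive" if pleasure >= 0 else "Negative"

print(pa_category(0.4, -0.2), p_sign(0.4))   # Calm Positive
```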
... However, categorising and identifying human emotions is difficult for a machine. Automatic facial emotion detection is the most studied modality [7], but it is hard since everyone exhibits emotion differently. Researchers should also consider real-world challenges such as head position, brightness, backdrop, and occlusion. ...
Article
Full-text available
This research focuses on addressing the challenges of visual emotion analysis and emphasizes the need for improved comprehension and categorization of emotions by machines. The objective is to develop an efficient architecture for emotion classification in real-world scenarios encountered in multimedia retrieval tasks, considering factors like illumination, occlusion, pose variations, small face sizes, multimodal detection, and big data issues. The proposed Dense Blocked Network-based VEDANet architecture surpasses the state-of-the-art on benchmark datasets by leveraging pre-trained CNN architectures to recognize facial features and extract metadata. The exploration of 128 descriptors from a deep residual network further enhances the operations. Using the VEDANet framework, emotions are classified into seven categories with superior accuracy compared to conventional approaches. The study investigates the efficacy of an over-the-top optimization (OTO) layer to enhance emotion classification. The proposed model achieves impressive accuracy scores on diverse datasets, including AffectNet, Google FEC, Yale Face DB, and FER2013, with percentages of 87.30%, 92.75%, 95.07%, and 90.53% respectively. Real-time performance is achieved with exceptional accuracy of 93.46% on live frames, while minimizing turn-around time by optimizing network size and parameters. This research contributes to the advancement of visual emotion analysis, providing a comprehensive and efficient solution for AI-based emotion detection in various applications.
... In recent years, deep learning recognition algorithms have been applied to many domains such as face recognition [6,7], plant disease detection [8,9], autonomous driving [10], etc. Deep learning algorithms rely on deep networks that usually consist of layers of convolution, batch normalization, and activation. These algorithms are able to automatically extract the feature representation of the dataset and make reliable predictions for real-world samples [11][12][13]. Deep learning algorithms are also widely used in SAR ATR, since the network can be trained without human interference and can achieve better results compared to hand-crafted features used in traditional SAR ATR methods [14][15][16]. ...
Article
Full-text available
Generative adversarial network (GAN) can generate diverse and high-resolution images for data augmentation. However, when GAN is applied to the synthetic aperture radar (SAR) dataset, the generated categories are not of the same quality. The unrealistic category will affect the performance of the subsequent automatic target recognition (ATR). To overcome the problem, we propose a reinforced constraint filtering with compensation afterwards GAN (RCFCA-GAN) algorithm to generate SAR images. The proposed algorithm includes two stages. We focus on improving the quality of easily generated categories in Stage 1. Then, we record the categories that are hard to generate and compensate by using traditional augmentation methods in Stage 2. Thus, the overall quality of the generated images is improved. We conduct experiments on the moving and stationary target acquisition and recognition (MSTAR) dataset. Recognition accuracy and Fréchet inception distance (FID) acquired by the proposed algorithm indicate its effectiveness.
... COVAREP provides 72 low-level acoustic features derived from the speech signal, including pitch, energy, spectral envelope, loudness, voice quality, and other characteristics. Both eGeMAPS and COVAREP have been used extensively in the analysis of psychological disorders [103], [104], [105] and affect recognition [106], [107]. LIWC features: We use Linguistic Inquiry and Word Count (LIWC) [108], [109], a text analysis tool that determines the percentage of words in a text that fall into one or more linguistic, psychological, and topical categories. We extract 92 features from the verbal content of each interview. ...
Preprint
Full-text available
To develop reliable, valid, and efficient measures of obsessive-compulsive disorder (OCD) severity, comorbid depression severity, and total electrical energy delivered (TEED) by deep brain stimulation (DBS), we trained and compared random forests regression models in a clinical trial of participants receiving DBS for refractory OCD. Six participants were recorded during open-ended interviews at pre- and post-surgery baselines and then at 3-month intervals following DBS activation. Ground-truth severity was assessed by clinical interview and self-report. Visual and auditory modalities included facial action units, head and facial landmarks, speech behavior and content, and voice acoustics. Mixed-effects random forest regression with Shapley feature reduction strongly predicted severity of OCD, comorbid depression, and total electrical energy delivered by the DBS electrodes (intraclass correlation, ICC = 0.83, 0.87, and 0.81, respectively). When random effects were omitted from the regression, predictive power decreased to moderate for severity of OCD and comorbid depression and remained comparable for total electrical energy delivered (ICC = 0.60, 0.68, and 0.83, respectively). Multimodal measures of behavior outperformed ones from single modalities. Feature selection achieved large decreases in features and corresponding increases in prediction. The approach could contribute to closed-loop DBS that would automatically titrate DBS based on affect measures.
... This unobserved frailty component accounts for individual-specific traits that influence variability in the risk of experiencing recurrent events, such as genetic predispositions or other unmeasured factors. This idea is integrated in parametric frailty models by adding a random effect term to the hazard function (Rouast, Adam, & Chiong, 2019). This means that the hazard function is now a function of the individual-specific frailty term and the baseline hazard. ...
... For example, communicating with the system through natural language provides the model provider with immensely rich information. This kind of rich exchange between user and system will develop even further when users soon start talking instead of only texting with ChatGPT, because voice transmits contextual information to the communication partner, such as gender and emotions (Gunes and Schuller 2013; Rouast et al. 2021; Stern et al. 2021). ...
... In order to fully reflect the characteristics of the optimized federated learning, we selected two commonly used emotion recognition models for the experiments: a deep neural network and an SVM. According to the calculation method in Chapter 3, the deep neural network is set to a 6-layer structure [31], with a predicted computational cost of 862M FLOPs, while the support vector machine algorithm [32] is estimated to require 364M FLOPs. Machine learning methods are used to extract time-domain features, frequency-domain features, and nonlinear features of physiological signals, such as EEG and ECG. ...
Article
Full-text available
Federated learning (FL) is widely used because it is effective at enhancing data privacy. However, there will be many problems in the FL training process, such as poor performance of training models and the model converging too slowly, as the data is typically heterogeneous and the computing capabilities of the participants' devices are different. Here, we propose an optimized FL model paradigm that applies model arithmetic prediction to prevent inefficiency in the training process due to the participants' limited computational resources. The proposed formula for participant selection is based on posterior probabilities and correlation coefficients, which have been validated to reduce data noise and enhance the effect of central model aggregation. In addition, high-quality participant models are selected based on posterior probability, combined with correlation coefficients, which allows the server model to aggregate as many better-performing participant models as possible while avoiding the impact of participants with too much data noise. During the aggregation step, the model loss values and the participant training delay are used as weighting factors for participant devices, which accelerates FL convergence and improves model performance. Data heterogeneity and non-IID data are fully taken into consideration in the proposed method. Finally, these results have been verified by extensive experiments; we demonstrate better performance in the presence of non-IID data, especially for affective computing. Compared with previous research, the method reduces training latency by 4 seconds, and model accuracy is increased by 10% on average.
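A toy sketch of loss- and delay-aware weighted aggregation in the spirit described above; the weighting formula is an assumption for illustration, not the paper's participant-selection or aggregation rule.

```python
import numpy as np

def weighted_aggregate(client_params, client_losses, client_delays):
    """Illustrative federated aggregation: clients with lower loss and lower
    training delay get larger weights (not the paper's exact weighting formula)."""
    losses = np.asarray(client_losses, dtype=float)
    delays = np.asarray(client_delays, dtype=float)
    raw = 1.0 / (1e-8 + losses * delays)          # favour fast, well-fitting clients
    weights = raw / raw.sum()
    # each client's parameters: list of numpy arrays with identical shapes
    return [sum(w * layers[i] for w, layers in zip(weights, client_params))
            for i in range(len(client_params[0]))]

# two toy clients, each with one weight matrix
agg = weighted_aggregate([[np.ones((2, 2))], [np.zeros((2, 2))]], [0.5, 1.0], [1.0, 2.0])
```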
... Multimodal fusion HRI is commonly employed in human emotion recognition, as it captures implicit information conveyed through speech, facial expressions, gestures, and other channels [46]. Mamyrbayev et al. built a speech multimodal recognition system with the help of vision-based lip recognition and speech recognition [47]. ...
Article
Full-text available
Collaborative robots, also known as cobots, are designed to work alongside humans in a shared workspace and provide assistance to them. With the rapid development of robotics and artificial intelligence in recent years, cobots have become faster, smarter, more accurate, and more dependable. They have found applications in a broad range of scenarios where humans require assistance, such as in the home, healthcare, and manufacturing. In manufacturing, in particular, collaborative robots combine the precision and strength of robots with the flexibility of human dexterity to replace or aid humans in highly repetitive or hazardous manufacturing tasks. However, human–robot interaction still needs improvement in terms of adaptability, decision making, and robustness to changing scenarios and uncertainty, especially in the context of continuous interaction with human operators. Collaborative robots and humans must establish an intuitive and understanding rapport to build a cooperative working relationship. Therefore, human–robot interaction is a crucial research problem in robotics. This paper provides a summary of the research on human–robot interaction over the past decade, with a focus on interaction methods in human–robot collaboration, environment perception, task allocation strategies, and scenarios for human–robot collaboration in manufacturing. Finally, the paper presents the primary research directions and challenges for the future development of collaborative robots.
... In recent years, there has been a growing interest in AC, which involves modelling and recognising affective displays in individuals (Rouast, Adam and Chiong, 2021). While early models primarily emphasised the outward expression of emotions, subsequent research has suggested a more comprehensive approach combining expression and physiology (Huang et al., 2019). ...
Thesis
Full-text available
Managing pain poses a significant challenge for many individuals, impacting their overall quality of life. While Virtual Reality (VR) distraction therapy has shown promise in alleviating pain perception, its real-world effectiveness requires further exploration. This research delves into the potential of VR technology and conditioning stimuli to deliver personalised distraction therapy in practical settings, addressing the need for effective pain management interventions. The study consisted of two phases. Initially, participants engaged in a VR experience while wearing a smartwatch that collected physiological data, specifically heart rate. This data was used to train a machine-learning model that estimated participants' distraction levels through affective computing. The model adjusted the VR experience, aiming to maintain participants in a "flow" state while simultaneously triggering vibrations on the wearable device. The second phase assessed the vibrational stimuli's effectiveness as a conditioned stimulus to enhance distraction and mitigate pain. Participants self-reported their pain levels before and after wearing the smartwatch in various scenarios. The study revealed significant findings concerning the reduction of pain, demonstrating a decrease of 41% in overall pain ratings. These results provide valuable insights into the varied responses to pain interventions across different conditions. In conclusion, this research addresses the general pain management problem by investigating the effectiveness of combining VR distraction therapy and conditioning techniques. The primary findings underscore that this integrated approach reduces pain perception in real-world scenarios. These insights have implications for enhancing pain management interventions, improving the well-being and quality of life for individuals struggling with pain.
... As a result, there has been a shift away from discrete emotion recognition towards the prediction of affective states in the continuous dimensional space [10]. A widely embraced dimensional emotion model within this context is the valence-arousal model, proposed by Russell in 1980 [11]. It utilizes the two fundamental dimensions: valence, representing the positive and negative aspects of human emotion, and arousal, indicating the degree of excitement and depression within the emotional scope. ...
Article
Full-text available
Continuous emotion recognition plays a crucial role in developing friendly and natural human-computer interaction applications. However, there exist two significant challenges unresolved in this field: how to effectively fuse complementary information from multiple modalities and capture long-range contextual dependencies during emotional evolution. In this paper, a novel multimodal continuous emotion recognition framework was proposed to address the above challenges. For the multimodal fusion challenge, the Multimodal Attention Fusion (MAF) method is proposed to fully utilize complementarity and redundancy between multiple modalities. To tackle temporal context dependencies, the Local Contextual Temporal Convolutional Network (LC-TCN) and the Global Contextual Temporal Convolutional Network (GC-TCN) were presented. These networks have the ability to progressively integrate multi-scale temporal contextual information from input streams of different modalities. Comprehensive experiments are conducted on the RECOLA and SEWA datasets to assess the effectiveness of our proposed framework. The experimental results demonstrate superior recognition performance compared to state-of-the-art approaches, achieving 0.834 and 0.671 on RECOLA, 0.573 and 0.533 on SEWA in terms of arousal and valence, respectively. These findings indicate a novel direction for continuous emotion recognition by exploring temporal multi-scale information.
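As a rough illustration of multi-scale temporal context modelling with temporal convolutions, the following sketch stacks dilated 1D convolutions with residual connections; dimensions and dilation rates are assumptions, and it is not the paper's LC-TCN/GC-TCN design.

```python
import torch
import torch.nn as nn

class DilatedTemporalBlock(nn.Module):
    """Sketch of a multi-scale temporal convolution stack in the spirit of a TCN;
    layer sizes and dilations are illustrative, not the paper's LC-/GC-TCN."""
    def __init__(self, dim=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)  # same-length output
            for d in dilations
        ])
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, dim, time)
        for conv in self.layers:
            x = x + self.act(conv(x))     # residual connection widens the receptive field
        return x

y = DilatedTemporalBlock()(torch.randn(2, 64, 100))   # -> (2, 64, 100)
```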
... COVAREP provides 72 low-level acoustic features derived from the speech signal, including pitch, energy, spectral envelope, loudness, voice quality, and other characteristics. Both eGeMAPS and COVAREP have been used extensively in the analysis of psychological disorders [99], [100], [101] and affect recognition [102], [103]. LIWC features: We use Linguistic Inquiry and Word Count (LIWC) [104], [105], a text analysis tool that determines the percentage of words in a text that fall into one or more linguistic, psychological, and topical categories. We extract 92 features from the verbal content of each interview. ...
Preprint
Full-text available
To develop reliable, valid, and efficient measures of severity of OCD, comorbid depression, and total electrical energy delivered by deep brain stimulation (DBS), we trained and compared random forests regression models in a clinical trial of participants receiving DBS for refractory OCD. Six participants were recorded during open-ended interviews at pre- and post-surgery baselines and then at 3-month intervals following DBS activation. Ground-truth severity was assessed by clinical interview and self-report. Visual and auditory modalities included facial action units, head and facial landmarks, speech behavior and content, and voice acoustics. Mixed-effects random forest regression with Shapley feature reduction strongly predicted severity of OCD, comorbid depression, and total electrical energy delivered by the DBS electrodes (intraclass correlation, ICC = 0.83, 0.87, and 0.81, respectively). When random effects were omitted from the regression, predictive power decreased to moderate for severity of OCD and comorbid depression and remained comparable for total electrical energy delivered (ICC = 0.60, 0.68, and 0.83, respectively). Multimodal measures of behavior outperformed ones from single modalities. Feature selection achieved large decreases in features and corresponding increases in prediction. The approach could contribute to closed-loop DBS that would automatically titrate DBS based on affect measures.
... Emotion recognition plays a pivotal role in conversational human-machine interaction (HMI) [1], [2], enabling systems to perceive and respond to users' emotional states [3], [4]. Research in affective computing has long proved the possibility of detecting emotion expressions with data-driven approaches using different input modalities, mainly linguistic, acoustic, and facial expressions [5], [6]. Multimodal approaches have also shown promising results in enhancing recognition accuracy and robustness [7]. ...
Preprint
Full-text available
The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging. One of the core aspects of the system is its human sensing capabilities, allowing for the perception of emotional states to provide a personalized experience. This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach, all tailored to the project requirements. With the latter, we investigate the role of various modalities, individually and combined, for discrete emotion expression recognition in this context: speech from audio, and facial expressions, gaze, and head dynamics from video. The collected corpus includes users from Spain, France, and Norway, and was annotated separately for the audio and video channels with distinct emotional labels, allowing for a performance comparison across cultures and label types. Results confirm the informative power of the modalities studied for the emotional categories considered, with multimodal methods generally outperforming others (around 68% accuracy with audio labels and 72-74% with video labels). The findings are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.
... In the case of social robotics, it is worth distinguishing between measuring emotional reactions for research purposes, for example to study the emotions elicited in humans by a robot's physical appearance or behavior, and equipping robots with the ability to perceive the emotional reactions of the people interacting with them during operation and to respond accordingly. Depending on the goal, different methods can be applied; for example, in contrast to the questionnaire surveys common in research, emotion recognition applications running on robots frequently use automated emotion recognition algorithms based on machine learning methods [57]-[59]. ...
Article
Full-text available
With the spread of social robots, making interactions with humans smoother is becoming crucial. An important aspect of human communication is the expression of emotions, as internal states manifested in behavior; recognizing these, and displaying artificially generated, situation-appropriate emotional expressions, is also of great importance for robots. The research, definition, and modeling of human emotions has moved beyond psychology, ethology, and other related disciplines and become intertwined with robotics. In this article, we summarize the widespread emotion recognition and emotion expression methods used in human-robot interaction research, as well as their background in the literature.
... This has inevitably led to a surge of interest in affective computing among a growing cadre of researchers. Supplemented by developing deep learning methodologies and curating relevant datasets, affective computing has penetrated diverse domains spanning education, healthcare, and commerce [2]. Facial expressions stand out as a quintessential manifestation of human emotional states in emotion recognition. ...
Article
Full-text available
Edge computing has shown significant successes in addressing the security and privacy issues related to facial expression recognition (FER) tasks. Although several lightweight networks have been proposed for edge computing, the computing demands and memory access cost (MAC) imposed by these networks hinder their deployment on edge devices. Thus, we propose an edge computing-oriented real-time facial expression recognition network, called EC-RFERNet. Specifically, to improve the inference speed, we devise a mini-and-fast (MF) block based on the partial convolution operation. The MF block effectively reduces the MAC and parameters by processing only a part of the input feature maps and eliminating unnecessary channel expansion operations. To improve the accuracy, the squeeze-and-excitation (SE) operation is introduced into certain MF blocks, and the MF blocks at different levels are selectively connected by the harmonic dense connection. SE operation is used to complete the adaptive channel weighting, and the harmonic dense connection is used to exchange information between different MF blocks to enhance the feature learning ability. The MF block and the harmonic dense connection together constitute the harmonic-MF module, which is the core component of EC-RFERNet. This module achieves a balance between accuracy and inference speed. Five public datasets are used to test the validity of EC-RFERNet and to demonstrate its competitive performance, with only 2.25 MB and 0.55 million parameters. Furthermore, one human–robot interaction system is constructed with a humanoid robot equipped with the Raspberry Pi. The experimental results demonstrate that EC-RFERNet can provide an effective solution for practical FER applications.
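A minimal sketch of the "partial convolution" idea referenced above, in which only a fraction of the channels is convolved and the rest are passed through, reducing FLOPs and memory access; the ratio and sizes are assumptions, not EC-RFERNet's MF block.

```python
import torch
import torch.nn as nn

class PartialConvBlock(nn.Module):
    """Sketch of a partial-convolution block: only the first fraction of the
    channels is convolved, the rest are passed through untouched, which cuts
    FLOPs and memory access. Ratios and sizes are assumptions, not EC-RFERNet's."""
    def __init__(self, channels=64, conv_ratio=0.25):
        super().__init__()
        self.n_conv = int(channels * conv_ratio)
        self.conv = nn.Conv2d(self.n_conv, self.n_conv, kernel_size=3, padding=1)

    def forward(self, x):                         # x: (batch, channels, H, W)
        head, tail = x[:, :self.n_conv], x[:, self.n_conv:]
        return torch.cat([self.conv(head), tail], dim=1)

y = PartialConvBlock()(torch.randn(1, 64, 32, 32))  # same shape out
```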
... Formally, the model is a five-layer AE: a fully-connected (FC) layer with 48 dimensions, a dropout layer with a dropout rate of p = 0.1, an FC layer with 36 dimensions, a dropout layer with a dropout rate of p = 0.05, and an FC layer with 18 dimensions. The FC layers are all followed by a ReLU activation function [45]. Afterward, the latent space of the AE is contaminated with one of the frames before and after it. ...
Preprint
Full-text available
Affective states are reflected in the facial expressions of all mammals. Facial behaviors linked to pain have attracted most of the attention so far in non-human animals, leading to the development of numerous instruments for evaluating pain through facial expressions for various animal species. Nevertheless, manual facial expression analysis is susceptible to subjectivity and bias, is labor-intensive and often necessitates specialized expertise and training. This challenge has spurred a growing body of research into automated pain recognition, which has been explored for multiple species, including cats. In our previous studies, we have presented and studied artificial intelligence (AI) pipelines for automated pain recognition in cats using 48 facial landmarks grounded in cats' facial musculature, as well as an automated detector of these landmarks. However, so far automated recognition of pain in cats has used solely static information obtained from hand-picked single images of good quality. This study takes a significant step forward in fully automated pain detection applications by presenting an end-to-end AI pipeline that requires no manual effort in the selection of suitable images or their landmark annotation. By working with video rather than still images, this new pipeline approach also optimises the temporal dimension of visual information capture in a way that is not practical to perform manually. The presented pipeline reaches over 70% and 66% accuracy respectively in two different cat pain datasets, outperforming previous landmark-based approaches using single frames under similar conditions, indicating that dynamics matter in cat pain recognition. We further define metrics for measuring different dimensions of deficiencies in datasets with animal pain faces, and investigate their impact on the performance of the presented pain recognition AI pipeline.
... Recently, deep learning techniques [25,37,47,48,56] have shown the ability to learn multiple high-level facial expression representations and to identify complex patterns to discriminate human facial expressions. Khorrami et al. ...
Article
Full-text available
The generation of a large human-labelled facial expression dataset is challenging due to ambiguity in labelling the facial expression class, and annotation cost. However, facial expression recognition (FER) systems demand discriminative feature representation, and require many training samples to establish stronger decision boundaries. Recently, FER approaches have used data augmentation techniques to increase the number of training samples for model generation. However, these augmented samples are derived from existing training data, and therefore have limitations for developing an accurate FER system. To achieve meaningful facial expression representations, we introduce an augmentation technique based on deep learning and genetic algorithms for FER. The proposed approach exploits the hypothesis that augmenting the feature-set is better than augmenting the visual data for FER. By evaluating this relationship, we discovered that the genetic evolution of discriminative features for facial expression is significant in developing a robust FER approach. In this approach, facial expression samples are generated from RGB visual data from videos considering human face frames as regions of interest. The face detected frames are further processed to extract key-frames within particular intervals. Later, these key-frames are convolved through a deep convolutional network for feature generation. A genetic algorithm’s fitness function is gauged to select optimal genetically evolved deep facial expression receptive fields to represent virtual facial expressions. The extended facial expression information is evaluated through an extreme learning machine classifier. The proposed technique has been evaluated on five diverse datasets i.e. JAFFE, CK+, FER2013, AffectNet and our application-specific Instructor Facial Expression (IFEV) dataset. Experimentation results and analysis show the promising accuracy and significance of the proposed technique on all these datasets.
... They confirmed that the DE feature derived from the EEG signal is an accurate and stable classification feature. In addition, Song et al. proposed the Dynamic Graph Convolutional Neural Network (DGCNN) for EEG emotion classification (Rouast et al., 2019). All of these deep models yielded better performance than shallow models. ...
Article
Full-text available
Studying brain activity and deciphering the information in electroencephalogram (EEG) signals has become an emerging research field, and substantial advances have been made in the EEG-based classification of emotions. However, using different EEG features and complementarity to discriminate other emotions is still challenging. Most existing models extract a single temporal feature from the EEG signal while ignoring the crucial temporal dynamic information, which, to a certain extent, constrains the classification capability of the model. To address this issue, we propose an Attention-Based Depthwise Parameterized Convolutional Gated Recurrent Unit (AB-DPCGRU) model and validate it with the mixed experiment on the SEED and SEED-IV datasets. The experimental outcomes revealed that the accuracy of the model outperforms the existing state-of-the-art methods, which confirmed the superiority of our approach over currently popular emotion recognition models.
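For illustration, a compact sketch of a depthwise-convolution + GRU + attention pipeline for EEG windows; channel counts, kernel sizes, and the attention pooling are assumptions rather than the AB-DPCGRU architecture.

```python
import torch
import torch.nn as nn

class DepthwiseConvGRU(nn.Module):
    """Rough sketch of a depthwise-convolution + GRU + attention pipeline for EEG
    windows; dimensions and layout are assumptions, not the AB-DPCGRU model."""
    def __init__(self, n_channels=62, hidden=64, n_classes=3):
        super().__init__()
        # depthwise temporal convolution: one filter per EEG channel
        self.depthwise = nn.Conv1d(n_channels, n_channels, kernel_size=7,
                                   padding=3, groups=n_channels)
        self.gru = nn.GRU(n_channels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                       # x: (batch, n_channels, time)
        h, _ = self.gru(self.depthwise(x).transpose(1, 2))   # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)               # attention over time steps
        return self.head((w * h).sum(dim=1))

logits = DepthwiseConvGRU()(torch.randn(4, 62, 200))
```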
... It focuses on creating algorithmic technologies capable of sensing, interpreting, and reacting to human emotions. Pioneered by MIT's Rosalind Picard (1995), AEI has expanded to include methods like facial expression recognition (Rouast et al. 2019), body motion analysis (Noroozi et al. 2018), natural language analysis (Yadollahi et al. 2017), and electroencephalography (Bhatti et al. 2016). ...
Conference Paper
Full-text available
As global loneliness intensifies alongside rapid AI advancements, artificial emotional intelligence (AEI) presents itself as a paradoxical solution. This study examines the rising trend of AEI personification: the ascription of inherently human attributes, like empathy, consciousness, and morality, to AEI agents such as companion chatbots and sex robots. Drawing from Leavitt's socio-technical systems framework and a critical literature review, we recast "artificial empathy" as emerging from the intricate relationship between people, technology, tasks, and structures, rather than a quality of AEI itself. Our research uncovers a (de)humanisation paradox: by humanising AI agents, we may inadvertently dehumanise ourselves, leading to an ontological blurring in human-AI interactions. This paradox reshapes conventional understanding of human essence in the digital era, sparking discussions about ethical issues tied to personhood, consent, and objectification, and unveiling new avenues for exploring the legal, socioeconomic, and ontological facets of human-AI relations.
Chapter
Full-text available
The integration of artificial intelligence (AI) tools in higher education has opened new avenues for personalized learning. This chapter examines the role of ChatGPT, an advanced AI-powered conversational agent, in facilitating tailored educational experiences for students with diverse learning needs. Through an exploration of the implementation of ChatGPT in higher education settings, this chapter sheds light on the potential of AI to enhance inclusivity and address challenges faced by students. It highlights the various ways in which ChatGPT can be utilized to customize learning paths, provide individualized support, and promote a more accessible and inclusive educational environment. The chapter also delves into the ethical considerations and findings of recent studies associated with the use of AI in personalized learning for students with various educational needs, emphasizing the importance of fostering an inclusive and supportive educational ecosystem. Overall, the findings underscore the impact of AI-driven personalized learning and its capacity to empower students in higher education. Based on recent studies, this chapter discusses the implications of generative AI language model in higher education. It explores and assesses how ChatGPT can assist teachers and students in providing support in teaching and learning based on specific prompts, resulting in a personalized and adaptive pathway in higher education.
Article
Full-text available
Emotional health significantly impacts physical and psychological well-being, with emotional imbalances and cognitive disorders leading to various health issues. Timely diagnosis of mental illnesses is crucial for preventing severe disorders and enhancing medical care quality. Physiological signals, such as Electrocardiograms (ECG) and Electroencephalograms (EEG), which reflect cardiac and neuronal activities, are reliable for emotion recognition as they are less susceptible to manipulation than physical signals. Galvanic Skin Response (GSR) is also closely linked to emotional states. Researchers have developed various methods for classifying signals to detect emotions. However, these signals are susceptible to noise and are inherently non-stationary, meaning they change constantly over time. Consequently, emotions can vary rapidly. Traditional techniques for analyzing physiological signals may not be adequate to study the dynamic changes in emotional states. This research introduces a deep learning approach using a combination of advanced signal processing and machine learning to analyze physiological signals for emotion recognition. We propose a CNN-Vision Transformer (CVT) based method with ensemble classification. The process involves decomposing signals into segments, removing noise, and extracting features using 1D CNN and Vision Transformers. These features are integrated into a single vector for classification by an ensemble of LSTM, ELM, and SVM classifiers. The outputs are then synthesized using Model Agnostic Meta Learning (MAML) to improve prediction accuracy. Validated on AMIGOS and DEAP datasets with 10-fold cross-validation, our method achieved accuracies up to 98.2%, sensitivity of 99.15%, and specificity of 99.53%, outperforming existing emotion charting techniques. This novel method provides significant improvements of 3 to 4% in the accuracy, sensitivity, and specificity of emotion detection, leveraging physiological signals for comprehensive emotional assessments.
Chapter
Facial expression-based automatic emotion recognition is an intriguing field of study that has been presented and used in a variety of contexts, including human-machine interfaces, safety, and health. In order to improve computer predictions, researchers in this field are interested in creating methods for interpreting, coding, and extracting facial expressions. Deep learning has been incredibly successful, and as a result, its various architectures are being used to improve performance. This paper aims to investigate recent advances in deep learning-based automatic facial emotion recognition (FER). We highlight the contributions addressed, the architecture, and the databases employed. We also demonstrate the advancement by contrasting the suggested approaches with the outcomes attained. This paper aims to assist and direct researchers by reviewing current literature and offering perspectives to advance this field.
Article
Emotion recognition is increasingly important in areas ranging from human-computer interaction to mental health assessment. This paper introduces an innovative approach called the Cumulative Attribute-Weighted Graph Neural Network (CA-GNN) for trimodal emotion recognition. The method combines textual, audio, and video data from the IEMOCAP and CMUMOSI datasets. For text analysis, our approach uses speaker embedding in Long Short-Term Memory (LSTM) networks, Deep Neural Networks (DNN) for audio processing, and Convolutional Neural Networks (CNN) for video processing. The CA-GNN model employs a weighted graph structure, which enables effective integration of these modalities, highlighting how different emotional cues are interconnected. The paper's significant contribution is demonstrated through its experimental results. On the CMUMOSI dataset, our novel algorithm achieved an accuracy of 94%, with precision, recall, and F1-scores above 0.92 for Negative, Neutral, and Positive emotion categories. On the IEMOCAP dataset, the algorithm showed robust performance with an overall accuracy of 93%, and particularly high precision and recall in the Neutral and Positive categories. These results represent a significant advancement over current state-of-the-art models and show the potential of our approach in improving emotion recognition by synergistically using trimodal data. The comprehensive analysis and substantial results of this study not only prove the effectiveness of the proposed CA-GNN system in recognizing nuanced emotional states but also pave the way for future progress in affective computing. It emphasizes the importance of integrating multimodal data for increased accuracy and robustness in emotion recognition.
Article
To develop reliable, valid, and efficient measures of obsessive-compulsive disorder (OCD) severity, comorbid depression severity, and total electrical energy delivered (TEED) by deep brain stimulation (DBS), we trained and compared random forests regression models in a clinical trial of participants receiving DBS for refractory OCD. Six participants were recorded during open-ended interviews at pre- and post-surgery baselines and then at 3-month intervals following DBS activation. Ground-truth severity was assessed by clinical interview and self-report. Visual and auditory modalities included facial action units, head and facial landmarks, speech behavior and content, and voice acoustics. Mixed-effects random forest regression with Shapley feature reduction strongly predicted severity of OCD, comorbid depression, and total electrical energy delivered by the DBS electrodes (intraclass correlation, ICC = 0.83, 0.87, and 0.81, respectively). When random effects were omitted from the regression, predictive power decreased to moderate for severity of OCD and comorbid depression and remained comparable for total electrical energy delivered (ICC = 0.60, 0.68, and 0.83, respectively). Multimodal measures of behavior outperformed ones from single modalities. Feature selection achieved large decreases in features and corresponding increases in prediction. The approach could contribute to closed-loop DBS that would automatically titrate DBS based on affect measures.
Article
Full-text available
In recent years, emotion recognition has received significant attention, presenting a plethora of opportunities for application in diverse fields such as human–computer interaction, psychology, and neuroscience, to name a few. Although unimodal emotion recognition methods offer certain benefits, they have limited ability to encompass the full spectrum of human emotional expression. In contrast, Multimodal Emotion Recognition (MER) delivers a more holistic and detailed insight into an individual's emotional state. However, existing multimodal data collection approaches utilizing contact-based devices hinder the effective deployment of this technology. We address this issue by examining the potential of contactless data collection techniques for MER. In our tertiary review study, we highlight the unaddressed gaps in the existing body of literature on MER. Through our rigorous analysis of MER studies, we identify the modalities, specific cues, open datasets with contactless cues, and unique modality combinations. This further leads us to the formulation of a comparative schema for mapping the MER requirements of a given scenario to a specific modality combination. Subsequently, we discuss the implementation of Contactless Multimodal Emotion Recognition (CMER) systems in diverse use cases with the help of the comparative schema which serves as an evaluation blueprint. Furthermore, this paper also explores ethical and privacy considerations concerning the employment of contactless MER and proposes the key principles for addressing ethical and privacy concerns. The paper further investigates the current challenges and future prospects in the field, offering recommendations for future research and development in CMER. Our study serves as a resource for researchers and practitioners in the field of emotion recognition, as well as those intrigued by the broader outcomes of this rapidly progressing technology.
Conference Paper
The field of Artificial Intelligence (AI) has a significant impact on the way computers and humans interact. The topic of (facial) emotion recognition has gained a lot of attention in recent years. The majority of the research literature focuses on improving algorithms and Machine Learning (ML) models for single datasets. Despite the impressive results achieved, the impact of (training) data quality, with its potential biases and annotation discrepancies, is often neglected. This paper therefore demonstrates an approach to detect and evaluate annotation label discrepancies between three separate (facial) emotion recognition databases by Transfer Testing with three ML architectures. The findings indicate that Transfer Testing is a promising new method for detecting inconsistencies in data annotations of emotional states, implying label bias and/or ambiguity, and for verifying the transferability of trained ML models. Such research is the foundation for developing more accurate AI-based emotion recognition systems that are also robust in real-life scenarios.
Article
Full-text available
Facial expression recognition (FER) has been extensively studied in various applications over the past few years. However, in real facial expression datasets, labels can become noisy due to the ambiguity of expressions, the similarity between classes, and the subjectivity of annotators. These noisy labels negatively affect FER and significantly reduce classification performance. In previous methods, overfitting can occur as the noise ratio increases. To solve this problem, we propose the split and merge consistency regularization (SMEC) method, which is robust to noisy labels because it examines various regions of facial expression images rather than a single part, without losing their meaning. We split facial expression images into two images and input them into the backbone network to extract class activation maps (CAMs). The two CAMs are merged, and robustness to noisy labels is improved by regularizing the consistency between the CAM of the original image and the merged CAM. The proposed SMEC method aims to improve FER performance and robustness against highly noisy labels by preventing the model from focusing on only a single part without losing the semantics of the facial expression images. The SMEC method demonstrates robust performance over state-of-the-art noisy-label FER models on an unbalanced facial expression dataset, the Real-world Affective Faces Database (RAF-DB), in terms of class-wise accuracy for clean and noisy labels, even at severe noise rates of 40% to 60%.
Article
Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware multimodal fusion approach that quantifies modality-wise aleatoric or data uncertainty towards emotion prediction. We propose a novel fusion framework, in which latent distributions over unimodal temporal context are learned by constraining their variance. These variance constraints, Calibration and Ordinal Ranking, are designed such that the variance estimated for a modality can represent how informative the temporal context of that modality is w.r.t. emotion recognition. When well-calibrated, modality-wise uncertainty scores indicate how much their corresponding predictions are likely to differ from the ground truth labels. Well-ranked uncertainty scores allow the ordinal ranking of different frames across different modalities. To jointly impose both these constraints, we propose a softmax distributional matching loss. Our evaluation on AVEC 2019 CES, CMU-MOSEI, and IEMOCAP datasets shows that the proposed multimodal fusion method not only improves the generalisation performance of emotion recognition models and their predictive uncertainty estimates, but also makes the models robust to novel noise patterns encountered at test time.
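To make the notion of modality-wise aleatoric (data) uncertainty concrete, here is a minimal Python sketch in which a unimodal head predicts a mean and a log-variance and is trained with a Gaussian negative log-likelihood. This is a generic illustration only; the paper's calibration and ordinal-ranking constraints and its softmax distributional matching loss are not implemented, and all sizes are assumptions.

```python
# Sketch of modality-wise aleatoric uncertainty: predict mean and log-variance
# for the emotion target and train with a Gaussian negative log-likelihood.
import torch
import torch.nn as nn

class UncertainHead(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, 1)       # predicted emotion value (e.g. valence)
        self.log_var = nn.Linear(in_dim, 1)  # predicted data (aleatoric) variance

    def forward(self, h):
        return self.mu(h), self.log_var(h)

def gaussian_nll(mu, log_var, y):
    return 0.5 * (torch.exp(-log_var) * (y - mu) ** 2 + log_var).mean()

head = UncertainHead(in_dim=64)
h = torch.randn(8, 64)                       # unimodal temporal-context features
y = torch.randn(8, 1)
mu, log_var = head(h)
loss = gaussian_nll(mu, log_var, y)
loss.backward()
print(float(loss), log_var.exp().mean().item())  # larger variance = less informative modality
```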
Conference Paper
Full-text available
Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge 2017. EmotiW aims at providing a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings, where 'in the wild' describes the varied environments represented in the images and videos, which reflect real-world (not lab-like) scenarios. The baseline, data, and protocol of the two challenges, as well as the challenge participation, are discussed in detail in this paper.
Conference Paper
Full-text available
This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of the group emotion categories: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all faces for training. In particular, we utilize a large-margin softmax loss for discriminative learning, and we train two CNNs on both aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%.
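The final prediction step described above is simple score averaging; a minimal Python sketch is given below, assuming the face-level and image-level softmax scores have already been computed by the respective CNNs.

```python
# Sketch of the final fusion step: average the class scores of all detected
# face-level CNNs and the global image-based CNN, then take the arg-max.
import numpy as np

def group_emotion(face_scores, image_scores):
    """face_scores: (n_faces, 3), image_scores: (3,) for positive/neutral/negative."""
    all_scores = np.vstack([face_scores, image_scores[None, :]])
    return int(np.argmax(all_scores.mean(axis=0)))

faces = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])   # per-face softmax outputs
image = np.array([0.5, 0.4, 0.1])                      # whole-image softmax output
print(group_emotion(faces, image))                     # 0 -> "positive"
```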
Article
Full-text available
Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for understanding and comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then co-cluster hidden state units and words based on the expected response and visualize co-clustering results as memory chips and word clouds to provide more structured knowledge on RNNs' hidden states. We also propose a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN's hidden state at the sentence-level. The usability and effectiveness of our method are demonstrated through case studies and reviews from domain experts.
Conference Paper
Full-text available
Automatic emotion recognition is a challenging task which can make great impact on improving natural human computer interactions. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous emotion prediction on three affective dimensions: Arousal, Valence and Likability based on the audiovisual signals. We highlight three aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available modalities including acoustic, visual, and textual modalities, and we further consider the interlocutor influence for the acoustic features; 2) we compare the effectiveness of non-temporal model SVR and temporal model LSTM-RNN and show that the LSTM-RNN can not only alleviate the feature engineering efforts such as construction of contextual features and feature delay, but also improve the recognition performance significantly; 3) we apply multi-task learning strategy for collaborative prediction of multiple emotion dimensions with shared representations according to the fact that different emotion dimensions are correlated with each other. Our solutions achieve the CCC of 0.675, 0.756 and 0.509 on arousal, valence, and likability respectively on the challenge testing set, which outperforms the baseline system with corresponding CCC of 0.375, 0.466, and 0.246 on arousal, valence, and likability.
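Results here (and in several entries below) are reported as the concordance correlation coefficient (CCC). For reference, a short Python implementation of the standard formula is sketched below; the example signals are synthetic.

```python
# Concordance correlation coefficient (CCC), the metric used for the AVEC
# continuous emotion results quoted above: 2*cov / (var_x + var_y + (mean_x - mean_y)^2).
import numpy as np

def ccc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

gold = np.sin(np.linspace(0, 6, 200))   # e.g. a gold-standard arousal trace
pred = 0.8 * gold + 0.1                 # a prediction with scale and offset error
print(round(ccc(gold, pred), 3))
```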
Article
Full-text available
Dimensional affect recognition is a challenging topic, and current techniques do not yet provide the accuracy necessary for HCI applications. In this work we propose two new methods. The first is a novel self-organizing model that learns from the similarity between features and affects. This method produces a graphical representation of the multidimensional data which may assist expert analysis. The second method uses extreme learning machines, an emerging artificial neural network model. Aiming for minimum intrusiveness, we use only heart rate variability, which can be recorded using a small set of sensors. The methods were validated with two datasets. The first is composed of 16 sessions with different participants and was used to evaluate the models in a classification task. The second was the publicly available Remote Collaborative and Affective Interaction (RECOLA) dataset, which was used for dimensional affect estimation. The performance evaluation used the kappa score, unweighted average recall, and the concordance correlation coefficient. The concordance coefficient on the RECOLA test partition was 0.421 in arousal and 0.321 in valence. Results show that our models outperform state-of-the-art models on the same data and provide new ways to analyze affective states.
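An extreme learning machine is a single-hidden-layer network with random, fixed hidden weights and output weights fitted in closed form. The Python sketch below illustrates that idea only; the heart-rate-variability features and the authors' exact configuration are not reproduced.

```python
# Minimal extreme learning machine (ELM): random fixed hidden weights,
# output weights solved by least squares. Illustrative data only.
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.normal(size=(d, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                    # random hidden layer
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # closed-form output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))            # e.g. heart-rate-variability features
y = X[:, 0] * 0.5 - X[:, 1] ** 2         # synthetic arousal target
model = ELM().fit(X[:200], y[:200])
print(model.predict(X[200:]).shape)
```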
Conference Paper
Full-text available
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data.
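The fusion step can be sketched as concatenating the three branch representations and passing them through a small fully connected network. The Python sketch below is only an illustration of that pattern; the branch feature sizes and layer widths are assumptions, not the authors' configuration.

```python
# Sketch of a late-fusion network: concatenate features from the 2D-CNN
# (static face), 3D-CNN (motion) and audio-LSTM branches, then classify.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, dims=(512, 256, 128), n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, f_2d, f_3d, f_audio):
        return self.net(torch.cat([f_2d, f_3d, f_audio], dim=1))

net = FusionNet()
logits = net(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)  # (4, 7) emotion class scores
```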
Article
Full-text available
Recent advancements in human–computer interaction research have led to the possibility of emotional communication via brain–computer interface systems for patients with neuropsychiatric disorders or disabilities. In this study, we efficiently recognize emotional states by analyzing the features of electroencephalography (EEG) signals, which are generated from EEG sensors that non-invasively measure the electrical activity of neurons inside the human brain, and we select the optimal combination of these features for recognition. The scalp EEG data of 21 healthy subjects (12–14 years old) were recorded using a 14-channel EEG machine while the subjects watched images with four types of emotional stimuli (happy, calm, sad, or scared). After preprocessing, the Hjorth parameters (activity, mobility, and complexity) were used to measure the signal activity of the time series data. We selected the optimal EEG features using a balanced one-way ANOVA after calculating the Hjorth parameters for different frequency ranges. Features selected by this statistical method outperformed univariate and multivariate features. The optimal features were further processed for emotion classification using support vector machine (SVM), k-nearest neighbor (KNN), linear discriminant analysis (LDA), Naive Bayes, Random Forest, deep learning, and four ensemble methods (bagging, boosting, stacking, and voting). The results show that the proposed method substantially improves the emotion recognition rate with respect to the commonly used spectral power band method.
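The Hjorth parameters mentioned above have simple closed forms: activity is the signal variance, mobility is the square root of the ratio of the variance of the first difference to the signal variance, and complexity is the ratio of the mobility of the first difference to the mobility of the signal. A short Python sketch (with a synthetic EEG-like signal and an assumed sampling rate) follows.

```python
# Hjorth parameters of a 1-D EEG channel: activity, mobility, complexity.
import numpy as np

def hjorth(x):
    x = np.asarray(x, float)
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = x.var()
    mobility = np.sqrt(dx.var() / x.var())
    complexity = np.sqrt(ddx.var() / dx.var()) / mobility
    return activity, mobility, complexity

fs = 128                                  # assumed sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print([round(v, 3) for v in hjorth(eeg)])
```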
Article
Full-text available
Many paralinguistic tasks are closely related and thus representations learned in one domain can be leveraged for another. In this paper, we investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Further, we extend this problem to cross-dataset tasks, asking how knowledge captured in one emotion dataset can be transferred to another. We focus on progressive neural networks and compare these networks to the conventional deep learning method of pre-training and fine-tuning. Progressive neural networks provide a way to transfer knowledge and avoid the forgetting effect present when pre-training neural networks on different tasks. Our experiments demonstrate that: (1) emotion recognition can benefit from using representations originally learned for different paralinguistic tasks and (2) transfer learning can effectively leverage additional datasets to improve the performance of emotion recognition systems.
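The core of a progressive network is a frozen source column whose hidden activations feed the new target column through lateral connections, so previously learned representations are reused without being overwritten. The Python sketch below shows that pattern in its smallest form; the layer sizes and the single lateral connection are illustrative assumptions, not the authors' architecture.

```python
# Minimal progressive-network sketch: a frozen source column (e.g. trained for
# speaker recognition) and a new emotion column with one lateral connection.
import torch
import torch.nn as nn

class ProgressiveNet(nn.Module):
    def __init__(self, in_dim=40, hidden=64, n_emotions=4):
        super().__init__()
        self.src_hidden = nn.Linear(in_dim, hidden)     # pretrained column, kept frozen
        for p in self.src_hidden.parameters():
            p.requires_grad = False
        self.tgt_hidden = nn.Linear(in_dim, hidden)     # new emotion column
        self.lateral = nn.Linear(hidden, hidden)        # adapter from the source column
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, x):
        h_src = torch.relu(self.src_hidden(x))
        h_tgt = torch.relu(self.tgt_hidden(x) + self.lateral(h_src))
        return self.out(h_tgt)

net = ProgressiveNet()
print(net(torch.randn(8, 40)).shape)  # (8, 4)
```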
Article
Full-text available
Facial expressions play a significant role in human communication and behavior. Psychologists have long studied the relationship between facial expressions and emotions. Paul Ekman et al. devised the Facial Action Coding System (FACS) to taxonomize human facial expressions and model their behavior. The ability to recognize facial expressions automatically enables novel applications in fields like human-computer interaction, social gaming, and psychological research. There has been tremendously active research in this field, with several recent papers utilizing convolutional neural networks (CNN) for feature extraction and inference. In this paper, we employ CNN understanding methods to study the relation between the features these computational networks use, the FACS, and Action Units (AU). We verify our findings on the Extended Cohn-Kanade (CK+), NovaEmotions, and FER2013 datasets. We apply these models to various tasks and tests using transfer learning, including cross-dataset validation and cross-task performance. Finally, we exploit the nature of the FER-based CNN models for the detection of micro-expressions and achieve state-of-the-art accuracy using a simple long short-term memory (LSTM) recurrent neural network (RNN).
Article
Full-text available
Automatic affect recognition is a challenging task due to the various modalities with which emotions can be expressed. Applications can be found in many domains, including multimedia retrieval and human-computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content for various styles of speaking, robust features need to be extracted. To this purpose, we utilize a Convolutional Neural Network (CNN) to extract features from the speech, while for the visual modality we use a deep residual network (ResNet) of 50 layers. In addition to the importance of feature extraction, the machine learning algorithm also needs to be insensitive to outliers while being able to model the context. To tackle this problem, Long Short-Term Memory (LSTM) networks are utilized. The system is then trained in an end-to-end fashion where, by also taking advantage of the correlations of each of the streams, we manage to significantly outperform traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.
Article
Full-text available
Facial expression recognition (FER) is increasingly gaining importance in various emerging affective computing applications. In practice, achieving accurate FER is challenging due to the large amount of inter-personal variations such as expression intensity variations. In this paper, we propose a new spatio-temporal feature representation learning for FER that is robust to expression intensity variations. The proposed method utilizes representative expression-states (e.g., onset, apex and offset of expressions) which can be specified in facial sequences regardless of the expression intensity. The characteristics of facial expressions are encoded in two parts in this paper. As the first part, spatial image characteristics of the representative expression-state frames are learned via a convolutional neural network. Five objective terms are proposed to improve the expression class separability of the spatial feature representation. In the second part, temporal characteristics of the spatial feature representation in the first part are learned with a long short-term memory of the facial expression. Comprehensive experiments have been conducted on a deliberate expression dataset (MMI) and a spontaneous micro-expression dataset (CASME II). Experimental results showed that the proposed method achieved higher recognition rates in both datasets compared to the state-of-the-art methods.
Conference Paper
Full-text available
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
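The attention-based pooling described above weights frame-level features by a learned saliency score and sums them into a single utterance-level vector. A minimal Python sketch of that pooling step is given below; the feature dimension and sequence length are illustrative, and this is not the authors' full recurrent model.

```python
# Sketch of local attention pooling over time: frame-level features are
# weighted by a learned attention score and summed into one utterance vector,
# so emotionally salient frames contribute more.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                            # frames: (batch, time, dim)
        alpha = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1) weights
        return (alpha * frames).sum(dim=1)                # (batch, dim) utterance vector

pool = AttentionPool()
utt = pool(torch.randn(2, 300, 128))                      # 300 frames per utterance
print(utt.shape)  # (2, 128)
```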
Article
Full-text available
Pain is an unpleasant feeling that has been shown to be an important factor in the recovery of patients. Since assessing it is costly in human resources and difficult to do objectively, there is a need for automatic systems to measure it. In this paper, contrary to current state-of-the-art techniques in pain assessment, which are based on facial features only, we suggest that performance can be enhanced by feeding the raw frames to deep learning models, outperforming the latest state-of-the-art results while also directly facing the problem of imbalanced data. As a baseline, our approach first uses convolutional neural networks (CNNs) to learn facial features from VGG_Faces, which are then linked to a long short-term memory network to exploit the temporal relation between video frames. We further compare the performance of the popular schema based on the canonically normalized appearance versus taking the whole image into account. As a result, we outperform the current state-of-the-art area-under-the-curve performance on the UNBC-McMaster Shoulder Pain Expression Archive Database. In addition, to evaluate the generalization properties of our proposed methodology on facial motion recognition, we also report competitive results on the Cohn-Kanade+ facial expression database.
Book
According to Rosalind Picard, if we want computers to be genuinely intelligent and to interact naturally with us, we must give computers the ability to recognize, understand, even to have and express emotions. The latest scientific findings indicate that emotions play an essential role in decision making, perception, learning, and more; that is, they influence the very mechanisms of rational thinking. Not only too much, but also too little emotion can impair decision making. Part 1 of this book provides the intellectual framework for affective computing. It includes background on human emotions, requirements for emotionally intelligent computers, applications of affective computing, and moral and social questions raised by the technology. Part 2 discusses the design and construction of affective computers. Although this material is more technical than that in Part 1, the author has kept it less technical than typical scientific publications in order to make it accessible to newcomers. Topics in Part 2 include signal-based representations of emotions, human affect recognition as a pattern recognition and learning problem, recent and ongoing efforts to build models of emotion for synthesizing emotions in computers, and the new application area of affective wearable computers.
Chapter
State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.
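The transfer recipe studied above reduces, at deployment time, to taking a network pretrained on a large source task and replacing its classification head for the target task. The Python sketch below illustrates that recipe with an ImageNet-pretrained torchvision ResNet-50 standing in for the hashtag-pretrained models of the paper; the target class count is an assumption.

```python
# Sketch of the generic transfer-learning recipe: start from a pretrained
# backbone, freeze it (optionally), and replace the head for the target task.
import torch
import torch.nn as nn
from torchvision import models

n_target_classes = 7
resnet = models.resnet50(weights="IMAGENET1K_V1")   # torchvision >= 0.13; downloads weights on first use
for p in resnet.parameters():                       # optionally freeze the backbone
    p.requires_grad = False
resnet.fc = nn.Linear(resnet.fc.in_features, n_target_classes)  # new trainable head

logits = resnet(torch.randn(2, 3, 224, 224))
print(logits.shape)  # (2, 7)
```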
Conference Paper
State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved predication accuracy, existing top approaches provide supervision only at the output feature layer, resulting in the insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to deal with emotion recognition problem. Benefiting from adding supervision not only to deep layers but also to intermediate layers and shallow layers, the training of deep CNNs can be well eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show our proposed learning method brings large accuracy gains over diverse backbone networks consistently. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelop over all existing records.
Article
Regularization is one of the crucial ingredients of deep learning, yet the term regularization has various definitions, and regularization methods are often studied separately from each other. In our work we present a systematic, unifying taxonomy to categorize existing methods. We distinguish methods that affect data, network architectures, error terms, regularization terms, and optimization procedures. We do not provide all details about the listed methods; instead, we present an overview of how the methods can be sorted into meaningful categories and sub-categories. This helps reveal links and fundamental similarities between them. Finally, we include practical recommendations both for users and for developers of new regularization methods.
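As a toy illustration of the taxonomy's categories, the Python sketch below pairs several of them with one common method each (data-level noise augmentation, an architecture-level dropout layer, a weight-decay regularization term, and a standard error term); early stopping, an optimization-level method, would simply wrap the training loop. This is an example chosen by us, not the paper's own code.

```python
# Toy example: one common regularization method from several of the categories
# named in the taxonomy above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(0.5),                 # architecture-level regularizer
                      nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                      weight_decay=1e-4)               # L2 regularization term

x = torch.randn(32, 20)
x = x + 0.05 * torch.randn_like(x)                     # data-level: noise augmentation
y = torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)        # error term
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```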
Article
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks, in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros that seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabelled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labelled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and at closing the performance gap between neural networks learnt with and without unsupervised pre-training.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Automated affective computing in the wild setting is a challenging problem in computer vision. Existing annotated databases of facial expressions in the wild are small and mostly cover discrete emotions (the categorical model). There are very limited annotated facial databases for affective computing in the continuous dimensional model (e.g., valence and arousal). To meet this need, we collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet). AffectNet contains more than 1,000,000 facial images collected from the Internet by querying three major search engines using 1250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated for the presence of seven discrete facial expressions and the intensity of valence and arousal. AffectNet is by far the largest database of facial expression, valence, and arousal in the wild, enabling research on automated facial expression recognition in two different emotion models. Two baseline deep neural networks are used to classify images in the categorical model and predict the intensity of valence and arousal. Various evaluation metrics show that our deep neural network baselines can perform better than conventional machine learning methods and off-the-shelf facial expression recognition systems.
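A database annotated with both discrete expressions and continuous valence/arousal naturally supports a two-headed model: a shared backbone with a softmax head for the categorical model and a regression head for the dimensional model. The Python sketch below illustrates that structure only; it is not the AffectNet authors' baseline networks, and the tiny backbone and input size are assumptions.

```python
# Illustrative two-headed FER model: one head for discrete expression classes,
# one regression head for valence/arousal in [-1, 1].
import torch
import torch.nn as nn

class TwoHeadFER(nn.Module):
    def __init__(self, feat_dim=512, n_expressions=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.expr_head = nn.Linear(feat_dim, n_expressions)   # categorical model
        self.va_head = nn.Linear(feat_dim, 2)                 # dimensional model (valence, arousal)

    def forward(self, img):
        h = self.backbone(img)
        return self.expr_head(h), torch.tanh(self.va_head(h))

net = TwoHeadFER()
expr_logits, va = net(torch.randn(4, 3, 64, 64))
print(expr_logits.shape, va.shape)  # (4, 7) and (4, 2)
```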
Article
Emotion recognition is challenging due to the emotional gap between emotions and audio-visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio-visual segment features with Convolutional Neural Networks (CNN) and 3D-CNN, then fuses the audio-visual segment features in a Deep Belief Network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined into a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio-visual segment feature representation. After average-pooling the segment features learned by the DBN to form a fixed-length global video feature, a linear Support Vector Machine (SVM) is used for video emotion classification. Experimental results on three public audio-visual emotional databases, including the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN and DBN for audio-visual emotion recognition.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
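The Transformer's core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. A single-head Python sketch of that formula follows; multi-head projection and masking are omitted.

```python
# Scaled dot-product attention for a single head, as described above.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # attention-weighted values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 16)
```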
Article
This paper presents a novel and efficient Deep Fusion Convolutional Neural Network (DF-CNN) for multi-modal 2D+3D Facial Expression Recognition (FER). DF-CNN comprises a feature extraction subnet, a feature fusion subnet, and a softmax layer. In particular, each textured 3D face scan is represented as six types of 2D facial attribute maps (i.e., geometry map, three normal maps, curvature map, and texture map), all of which are jointly fed into DF-CNN for feature learning and fusion learning, resulting in a highly concentrated facial representation (32-dimensional). Expression prediction is performed in two ways: 1) learning linear SVM classifiers using the 32-dimensional fused deep features; 2) directly performing softmax prediction using the 6-dimensional expression probability vectors. Different from existing 3D FER methods, DF-CNN combines feature learning and fusion learning into a single end-to-end training framework. To demonstrate the effectiveness of DF-CNN, we conducted comprehensive experiments to compare the performance of DF-CNN with handcrafted features, pre-trained deep features, fine-tuned deep features, and state-of-the-art methods on three 3D face datasets (i.e., BU-3DFE Subset I, BU-3DFE Subset II, and Bosphorus Subset). In all cases, DF-CNN consistently achieved the best results. To the best of our knowledge, this is the first work introducing deep CNNs to 3D FER and deep learning based feature-level fusion for multi-modal 2D+3D FER.
Article
We present a new action recognition deep neural network which adaptively learns the best action velocities in addition to the classification. While deep neural networks have reached maturity for image understanding tasks, we are still exploring network topologies and features to handle the richer environment of video clips. Here, we tackle the problem of multiple velocities in action recognition, and provide state-of-the-art results for facial expression recognition, on known and new collected datasets. We further provide the training steps for our semi-supervised network, suited to learn from huge unlabeled datasets with only a fraction of labeled examples.
Article
Emotion analysis is a crucial problem for endowing artificial machines with real intelligence in many large potential applications. As external manifestations of human emotions, electroencephalogram (EEG) signals and video face signals are widely used to track and analyze human affective information. According to their common characteristics of spatial-temporal volumes, in this paper we propose a novel deep learning framework named spatial-temporal recurrent neural network (STRNN) to unify the learning of the two different signal sources into a spatial-temporal dependency model. In STRNN, to capture those spatially co-occurrent variations of human emotions, a multi-directional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the