Article

Deep Learning for Human Affect Recognition: Insights and New Developments

Authors: Philipp V. Rouast, Marc T. P. Adam, Raymond Chiong

Abstract

Automatic human affect recognition is a key step towards more natural human-computer interaction. Recent trends include recognition in the wild using a fusion of audiovisual and physiological sensors, a challenging setting for conventional machine learning algorithms. Since 2010, novel deep learning algorithms have been applied increasingly in this field. In this paper, we review the literature on human affect recognition between 2010 and 2017, with a special focus on approaches using deep neural networks. By classifying a total of 950 studies according to their usage of shallow or deep architectures, we are able to show a trend towards deep learning. Reviewing a subset of 233 studies that employ deep neural networks, we comprehensively quantify their applications in this field. We find that deep learning is used for learning of (i) spatial feature representations, (ii) temporal feature representations, and (iii) joint feature representations for multimodal sensor data. Exemplary state-of-the-art architectures illustrate the progress. Our findings show the role deep architectures will play in human affect recognition, and can serve as a reference point for researchers working on related applications.
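The three learning roles named in the abstract can be pictured with a minimal, hypothetical PyTorch sketch (layer sizes, modalities, and class count are illustrative assumptions, not architectures from the survey): a small CNN learns spatial features per video frame, a recurrent layer learns temporal features over the frame sequence, and a final fusion layer learns a joint representation together with an audio feature vector.

```python
# Minimal sketch (not from the paper) of the three representation-learning roles:
# (i) spatial, (ii) temporal, and (iii) joint multimodal features.
import torch
import torch.nn as nn

class SpatialTemporalJointNet(nn.Module):
    def __init__(self, n_classes=7, audio_dim=40):
        super().__init__()
        # (i) spatial features from individual video frames
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())            # -> 16-d per frame
        # (ii) temporal features over the frame sequence
        self.temporal = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
        # (iii) joint representation fusing video and audio features
        self.fusion = nn.Sequential(nn.Linear(32 + audio_dim, 64), nn.ReLU(),
                                    nn.Linear(64, n_classes))

    def forward(self, frames, audio_feats):
        # frames: (batch, time, 3, H, W); audio_feats: (batch, audio_dim)
        b, t = frames.shape[:2]
        per_frame = self.spatial(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(per_frame)                       # h: (1, batch, 32)
        joint = torch.cat([h.squeeze(0), audio_feats], dim=-1)
        return self.fusion(joint)

logits = SpatialTemporalJointNet()(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 40))
```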


... Refer to [Rouast et al., 2019], [Chang et al., 2006], [Rothkrantz, 2004], [Pantic and Bartlett, 2007] and [Kotsia and Pitas, 2006]. ...
... Although Deep Neural Networks (DNNs) have been extensively applied to audiovisual emotion recognition [Rouast et al., 2019, Noroozi et al., 2017, Schoneveld et al., 2021, Gerczuk et al., 2021], estimating modality-wise uncertainty for improved fusion performance is a relatively under-explored avenue. However, modelling predictive uncertainty (or confidence, its opposite) in DNNs has received widespread attention in recent years [Guo et al., 2017, Mukhoti et al., 2020], motivated by the observation that DNNs tend to make over-confident predictions [Szegedy et al., 2013b]. ...
... The reader is referred to [Poria et al., 2017] and [Rouast et al., 2019] for comprehensive surveys of affect recognition in multimodal settings and contemporary deep learning-specific advancements in it. Since the main focus in this chapter is on uncertainty-aware fusion models for emotion recognition, it reviews the literature closely related to the following key research topics: i) uncertainty modelling for emotion and expression recognition, ii) uncertainty-aware multimodal fusion, iii) calibrated uncertainty, and iv) ...
Article
The ability to recognise emotional expressions from non-verbal behaviour plays a key role in human-human interaction. Endowing machines with the same ability is critical to enriching human-computer interaction. Despite receiving widespread attention so far, human-level automatic recognition of affective expressions is still an elusive task for machines. Towards improving the current state of machine learning methods applied to affect recognition, this thesis identifies two challenges: label ambiguity and label scarcity. Firstly, this thesis notes that it is difficult to establish a clear one-to-one mapping between inputs (face images or speech segments) and their target emotion labels, considering that emotion perception is inherently subjective. As a result, the problem of label ambiguity naturally arises in the manual annotations of affect. Ignoring this fundamental problem, most existing affect recognition methods implicitly assume a one-to-one input-target mapping and use deterministic function learning. In contrast, this thesis proposes to learn non-deterministic functions based on uncertainty-aware probabilistic models, as they can naturally accommodate the one-to-many input-target mapping. Besides improving the affect recognition performance, the proposed uncertainty-aware models in this thesis demonstrate three important applications: adaptive multimodal affect fusion, human-in-the-loop learning of affect, and improved performance on downstream behavioural analysis tasks like personality traits estimation. Secondly, this thesis aims to address the challenge of scarcity of affect labelled datasets, caused by the cumbersome and time-consuming nature of the affect annotation process. To this end, this thesis notes that audio and visual feature encoders used in the existing models are label-inefficient i.e. learning them requires large amounts of labelled training data. As a solution, this thesis proposes to pre-train the feature encoders using unlabelled data to make them more label-efficient i.e. using as few labelled training examples as possible to achieve good emotion recognition performance. A novel self-supervised pre-training method is proposed in this thesis by posing hand-engineered emotion features as task-specific representation learning priors. By leveraging large amounts of unlabelled audiovisual data, the proposed self-supervised pre-training method demonstrates much better label efficiency compared to the commonly employed pre-training methods.
... Affective Computing is a subfield of Artificial Intelligence whose main research includes the detection and interpretation of human affect, on the one hand, [12], [13], and the simulation and representation of affect on the other hand in both the human users and the social interactive agents [14], see also [3] and [15] for overviews. In this paper, we focus on the use of LLMs for Affective Computing by testing their zero-shot capabilities in automatic affect detection, emotion representation, and the computational elicitation of emotions. ...
... 1) Automatic affect detection: Automatic detection and interpretation of behavioral signals of affect involves a broad range of tasks including affect recognition from the face, body, speech, EEG and other physiological signals, and text [12], [13], [16], [17]. As of writing this paper, text is the only modality available in the interaction with ChatGPT. ...
Preprint
Full-text available
Large language models, in particular generative pre-trained transformers (GPTs), show impressive results on a wide variety of language-related tasks. In this paper, we explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone. We show that ChatGPT a) performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, b) has meaningful emotion representations in terms of emotion categories and these affective dimensions, and c) can perform basic appraisal-based emotion elicitation of situations based on a prompt-based computational implementation of the OCC appraisal model. These findings are highly relevant: First, they show that the ability to solve complex affect processing tasks emerges from language-based token prediction trained on extensive data sets. Second, they show the potential of large language models for simulating, processing and analyzing human emotions, which has important implications for various applications such as sentiment analysis, socially interactive agents, and social robotics.
... It focuses on creating algorithmic technologies capable of sensing, interpreting, and reacting to human emotions. Pioneered by MIT's Rosalind Picard (1995), AEI has expanded to include methods like facial expression recognition (Rouast et al. 2019), body motion analysis (Noroozi et al. 2018), natural language analysis (Yadollahi et al. 2017), and electroencephalography (Bhatti et al. 2016). ...
Conference Paper
Full-text available
As global loneliness intensifies alongside rapid AI advancements, artificial emotional intelligence (AEI) presents itself as a paradoxical solution. This study examines the rising trend of AEI personification: the ascription of inherently human attributes, like empathy, consciousness, and morality, to AEI agents such as companion chatbots and sex robots. Drawing from Leavitt's socio-technical systems framework and a critical literature review, we recast "artificial empathy" as emerging from the intricate relationship between people, technology, tasks, and structures, rather than a quality of AEI itself. Our research uncovers a (de)humanisation paradox: by humanising AI agents, we may inadvertently dehumanise ourselves, leading to an ontological blurring in human-AI interactions. This paradox reshapes conventional understanding of human essence in the digital era, sparking discussions about ethical issues tied to personhood, consent, and objectification, and unveiling new avenues for exploring the legal, socioeconomic, and ontological facets of human-AI relations.
... ResNet networks have been developed based on the concept of residual learning [35][36] [37]. This technique is one of the popular techniques in the deep learning model developed by He et al. in 2016 [32]. ...
... Deep learning algorithms can accurately detect and track objects in video streams, even in complex and cluttered environments, and can detect and classify different types of objects, such as people, vehicles, and animals. Deep learning can be performed by various types of neural networks such as CNNs [39][40][41][42], DNNs [43,44], and RNNs [45,46], to recognize objects, track movements, and analyze behavior in video data [47]. To train on video data in a supervised learning task, it is necessary to pre-process and annotate the data. ...
Article
Full-text available
Smart cities are being developed worldwide with the use of technology to improve the quality of life of citizens and enhance their safety. Video surveillance is a key component of smart city infrastructure, as it involves the installation of cameras at strategic locations throughout the city for monitoring public spaces and providing real-time surveillance footage to law enforcement and other city representatives. Video surveillance systems have evolved rapidly in recent years, and are now integrated with advanced technologies like deep learning, blockchain, edge computing, and cloud computing. This study provides a comprehensive overview of video surveillance systems in smart cities, as well as the functions and challenges of those systems. The aim of this paper is to highlight the importance of video surveillance systems in smart cities and to provide insights into how they could be used to enhance safety, security, and the overall quality of life for citizens.
... The primary reasons for the increase in interest are the availability of big data and progress in artificial intelligence. Specifically, breakthroughs in deep learning have achieved remarkable results in social signal processing problems [24][25][26][27], as well as in learning analytics in classroom and other environments [28,29]. The literature can be categorized based on the following criteria: ...
Article
Full-text available
Student engagement is positively related to comprehension in the teaching–learning process. Student engagement is widely studied in online learning environments, whereas this research focuses on student engagement recognition in classroom environments using visual cues. To incorporate learning-centered affective states, we curated a dataset with six learning-centered affective states from four public datasets. A graph convolution network (GCN)-based deep learning model with attention was designed and implemented to extract more contributing features from input video for student engagement recognition. The proposed architecture was evaluated on the curated as well as four public datasets. An ablation study was conducted on the curated dataset; the best performing model with minority oversampling and focal cross-entropy loss achieved 65.35% accuracy. We also estimated student engagement in authentic classroom data, and it showed a positive correlation between students’ engagement levels and post-lesson test scores with a Pearson’s coefficient value of 0.64. The proposed method outperformed the existing state-of-the-art methods on two of the public datasets with accuracy scores of 99.20% and 56.17%, and it achieved accuracy scores of 64.92% and 56.17% on the other two public datasets, which are better than many baseline results on them.
... Wang et al. [17] embed a varied number of region features produced by a CNN into a compact fixed-length representation. Rouast et al. [18] found that deep neural networks are used for multimodal data feature learning, including spatial, temporal, and joint feature representations. However, the above approaches do not consider more complex emotion recognition tasks, such as compound and ambiguous emotions. ...
Article
Full-text available
Human emotion label prediction is crucial to Artificial Intelligence in the Internet of Things (IoT). Facial expression recognition is the main technique used to predict human emotion labels. Existing facial expression recognition methods do not consider compound emotions or the fuzziness of emotion labels. Fuzzy learning is a mathematical tool for dealing with fuzziness and uncertainty in information. The advantage of using fuzzy learning for human emotion recognition is that multiple fuzzy sentiment labels can be processed simultaneously. This paper proposes a fuzzy learning-based expression recognition method for human emotion label prediction. First, a fuzzy label distribution system is constructed using fuzzy sets for representing facial expressions. Then, two fuzzy label distribution prediction methods based on fuzzy rough sets are proposed to solve compound emotion prediction. The probability that a sample likely and definitely belongs to an emotion is obtained by calculating the upper and lower approximations. Experiments show the proposed algorithm not only performs well on human emotion label prediction but can also be used for other label distribution prediction tasks. The proposed method is more accurate and more general than other methods. The method's improvement in emotion recognition performance extends the application scope of artificial intelligence in the IoT.
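The upper and lower approximations mentioned in the abstract can be illustrated with the classical Dubois-Prade fuzzy-rough definitions; the similarity relation, memberships, and operators in this sketch are illustrative assumptions and may differ from the exact operators used in the paper.

```python
# Hedged numpy sketch of fuzzy-rough lower/upper approximations of an emotion class.
import numpy as np

def fuzzy_rough_approximations(R, A):
    """R[i, j]: fuzzy similarity between samples i and j; A[j]: fuzzy membership of
    sample j in an emotion class. Returns per-sample (lower, upper) memberships."""
    lower = np.min(np.maximum(1.0 - R, A[None, :]), axis=1)   # "definitely" belongs
    upper = np.max(np.minimum(R, A[None, :]), axis=1)         # "possibly" belongs
    return lower, upper

R = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])     # pairwise similarity of facial-expression samples
A = np.array([0.9, 0.7, 0.1])       # fuzzy membership in one emotion, e.g. happiness
print(fuzzy_rough_approximations(R, A))
```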
... The intelligence of a user assistance system describes the level of context-specific task fulfilment support provided by the system. The advances in AI support accelerate the development of more intelligent user assistance systems incorporating even emotional intelligence in recognising human affect and adapting corresponding responses via the user interface (Rouast et al., 2021). Figure 24 shows a conceptualisation of a user assistance system. ...
Thesis
Full-text available
The requirements for the design of information and assistance systems in labour-intensive processes are interdisciplinary and have not yet been sufficiently addressed in research. This dissertation analyses, evaluates and describes possibilities for increasing the effectiveness and efficiency of labour-intensive processes through design-optimised socio-technical systems. The work thus contributes to further developing information and assistance systems for industrial applications and use in healthcare. The central dimensions of people, activity, context and technology are the focus of the scientific investigations following the Design Science Research paradigm. Design principles derived from this, a corresponding taxonomy, and a conceptual reference model for the design of socio-technical systems are the results of this dissertation.
... The use of physical and physiological sensors [1]- [5] and interaction log data from student-computer interactions [6]- [9] has been proven effective in detecting affective states in the classroom. However, the transferability of affect detection models across different student populations is not always successful [10], [11]. ...
Conference Paper
Full-text available
Classroom observation has been used to obtain training labels for affect detection, but is expensive for large representative samples. Active Learning (AL) methods have been proposed to address this challenge by identifying the specific samples that should be labeled to improve detector performance, based on a metric of informativeness. While previous work has investigated the potential benefits of AL methods in affect detection, they have considered scenarios that may not completely reflect reality, where an observer can code any student and time window within the entire data set. Unfortunately, actual use of such a method can only take place in the current time window-classroom observers cannot time travel. This paper explores the potential benefit of AL methods in a scenario that more closely mimics the human coder's observation process in a real classroom-where the coder can only observe behavior occurring at the current moment. Our experimental results show that AL methods slightly improve the performance indicators of binary detectors for concentration, confusion, and frustration compared to control sampling methods. However, there is no benefit for boredom detection. These findings have implications for the use of active learning-based data collection protocols for developing affect detectors.
... There is much research in the areas of Affective Computing (AC) and Human-Computer Interaction (HCI), applied to Computer Vision [7], [8], [9], [10], [11], as well as to other areas of knowledge, such as Education, in support of Intelligent Learning [12], Health, with support for Neurophysiological Monitoring [13], and Sentiment Analysis for Natural Language Processing (NLP) [14], whose purpose is to develop cognitive, intelligent, and reliable emotion detection systems to distinguish and understand people's affect and thus provide sensitive and ready responses to users in a given area. Such systems are generally developed using Artificial Intelligence techniques and models, which are employed in various services in society today, such as the measurement of Facial Affect [15]; in the Legal area, through the analysis of lawyers' Affective Linguistic skills [16]; in Education [17]; and in Health, with the analysis of Psychobehavioral Behaviors [18]. ...
Article
Full-text available
Understanding emotions is one of the greatest capabilities of human beings, as it allows the understanding of facial expressions that facilitate the capture of important information about other individuals, which is used for the perception of mental or emotional states. Advances in Artificial Intelligence and Visual Computing, more specifically in Deep Learning with the advent of Artificial Neural Networks, have enhanced the ability of machines to infer human emotions through image analysis. This paper presents a Systematic Literature Review (SLR) with the purpose of researching, mapping and summarizing studies that address such techniques and algorithms most efficiently. The convolutional neural network models analyzed in this review are based on deep learning with an emphasis on expression and microexpression recognition. The results suggest that the use of databases with laboratory-controlled images, combined with CNNs such as VGG and ResNet, yields excellent performance in tests. For better understanding, we detail and compare all the methods obtained in the review.
... In addition to subject identity bias, variations in posture, illumination, and occlusions are also common in unconstrained facial expression scenes, which are nonlinearly confused with facial expressions, reinforcing the need for deep networks to address large intra-class variability and learn effective specific expression representations. A survey of the research on deep learning FER can be found in [21,29,30]. Depending on different network structures, the end-to-end deep learning-based methods can be further divided into convolutional neural networks [7,[12][13][14], deep belief networks [31], deep autoencoders [32], recurrent neural networks [33], and generative adversarial networks [34]. ...
Article
Full-text available
To extract facial features with different receptive fields and improve the decision fusion performance of a network ensemble, a symmetric multi-scale residual network (SMResNet) ensemble with a weighted evidence fusion (WEF) strategy for facial expression recognition (FER) was proposed. Firstly, to address the defect that Res2Net connects different filter groups only from one direction in a hierarchical residual-like style, a symmetric multi-scale residual (SMR) block, which can symmetrically extract features from two directions, was developed. Secondly, to highlight the role of different facial regions, a network ensemble was constructed based on three SMResNet networks to extract the decision-level semantics of the whole face, the eyes, and the mouth regions, respectively. Meanwhile, the decision-level semantics of the three regions were regarded as different pieces of evidence for decision-level fusion based on Dempster-Shafer (D-S) evidence theory. Finally, to fuse the different regional expression evidence of the network ensemble, which carries ambiguity and uncertainty, a WEF strategy was introduced to overcome conflicts within the evidence based on support degree adjustment. The experimental results showed that the facial expression recognition rates reached 88.73%, 88.46%, and 88.52% on the FERPlus, RAF-DB, and CAER-S datasets, respectively. Compared with other state-of-the-art methods on the three datasets, the proposed network ensemble, which not only focuses on the decision-level semantics of key regions but also attends to the whole face to compensate for the absence of regional semantics under occlusion and posture variations, improved the performance of facial expression recognition in the wild.
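The decision-level fusion step rests on Dempster's rule of combination; the following sketch shows the special case in which each regional network assigns mass only to singleton emotion classes, so the rule reduces to a normalized product. The paper's weighted evidence fusion additionally adjusts the evidence by support degree, which is not reproduced here.

```python
# Hedged sketch of Dempster's rule for two Bayesian mass functions (singletons only).
import numpy as np

def dempster_combine(m1, m2):
    joint = m1 * m2                    # agreement on each singleton emotion hypothesis
    conflict = 1.0 - joint.sum()       # mass lost to conflicting hypothesis pairs
    return joint / (1.0 - conflict)    # normalise the remaining agreement

eyes  = np.array([0.6, 0.3, 0.1])     # e.g. happy / neutral / angry from the eye region
mouth = np.array([0.5, 0.1, 0.4])     # the same classes predicted from the mouth region
print(dempster_combine(eyes, mouth))  # fused belief, sharper where the regions agree
```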
... Computational studies, such as the Fourier-Bessel model [16], machine learning (ML) methodologies [17], and deep learning (DL) approaches [18], are much needed for studying WKF-induced device variability. Due to their reliability and scalability while upholding the same efficiency, ML/DL approaches are playing crucial roles in applied science and technology, such as image classification [19], computer vision [20], machine translation [21], human-computer interaction [22], natural language processing [23], and many more. Motivated by this, the semiconductor industry is also adopting DL in the modeling and simulation of devices and circuits. ...
Article
Full-text available
Presently deep learning (DL) techniques are massively used in the semiconductor industry. At the same time, applying a deep learning approach to small datasets is also an immense challenge, as larger dataset generation needs more computational time and cost for technology computer-aided design (TCAD) simulation. In this paper, to overcome the aforesaid issue, a hybrid DL-aided prediction of the electrical characteristics of multichannel devices induced by work function fluctuation (WKF) with a smaller dataset is proposed. For the first time, an amalgamation of two deep learning algorithms (i.e., 1D-CNN and LSTM) is implemented for all four channels (1 to 4 channels) of gate-all-around (GAA) silicon Nanosheet and Nanofin MOSFETs (NS-FETs and NF-FETs). The proposed joint learning framework combines a one-dimensional convolutional neural network (1D-CNN) with a long short-term memory (LSTM) model. In this architecture, the CNN can extract features efficiently from the input WKF, and the LSTM identifies the historical sequence of the captured features of the input regression data. To illustrate the excellence of the proposed approach, a comparative study of our hybrid model along with three individual DL models, i.e., 1D-CNN, LSTM, and a baseline multilayer perceptron (MLP) model, is demonstrated for a promising small dataset (i.e., 1100 samples). The results indicate a superior prediction by 1D-CNN-LSTM in terms of root mean square error (RMSE) and R² score within the shortest time span in contrast to the other three algorithms. Finally, the evaluation and performance show that the hybrid methodology not only handles the complexity of both NS- and NF-FETs, but also efficiently estimates the characteristics of all four channels with a smaller dataset, a shorter time span, and reduced computational cost.
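The general 1D-CNN-then-LSTM pattern described above can be sketched in PyTorch as follows; the channel counts, kernel size, and input length are assumptions for illustration, not the authors' configuration.

```python
# Illustrative sketch of a 1D-CNN feeding an LSTM for regression on sequential data.
import torch
import torch.nn as nn

class CNNLSTMRegressor(nn.Module):
    def __init__(self, in_channels=1, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(                            # local feature extraction
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2))
        self.lstm = nn.LSTM(16, hidden, batch_first=True)    # sequence modelling
        self.head = nn.Linear(hidden, 1)                     # predicted characteristic

    def forward(self, x):                                    # x: (batch, channels, length)
        feats = self.cnn(x).transpose(1, 2)                  # -> (batch, length/2, 16)
        _, (h, _) = self.lstm(feats)
        return self.head(h.squeeze(0))

y_pred = CNNLSTMRegressor()(torch.randn(4, 1, 100))          # -> (4, 1)
```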
... These signals are usually referred to as implicit and are derived from a form of communication defined as nonverbal communication. Research shows that a significant amount of affective information is not communicated verbally but is sent and received naturally through this communication channel, which encompasses facial expressions, body language, speech tonality, and other emotional cues [2,3]. This ability allows for an unobtrusive experience and a perception of emotion that is not posed. ...
Preprint
Full-text available
Emotion recognition is the task of classifying perceived emotions in people. Previous works have utilized various nonverbal cues to extract features from images and correlate them to emotions. Of these cues, situational context is particularly crucial in emotion perception since it can directly influence the emotion of a person. In this paper, we propose an approach for high-level context representation extraction from images. The model relies on a single cue and a single encoding stream to correlate this representation with emotions. Our model competes with the state-of-the-art, achieving an mAP of 0.3002 on the EMOTIC dataset while also being capable of execution on consumer-grade hardware at approximately 90 frames per second. Overall, our approach is more efficient than previous models and can be easily deployed to address real-world problems related to emotion recognition.
... By monitoring humans, machine intelligence systems can detect patterns in behavior and suggest changes that could improve interaction and quality of experience. One of the many sub-tasks involved in the recognition of behavior is called emotion recognition, which, from multiple definitions in both computer science and psychology research, can be defined as the process of identifying emotions based on nonverbal cues given by the person [1][2][3][4][5][6][7]. The nonverbal aspect is essential in many contexts where the user needs to interact with environments instead of systems, allowing for an unobtrusive experience and a perception of emotion that is not forced. ...
Article
Full-text available
Emotion recognition is the task of identifying and understanding human emotions from data. In the field of computer vision, there is a growing interest due to the wide range of possible applications in smart cities, health, marketing, and surveillance, among others. To date, several datasets have been proposed to allow techniques to be trained, validated, and finally deployed to production. However, these techniques have several limitations related to the construction of these datasets. In this work, we survey the datasets currently employed in state-of-the-art emotion recognition, to list and discuss their applicability and limitations in real-world scenarios. We propose experiments on the data to extract essential insights related to the provided visual information in each dataset and discuss how they impact the training and validation of techniques. We also investigate the presence of nonverbal cues in the datasets and propose experiments regarding their representativeness, visibility, and data quality. Among other discussions, we show that EMOTIC has more diverse context representations than CAER, however, with conflicting annotations. Finally, we discuss application scenarios and how techniques to approach them could leverage these datasets, suggesting approaches based on findings from these datasets to help guide future research and deployment. With this work we expect to provide a roadmap for upcoming research and experimentation in emotion recognition under real-world conditions.
... Audiovisual data are a dynamic set of signals that change in both space and time (Rouast et al. 2019). Deep learning is commonly used to model these signals in three different ways. Spatial feature representations: learning features from single images, short sequences of images, or short chunks of sound. ...
Article
Full-text available
The acts we engage in that transmit our emotional state or attitude to other people are referred to as emotional expressions. They manifest themselves through both verbal and nonverbal communication. One of the most difficult problems in data science is voice emotion recognition, often framed as a classification task. In this study, we used two independent datasets, RAVDESS and TESS, each including seven distinct emotions: neutral, happy, sad, angry, afraid, disgusted, and startled. Preprocessing and data augmentation were performed on the raw audio waveform using added noise, stretching, shifting, and pitch modification. Features such as MFCC, MFC, and Chroma are then extracted. Two models are proposed, based on CNN and ResNet. To achieve higher classification accuracy, we use an incremental approach to fine-tune the pre-trained model. In contrast to some earlier methods, none of the presented models require the data to be converted into a visual representation; instead, they work directly with the raw sound data. According to the findings of our experiments, our best-performing model performs better than the existing frameworks for TESS, thereby establishing a new standard.
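The augmentation and feature-extraction pipeline described in the abstract can be sketched with librosa; the noise level, stretch rate, shift amount, pitch steps, and feature sizes below are assumptions for illustration, not the study's parameters.

```python
# Hedged sketch of audio augmentation (noise, stretch, shift, pitch) and MFCC/Chroma features.
import numpy as np
import librosa

def augment(y, sr):
    noisy     = y + 0.005 * np.random.randn(len(y))               # additive noise
    stretched = librosa.effects.time_stretch(y, rate=0.9)         # time stretching
    shifted   = np.roll(y, sr // 10)                              # time shifting
    pitched   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch shifting
    return [noisy, stretched, shifted, pitched]

def extract_features(y, sr):
    mfcc   = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
    return np.concatenate([mfcc, chroma])                         # fixed-length vector

sr = 22050
y = 0.5 * np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)        # stand-in for a speech clip
features = [extract_features(a, sr) for a in augment(y, sr)]
```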
... In combination with multimodal approaches, this makes it possible to achieve the most accurate results in emotion recognition [2,5]. However, for deep learning methods to provide reliable results, it is necessary to train them on large amounts of data [49]. Therefore, numerous studies are conducted in the field of Affective Computing to provide datasets that contain signals from different channels. ...
Article
Full-text available
Most of the research in the field of emotion recognition is based on datasets that contain data obtained during affective computing experiments. However, each dataset is described by different metadata, stored in various structures and formats. This research can be counted among those whose aim is to provide a structural and semantic pattern for affective computing datasets, which is an important step to solve the problem of data reuse and integration in this domain. In our previous work, the ROAD ontology was introduced. This ontology was designed as a skeleton for expressing contextual data describing time series obtained in various ways from various signals and was focused on common contextual data, independent of specific signals. The aim of the presented research is to provide a carefully curated vocabulary for describing signals obtained from electrodermal activity, a very important subdomain of emotion analysis. We decided to present it as an extension to the ROAD ontology in order to offer means of sharing metadata for datasets in a unified and precise way. To meet this aim, the research methodology was defined, mostly focusing on requirements specification and integration with other existing ontologies. Application of this methodology resulted firstly in sharing the requirements to allow a broader discussion and secondly in the development of the EDA extension of the ROAD ontology, validated against the MAHNOB-HCI dataset. Both these results are very important with respect to the wider context of the work, i.e. providing an extendable framework for describing affective computing experiments. Introducing the methodology also opens the way for providing new extensions systematically just by executing the steps defined in the methodology.
... Various surveys for emotion recognition have been published in recent years [10] [11] [12] [13] [14] [15] [16] [17] [18]. Li et al. [11] investigate the state-of-the-art for both static and dynamic FERs, and detail the pipelines of FERs in terms of datasets, preprocessing, hand-crafted features, deep learning embedding and comparison of performances, etc. ...
Preprint
Full-text available
With the development of social media and human-computer interaction, it is essential to serve people by perceiving people's emotional state in videos. In recent years, a large number of studies have tackled the issue of emotion recognition based on the three most common modalities in videos, that is, face, speech and text. The focus of this paper is to sort out the relevant studies of emotion recognition using facial, speech and textual cues based on deep learning techniques, given the lack of review papers concentrating on the three modalities. In this paper, we first introduce widely accepted emotion models for the purpose of interpreting the definition of emotion. Then we introduce the state-of-the-art for emotion recognition based on a single modality, including facial expression recognition, speech emotion recognition and textual emotion recognition. For multimodal emotion recognition, we summarize the feature-level and decision-level fusion methods in detail. In addition, the description of relevant benchmark datasets, the definition of metrics and the performance of the state-of-the-art in recent years are also outlined for the convenience of readers to find out the current research progress. Finally, we explore some potential research challenges and opportunities to give researchers a reference for enriching emotion recognition-related research.
... Deep learning-related methods have had great success in the field of pattern recognition, and more and more researchers are using them in affective computing tasks [16], for example through new deep learning models [17]; many innovative machine learning models have also been generated. ...
Article
Full-text available
The recognition of human emotions is expected to completely change the mode of human-computer interaction. In emotion recognition research, we need to focus on accuracy and real-time performance in order to apply emotion recognition based on physiological signals to solve practical problems. Considering the timeliness dimension of emotion recognition, we propose a terminal-edge-cloud system architecture. Compared to traditional sentiment computing architectures, the proposed architecture reduces the average time consumption by 15% when running the same affective computing process. We propose a Joint Mutual Information (JMI) based feature extraction affective computing model and conduct extensive experiments on the AMIGOS dataset. Through experimental comparison, this feature extraction network shows clear advantages over the commonly used methods. The model performs sentiment classification, and the average accuracies for valence and arousal are 71% and 81.8%; compared with recent similar sentiment classifier research, the average accuracy is improved by 0.85%. In addition, we set up an experiment with 30 people in an online learning scenario to validate the computing system and algorithm model. The results proved that the accuracy and real-time recognition were satisfactory and improved the real-time emotional interaction experience in online learning.
... Ideally, with sufficient high-quality labeled samples and many iterations of training, the CNNs can learn to extract high-level representations that build from low-level features. Many studies have shown that, for basic emotion recognition, the facial expression features learned by CNNs outperform predesigned features and have achieved state-of-the-art performances [22,23]. ...
Article
Full-text available
Current deep learning-based facial expression recognition has mainly focused on the six basic human emotions and relied on large-scale and well-annotated data. For complex emotion recognition, such a large amount of data is not easy to obtain, and high-quality annotation is even more difficult. Therefore, in this paper, we regard complex emotion recognition via facial expressions as a few-shot learning problem and introduce a metric-based few-shot model named self-cure relation networks (SCRNet), which is robust to label noise and is able to classify facial images of new classes of emotions using only a few examples from each. Specifically, SCRNet learns a distance metric based on deep features abstracted by convolutional neural networks and predicts a query image’s emotion category by computing relation scores between the query image and the few examples of each new class. To tackle the label noise problem, SCRNet gives corrected labels to noisy data via class prototypes stored in external memory during the meta-training phase. Experiments on public datasets as well as on synthetic noise datasets demonstrate the effectiveness of our method.
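The relation-score idea behind SCRNet follows the general metric-based relation network recipe; the sketch below scores a query embedding against few-shot class prototypes (the feature dimension and relation module are assumptions, and the self-cure label-correction mechanism is not shown).

```python
# Hypothetical sketch of relation scores between a query and per-class prototypes.
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * feat_dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())            # relation score in [0, 1]

    def forward(self, query, prototypes):
        # query: (feat_dim,); prototypes: (n_classes, feat_dim), means of few-shot examples
        q = query.unsqueeze(0).expand_as(prototypes)
        return self.relation(torch.cat([q, prototypes], dim=-1)).squeeze(-1)

protos = torch.randn(5, 64)                            # 5 new emotion classes
scores = RelationHead()(torch.randn(64), protos)       # argmax gives the predicted emotion
```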
... Automatic emotion recognition, as the first step to enable machines to have emotional intelligence, has been an active research area for the past two decades. Video emotion recognition (VER) refers to predicting the emotional states of the target person by analyzing information from different cues such as facial actions, acoustic characteristics and spoken language (Rouast et al., 2019;Wang et al., 2022). At the heart of this task is how to effectively learn emotional salient representations from multiple modalities including audio, visual, and text. ...
Article
Full-text available
Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show more fine-grained temporal dynamics of emotion. From the perspective of cognitive sciences, the process of emotion expression, either through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this fact, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video, where a semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences. Experimental results on two benchmark emotion databases indicate the superiority of our proposed method. With the semantic enhanced audio and visual features, it outperforms the state-of-the-art models which fuse the features or decisions from the audio, visual and text modalities.
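The semantic-guidance step, in which text features reinforce the audio/visual stream, can be approximated with standard cross-modal attention; the sketch below uses PyTorch's nn.MultiheadAttention with assumed dimensions and is not the paper's semantic enhancement module or multimodal bottleneck Transformer.

```python
# Hedged sketch: each visual time step attends to text tokens carrying global semantics.
import torch
import torch.nn as nn

d = 128
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

visual = torch.randn(2, 50, d)    # frame-level visual features (batch, time, dim)
text   = torch.randn(2, 20, d)    # token-level text features (batch, tokens, dim)

# semantic context flows from text (keys/values) into the visual sequence (queries)
enhanced_visual, _ = cross_attn(query=visual, key=text, value=text)
```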
... Since 2010, researchers have explored the application of deep learning architectures for affect recognition, following traditional machine learning insights for affect recognition. It is important to note that between 2010 and 2017, around 950 studies were published in this area [36], showing significant interest from the community. In this work, we propose an easy-to-use deep learning approach for affect recognition. ...
Chapter
Full-text available
Access to information and distributed computing resources is becoming more and more open, but this openness has not yet translated into a solution to the reproducibility crisis for computational studies and their experimentation. Researchers over the years have investigated the factors that affect reproducibility in data science related studies. Some common findings point out that non-reproducible studies lack information on or access to the dataset in its original form and order, the software environment used, randomization control, and the actual implementation of the proposed techniques. In addition to that, some studies require a large number of computational resources that not everybody can afford. This work explores how to overcome some of the main challenges in reproducible research with a focus on multimodal video action recognition. We present MultiAffect, an inclusive reproducible research framework that standardizes feature extraction techniques, training and evaluation methods, and research document formatting for multimodal video action recognition tasks in an online environment. The proposed framework is designed to use a simple vanilla version of popular algorithms as a baseline, with the flexibility to plug state-of-the-art algorithms into the workflow with ease for further research. We tested the framework on two different video analysis approaches: video action recognition and affect recognition. MultiAffect was able to perform both tasks by only setting up the proper configuration. The results produced by MultiAffect were competitive with published studies, and the framework was deployed in Google Colaboratory (http://bit.ly/multiaffect), validating its inclusiveness as we are able to reproduce experiments with no client requirements (online), no configuration, and free of charge. We hope that inclusive reproducible research frameworks for complex and highly demanding tasks can reduce the barrier to entry of video analysis and boost progress in this area.
... Deep models based on convolutional filters have recently become an interesting tool for analyzing signals and images in different applications such as pattern recognition [42,43], image segmentation [44,45], and signal extraction [46,47]. In the proposed approach, a layer of convolutional filters is added before the compression. ...
Article
A chaotic map with two sine functions is studied here, and its chaotic dynamics are investigated. Bifurcation diagrams and Lyapunov exponents of the map by different initiation methods show the multistability in some intervals of the bifurcation parameter. Then an encryption method based on sparse representation and the chaotic map is proposed. In the approach, the wavelet domain is used to obtain sparse representation. The sparsity is enhanced by a sparsification operation. A convolutional layer is considered before the compression by the measurement matrices to extract the local information. Experimental results show that the method improves the image reconstruction performance and security level compared to previous methods.
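The kind of analysis the abstract refers to, estimating a Lyapunov exponent to diagnose chaos in a one-dimensional map, can be sketched as follows; the sine-based map used here is a hypothetical stand-in, since the paper's exact map is not given in this excerpt.

```python
# Illustrative Lyapunov-exponent estimate for a (hypothetical) map with two sine terms.
import numpy as np

def sine_sine_map(x, a):
    return np.sin(a * np.pi * x) + 0.5 * np.sin(2 * a * np.pi * x)

def lyapunov(a, x0=0.1, n_transient=500, n_iter=2000, eps=1e-8):
    x, total = x0, 0.0
    for i in range(n_transient + n_iter):
        # numerical derivative of the map at the current state
        deriv = (sine_sine_map(x + eps, a) - sine_sine_map(x - eps, a)) / (2 * eps)
        if i >= n_transient:
            total += np.log(abs(deriv) + 1e-12)
        x = sine_sine_map(x, a)
    return total / n_iter              # a positive value indicates chaotic behaviour

print(lyapunov(a=2.7))
```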
... Machine learning techniques have been applied in various fields, including agriculture, transportation, business, and education. Machine learning has led to the development of affective computing methods that automatically recognize human emotions and behaviours (Schuller 2015; Kratzwald et al., 2018; Zhao et al., 2019; Rouast et al., 2021), supporting the advancement of artificial intelligence in education applications (Ouyang and Jiao 2021). Therefore, in general, automatic engagement estimation methods are referred to as machine learning (ML)-based algorithms. ...
Article
Full-text available
Background Recognizing learners’ engagement during learning processes is important for providing personalized pedagogical support and preventing dropouts. As learning processes shift from traditional offline classrooms to distance learning, methods for automatically identifying engagement levels should be developed. Objective This article aims to present a literature review of recent developments in automatic engagement estimation, including engagement definitions, datasets, and machine learning-based methods for automation estimation. The information, figures, and tables presented in this review aim at providing new researchers with insight on automatic engagement estimation to enhance smart learning with automatic engagement recognition methods. Methods A literature search was carried out using Scopus, Mendeley references, the IEEE Xplore digital library, and ScienceDirect following the four phases of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA): identification, screening, eligibility, and inclusion. The selected studies included research articles published between 2010 and 2022 that focused on three research questions (RQs) related to the engagement definitions, datasets, and methods used in the literature. The article selection excluded books, magazines, news articles, and posters. Results Forty-seven articles were selected to address the RQs and discuss engagement definitions, datasets, and methods. First, we introduce a clear taxonomy that defines engagement according to different types and the components used to measure it. Guided by this taxonomy, we reviewed the engagement types defined in the selected articles, with emotional engagement (n = 40; 65.57%) measured by affective cues appearing most often (n = 38; 57.58%). Then, we reviewed engagement and engagement-related datasets in the literature, with most studies assessing engagement with external observations (n = 20; 43.48%) and self-reported measures (n = 9; 19.57%). Finally, we summarized machine learning (ML)-based methods, including deep learning, used in the literature. Conclusions This review examines engagement definitions, datasets and ML-based methods from forty-seven selected articles. A taxonomy and three tables are presented to address three RQs and provide researchers in this field with guidance on enhancing smart learning with automatic engagement recognition. However, several key challenges remain, including cognitive and personalized engagement and ML issues that may affect real-world implementations.
... Based on theoretical support, outstanding performance and quantity of existing work, and potential for future development, it is necessary to review the state of graph-based FAA methods. Although many reviews have discussed FAA's historical evolution [7,34,44] and recent advances [45,46,47], including some specific problems like occluded expression [48], multi-modal affect [49] and microexpression [50], this is the FIRST systematic and in-depth survey for the graph-based FAA field as far as we know. We emphasize representative research proposed after 2010. ...
Article
Full-text available
As one of the most important affective signals, facial affect analysis (FAA) is essential for developing human-computer interaction systems. Early methods focus on extracting appearance and geometry features associated with human affects while ignoring the latent semantic information among individual facial changes, leading to limited performance and generalization. Recent work attempts to establish a graph-based representation to model these semantic relationships and develop frameworks to leverage them for various FAA tasks. This paper provides a comprehensive review of graph-based FAA, including the evolution of algorithms and their applications. First, the FAA background knowledge is introduced, especially on the role of the graph. We then discuss approaches widely used for graph-based affective representation in literature and show a trend towards graph construction. For the relational reasoning in graph-based FAA, existing studies are categorized according to their non-deep or deep learning methods, emphasizing the latest graph neural networks. Performance comparisons of the state-of-the-art graph-based FAA methods are also summarized. Finally, we discuss the challenges and potential directions. As far as we know, this is the first survey of graph-based FAA methods. Our findings can serve as a reference for future research in this field.
... Changing social design over time also implies the need to understand the psychological constructs in more detail, which researchers can achieve by modeling user feelings more consistently. In particular, increased use of user feedback and sensors, such as activity trackers and video cameras, may support improved modeling of user feelings (Rouast et al., 2021). Using this additional knowledge would allow researchers to adapt a DHR according to users' preferences and perceptions and, thus, increase their interest and motivation to participate in an intervention. ...
Article
As our world becomes increasingly interconnected, technology continues to evolve at an unprecedented pace. One of the most intriguing advancements in recent years is the development of emotion recognition technologies [1]. These systems, powered by artificial intelligence algorithms for machine reasoning and learning, running on sophisticated hardware, can analyze human expressions, vocal tones, and body language to identify and interpret emotions [2]. While the potential applications of this technology are vast, ranging from mental health support [3] to enhancing customer experiences [4], it also raises significant concerns about personal privacy and public safety. This article delves into the complex implications of emotion recognition technologies, exploring the delicate balance between the promise of improved public safety and the protection of individual privacy.
Article
In the process of day-to-day learning, expressions are vital. Detection of human expressions has grown and caught the interest of many researchers in the past few years, facilitating research on Affective Computing. Classification of expressions is needed to keep the focus on several expressions when developing an efficient electronic learning system, and it can provide solutions in applications such as psychology, human-computer interaction, artificial intelligence, education, and gaming. The core attention of this review is to highlight the most recent algorithms used for the extraction of features from multi-modal datasets and the various classifiers used in the discipline of multi-modal expression detection (MED). In this study, we review the literature on MED between 2015 and 2022, focusing mainly on feature extraction and classifier schemes. Their advantages, disadvantages and the overall picture are discussed in brief. This survey also shows that the efficiency of multi-modal expression detection is better compared to single-modal approaches.
Article
In this study, novel Spectro-Temporal Energy Ratio features based on the formants of vowels and the linearly spaced low-frequency and logarithmically spaced high-frequency parts of the human auditory system are introduced to implement single- and cross-corpus speech emotion recognition experiments. Since the underlying dynamics and characteristics of speech recognition and speech emotion recognition differ substantially, designing an emotion-recognition-specific filter bank is mandatory. The proposed features are built on a novel filter bank strategy that constructs 7 trapezoidal filter banks. These novel filter banks differ from the Mel and Bark scales in shape and frequency regions and are targeted at generalizing the feature space. Cross-corpus experimentation is a step forward in speech emotion recognition, but researchers are usually chagrined at its results. Our goal is to create a feature set that is robust and resistant to cross-corpus variations using various feature selection algorithms. We will prove this by shrinking the dimension of the feature space from 6984 down to 128 while boosting the accuracy using SVM, RBM, and sVGG (small-VGG) classifiers. Although RBMs are no longer considered fashionable, we will show that they can achieve outstanding results when tuned properly. This paper reports a striking 90.65% accuracy rate harnessing STER features on EmoDB.
Article
Facial emotion recognition (FER) has become an important topic in the fields of computer vision and artificial intelligence due to its great academic and commercial potential. Although FER can be performed using multiple sensors, this review focuses on studies using facial images exclusively, since visual expressions are one of the main channels of information in human communication. Automatic emotion recognition based on facial expressions is an interesting research area that has been applied in various fields such as safety, health and human-computer interfaces. Researchers in this field are interested in developing techniques to interpret and encode facial expressions and extract these features for better prediction by computer. With the remarkable success of deep learning, different types of architectures of this technique are exploited to achieve better performance. The purpose of this paper is to conduct a study of recent work on automatic facial emotion recognition (FER) via deep learning. We highlight these contributions, the architectures and the databases used, and we show the progress achieved by comparing the proposed methods and the obtained results. The purpose of this paper is to serve and guide researchers by reviewing recent work and providing insights to improve the field.
Chapter
Accurate emotional assessments of persons can be useful in the healthcare system and in human interaction. Understanding emotions piques people’s curiosity because of the possible uses, as comprehending how someone else feels allows us to communicate and transfer information more effectively. This work proposes transformers and MobileNet for emotion classification using video data. Validating transformers and MobileNet on the RAVDESS dataset is the core idea of the proposed work. When compared to a top CNN transfer learning model like MobileNet, the study demonstrates that the SWIN and vision transformers outperform MobileNet on the RAVDESS dataset. We achieved an accuracy of 92% with the vision transformer, 91% with the SWIN transformer, and 86% with MobileNet when tested on the RAVDESS dataset. The proposed robust emotion recognition system also enables shorter training and testing times, which is ideal for various applications. Keywords: Emotion recognition, Deep learning, Transformers, Transfer learning, MobileNet, SWIN transformer, Vision transformer
Article
Emotions can be expressed through multiple complementary modalities. This study selected speech and facial expressions as modalities by which to recognize emotions. Current audiovisual emotion recognition models perform supervised learning using signal-level inputs. Such models are presumed to characterize the temporal relationships in signals. In this study, supervised learning was performed on segment-level signals, which are more granular than signal-level signals, to precisely train an emotion recognition model. Effectively fusing multimodal signals is challenging. In this study, sequential segments of audiovisual signals were obtained, and features were extracted and applied to estimate segment-level attention weights according to the emotional consistency of the two modalities using a neural tensor network. A proposed bimodal Transformer Encoder was trained using signal-level and segment-level emotion labels in which temporal context was incorporated into the signals to improve upon existing emotion recognition models. In bimodal emotion recognition, the experimental results demonstrated that the proposed method achieved 74.31% accuracy (3.05% higher than the method of fusing correlation features) on the audio-visual emotion dataset BAUM-1, which is based on fivefold cross-validation, and 76.81% accuracy (2.57% higher than the Multimodal Transformer Encoder) on the multimodal emotion data set CMU-MOSEI, which is composed of training, validation, and testing sets.
Article
The increasing interest in and advancement of internet and communication technologies have made network security a vibrant research domain. Network intrusion detection systems (NIDSs) have developed as indispensable defense mechanisms in cybersecurity that are employed in the discovery and prevention of malicious network activities. In recent years, researchers have proposed deep learning approaches in the development of NIDSs owing to their ability to extract better representations from large corpora of data. In the literature, convolutional neural network architectures are extensively used for spatial feature learning, while long short term memory networks are employed to learn temporal features. In this paper, a novel hybrid method that learns discriminative spatial and temporal features from the network flow is proposed for detecting network intrusions. A two dimensional convolution neural network is proposed to intelligently extract the spatial characteristics, whereas a bi-directional long short term memory is used to extract temporal features of network traffic data samples, consequently forming a deep hybrid neural network architecture for identification and classification of network intrusion samples. Extensive experimental evaluations were performed on two well-known benchmark datasets: CIC-IDS 2017 and NSL-KDD. The proposed network model demonstrated state-of-the-art performance, with experimental results showing that the accuracy and precision scores of the intrusion detection model are significantly better than those of other existing models. These results depict the applicability of the proposed model to spatial-temporal feature learning in network intrusion detection systems.
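The spatial-plus-temporal hybrid described above follows a common 2D-CNN-plus-BiLSTM pattern; the PyTorch sketch below is purely illustrative, with the flow-feature reshaping, layer sizes, and class count chosen as assumptions rather than taken from the paper.

```python
# Hedged sketch: 2D CNN for spatial features, bidirectional LSTM for temporal features.
import torch
import torch.nn as nn

class HybridNIDS(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                   # spatial patterns in the flow "image"
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.bilstm = nn.LSTM(input_size=8 * 4, hidden_size=32,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 32, n_classes)

    def forward(self, x):                           # x: (batch, 1, 8, 8) reshaped features
        f = self.cnn(x)                             # -> (batch, 8, 4, 4)
        seq = f.permute(0, 2, 1, 3).flatten(2)      # -> (batch, 4, 32): rows as time steps
        _, (h, _) = self.bilstm(seq)
        return self.fc(torch.cat([h[0], h[1]], dim=-1))

logits = HybridNIDS()(torch.randn(16, 1, 8, 8))     # 16 flows, 8x8 feature maps
```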
Article
Full-text available
Multimodal machine learning (MML) is a promising multidisciplinary research area where heterogeneous data from multiple modalities and machine learning (ML) are combined to solve critical problems. Usually, research works use data from a single modality, such as images, audio, text, or signals. However, real-world issues have become critical, and handling them using multiple modalities of data instead of a single modality can significantly impact finding solutions. ML algorithms play an essential role by tuning parameters in developing MML models. This paper reviews recent advancements in the challenges of MML, namely representation, translation, alignment, fusion, and co-learning, and presents the gaps and challenges. A systematic literature review (SLR) was applied to define the progress and trends on those challenges in the MML domain. In total, 1032 articles were examined in this review to extract features like source, domain, application, modality, etc. This research article will help researchers understand the current state of MML and navigate the selection of future research directions.
Article
In recent years, emotion recognition based on electroencephalography (EEG) signals has attracted plenty of attention. Most of the existing works have focused on normal or depressed people. Due to the lack of hearing ability, it is difficult for hearing-impaired people to express their emotions through language in their social activities. In this work, we collected the EEG signals of hearing-impaired subjects while they were watching six kinds of emotional video clips (happiness, inspiration, neutral, anger, fear, and sadness) for emotion recognition. The biharmonic spline interpolation method was utilized to convert the traditional frequency-domain features, Differential Entropy (DE), Power Spectral Density (PSD), and Wavelet Entropy (WE), into the spatial domain. The patch embedding (PE) method was used to segment the feature map into equally sized patches to obtain the differences in the distribution of emotional information among brain regions. For feature classification, a compact residual network with Depthwise convolution (DC) and Pointwise convolution (PC) is proposed to separate spatial and channel mixing dimensions and better extract information between channels. Subject-dependent experiments based on 70% training sets and 30% testing sets were performed. The results showed that the average classification accuracies with PE (DE), PE (PSD), and PE (WE) were 91.75%, 85.53%, and 75.68%, respectively, improvements of 11.77%, 23.54%, and 16.61% over DE, PSD, and WE alone. Moreover, comparison experiments were carried out on the SEED and DEAP datasets with PE (DE), which achieved average accuracies of 90.04% (positive, neutral, and negative) and 88.75% (high valence and low valence). By exploring the emotional brain regions, we found that the frontal, parietal, and temporal lobes of hearing-impaired people were associated with emotional activity, compared to normal people whose main emotional brain area was the frontal lobe.
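A compact depthwise plus pointwise convolution block of the kind mentioned above can be sketched as follows (channel counts and the residual connection are placeholders; this is not the authors' exact architecture):

import torch
import torch.nn as nn

class DWSeparableBlock(nn.Module):
    """Depthwise convolution mixes spatial positions per channel;
    pointwise (1x1) convolution then mixes information across channels."""
    def __init__(self, channels=32):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.pointwise(self.depthwise(x)))  # residual connection

y = DWSeparableBlock()(torch.randn(2, 32, 16, 16))  # e.g. interpolated EEG feature maps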
Article
The widespread popularity of Machine Learning (ML) models in healthcare solutions has increased the demand for their interpretability and accountability. In this paper, we propose the Physiologically-Informed Gaussian Process (PhGP) classification model, an interpretable machine learning model founded on the Bayesian nature of Gaussian Processes (GPs). Specifically, we inject problem-specific domain knowledge of the inherent physiological mechanisms underlying psycho-physiological states as a prior distribution over the GP latent space. Thus, to estimate the hyper-parameters in PhGP, we rely on the information from raw physiological signals as well as the designed prior function encoding the physiologically-inspired modelling assumptions. Alongside this new model, we present novel interpretability metrics that highlight the most informative input regions contributing to the GP prediction. We evaluate the ability of PhGP to provide an accurate and interpretable classification on three different datasets, including electrodermal activity (EDA) signals collected during emotional, painful, and stressful tasks. Our results demonstrate that, for all three tasks, recognition performance is improved by using the PhGP model compared to competitive methods. Moreover, PhGP is able to provide physiologically sound interpretations of its predictions.
Article
Full-text available
As deep networks constantly deepen to extract high-level abstract features, the significance of shallow features for the target task inevitably diminishes. To address this issue and provide novel technical support for current research in the field of facial expression recognition (FER), in this article we propose a network, called NA-Resnet, that increases the decision weight of the shallow and middle feature mappings through the neighbor block (Nei Block) and concentrates on the crucial areas for extracting necessary features through the optimized attention module (OAM). Our work has several merits. First, to the best of our knowledge, NA-Resnet is the first network that directly utilizes surface features to assist image classification. Second, the suggested OAM is embedded into each layer of the network and can precisely extract the critical information appropriate to the current stage. Third, our model achieves its best performance when using a single, relatively lightweight network without a network ensemble on Fer2013. Extensive experiments have been conducted, and the results show that our model achieves state-of-the-art performance compared with any single network on Fer2013. In particular, our NA-Resnet achieves 74.59% on Fer2013 and an average accuracy of 96.06% with a standard deviation of 2.9% through 10-fold cross-validation on CK+.
Conference Paper
Full-text available
Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge 2017. EmotiW aims at providing a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which reflect real-world (not lab-like) scenarios. The baseline, data, and protocol of the two challenges and the challenge participation are discussed in detail in this paper.
Conference Paper
Full-text available
This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of the group emotion categories: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all faces for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on both aligned and non-aligned faces. For the global image based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the image to predict the final group emotion category. We win the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, which improve the baseline results by about 30%.
Article
Full-text available
Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for understanding and comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then co-cluster hidden state units and words based on the expected response and visualize co-clustering results as memory chips and word clouds to provide more structured knowledge on RNNs' hidden states. We also propose a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN's hidden state at the sentence-level. The usability and effectiveness of our method are demonstrated through case studies and reviews from domain experts.
Conference Paper
Full-text available
Automatic emotion recognition is a challenging task which can make a great impact on improving natural human-computer interaction. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous emotion prediction on three affective dimensions: Arousal, Valence and Likability based on the audiovisual signals. We highlight three aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available modalities including acoustic, visual, and textual modalities, and we further consider the interlocutor influence for the acoustic features; 2) we compare the effectiveness of the non-temporal model SVR and the temporal model LSTM-RNN and show that the LSTM-RNN can not only alleviate feature engineering efforts such as construction of contextual features and feature delay, but also improve the recognition performance significantly; 3) we apply a multi-task learning strategy for collaborative prediction of multiple emotion dimensions with shared representations, according to the fact that different emotion dimensions are correlated with each other. Our solutions achieve CCCs of 0.675, 0.756 and 0.509 on arousal, valence, and likability respectively on the challenge testing set, which outperforms the baseline system with corresponding CCCs of 0.375, 0.466, and 0.246 on arousal, valence, and likability.
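For reference, the concordance correlation coefficient (CCC) reported here and in the following abstracts can be computed from predictions and gold labels as sketched below (NumPy, assuming 1-D arrays):

import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

print(ccc(np.array([0.1, 0.4, 0.3]), np.array([0.2, 0.5, 0.2])))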
Article
Full-text available
Dimensional affect recognition is a challenging topic and current techniques do not yet provide the accuracy necessary for HCI applications. In this work we propose two new methods. The first is a novel self-organizing model that learns from similarity between features and affects. This method produces a graphical representation of the multidimensional data which may assist expert analysis. The second method uses extreme learning machines, an emerging artificial neural network model. Aiming for minimum intrusiveness, we use only heart rate variability, which can be recorded using a small set of sensors. The methods were validated with two datasets. The first is composed of 16 sessions with different participants and was used to evaluate the models in a classification task. The second one was the publicly available Remote Collaborative and Affective Interaction (RECOLA) dataset, which was used for dimensional affect estimation. The performance evaluation used the kappa score, unweighted average recall and the concordance correlation coefficient. The concordance coefficient on the RECOLA test partition was 0.421 in arousal and 0.321 in valence. Results show that our models outperform state-of-the-art models on the same data and provide new ways to analyze affective states.
Conference Paper
Full-text available
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video regarding our participation to the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data.
Article
Full-text available
Recent advancements in human–computer interaction research have led to the possibility of emotional communication via brain–computer interface systems for patients with neuropsychiatric disorders or disabilities. In this study, we efficiently recognize emotional states by analyzing the features of electroencephalography (EEG) signals, which are generated from EEG sensors that non-invasively measure the electrical activity of neurons inside the human brain, and select the optimal combination of these features for recognition. In this study, the scalp EEG data of 21 healthy subjects (12–14 years old) were recorded using a 14-channel EEG machine while the subjects watched images with four types of emotional stimuli (happy, calm, sad, or scared). After preprocessing, the Hjorth parameters (activity, mobility, and complexity) were used to measure the signal activity of the time series data. We selected the optimal EEG features using a balanced one-way ANOVA after calculating the Hjorth parameters for different frequency ranges. Features selected by this statistical method outperformed univariate and multivariate features. The optimal features were further processed for emotion classification using support vector machine (SVM), k-nearest neighbor (KNN), linear discriminant analysis (LDA), Naive Bayes, Random Forest, deep-learning, and four ensemble methods (bagging, boosting, stacking, and voting). The results show that the proposed method substantially improves the emotion recognition rate with respect to the commonly used spectral power band method.
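The Hjorth parameters mentioned above have simple closed forms in terms of the variance of the signal and of its successive derivatives; a minimal NumPy sketch:

import numpy as np

def hjorth(signal):
    """Hjorth activity, mobility and complexity of a 1-D EEG signal."""
    d1 = np.diff(signal)            # first derivative (finite differences)
    d2 = np.diff(d1)                # second derivative
    activity = np.var(signal)
    mobility = np.sqrt(np.var(d1) / activity)
    complexity = np.sqrt(np.var(d2) / np.var(d1)) / mobility
    return activity, mobility, complexity

print(hjorth(np.sin(np.linspace(0, 10 * np.pi, 1000))))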
Article
Full-text available
Many paralinguistic tasks are closely related and thus representations learned in one domain can be leveraged for another. In this paper, we investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Further, we extend this problem to cross-dataset tasks, asking how knowledge captured in one emotion dataset can be transferred to another. We focus on progressive neural networks and compare these networks to the conventional deep learning method of pre-training and fine-tuning. Progressive neural networks provide a way to transfer knowledge and avoid the forgetting effect present when pre-training neural networks on different tasks. Our experiments demonstrate that: (1) emotion recognition can benefit from using representations originally learned for different paralinguistic tasks and (2) transfer learning can effectively leverage additional datasets to improve the performance of emotion recognition systems.
Article
Full-text available
Facial expressions play a significant role in human communication and behavior. Psychologists have long studied the relationship between facial expressions and emotions. Paul Ekman et al. devised the Facial Action Coding System (FACS) to taxonomize human facial expressions and model their behavior. The ability to recognize facial expressions automatically enables novel applications in fields like human-computer interaction, social gaming, and psychological research. There has been tremendously active research in this field, with several recent papers utilizing convolutional neural networks (CNN) for feature extraction and inference. In this paper, we employ CNN understanding methods to study the relation between the features these computational networks use, the FACS and Action Units (AU). We verify our findings on the Extended Cohn-Kanade (CK+), NovaEmotions and FER2013 datasets. We apply these models to various tasks and tests using transfer learning, including cross-dataset validation and cross-task performance. Finally, we exploit the nature of the FER-based CNN models for the detection of micro-expressions and achieve state-of-the-art accuracy using a simple long short-term memory (LSTM) recurrent neural network (RNN).
Article
Full-text available
Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content for various styles of speaking, robust features need to be extracted. To this purpose, we utilize a Convolutional Neural Network (CNN) to extract features from the speech, while for the visual modality a 50-layer deep residual network (ResNet) is employed. In addition to the importance of feature extraction, the machine learning algorithm also needs to be insensitive to outliers while being able to model the context. To tackle this problem, Long Short-Term Memory (LSTM) networks are utilized. The system is then trained in an end-to-end fashion where - by also taking advantage of the correlations of each of the streams - we manage to significantly outperform traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.
Article
Full-text available
Facial expression recognition (FER) is increasingly gaining importance in various emerging affective computing applications. In practice, achieving accurate FER is challenging due to the large amount of inter-personal variations such as expression intensity variations. In this paper, we propose a new spatio-temporal feature representation learning for FER that is robust to expression intensity variations. The proposed method utilizes representative expression-states (e.g., onset, apex and offset of expressions) which can be specified in facial sequences regardless of the expression intensity. The characteristics of facial expressions are encoded in two parts in this paper. In the first part, spatial image characteristics of the representative expression-state frames are learned via a convolutional neural network. Five objective terms are proposed to improve the expression class separability of the spatial feature representation. In the second part, temporal characteristics of the spatial feature representation from the first part are learned with a long short-term memory (LSTM) network over the facial expression sequence. Comprehensive experiments have been conducted on a deliberate expression dataset (MMI) and a spontaneous micro-expression dataset (CASME II). Experimental results showed that the proposed method achieved higher recognition rates in both datasets compared to the state-of-the-art methods.
Conference Paper
Full-text available
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
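The attention-based pooling over time described above can be sketched roughly as follows (the scoring network and dimensions are illustrative assumptions, not the paper's exact formulation):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Weighted average of frame-level features, with weights learned per frame."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):
        # frames: (batch, time, dim) -- e.g. RNN outputs over a speech signal
        alpha = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1) attention weights
        return (alpha * frames).sum(dim=1)                # (batch, dim) utterance-level vector

utt = AttentionPool()(torch.randn(4, 300, 128))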
Article
Full-text available
Pain is an unpleasant feeling that has been shown to be an important factor in the recovery of patients. Since assessing it is costly in human resources and difficult to do objectively, there is a need for automatic systems to measure it. In this paper, contrary to current state-of-the-art techniques in pain assessment, which are based on facial features only, we suggest that performance can be enhanced by feeding the raw frames to deep learning models, outperforming the latest state-of-the-art results while also directly facing the problem of imbalanced data. As a baseline, our approach first uses convolutional neural networks (CNNs) to learn facial features from VGG_Faces, which are then linked to a long short-term memory to exploit the temporal relation between video frames. We further compare the performance of the popular schema based on canonically normalized appearance versus taking into account the whole image. As a result, we outperform the current state-of-the-art area-under-the-curve performance on the UNBC-McMaster Shoulder Pain Expression Archive Database. In addition, to evaluate the generalization properties of our proposed methodology on facial motion recognition, we also report competitive results on the Cohn-Kanade+ facial expression database.
Conference Paper
Full-text available
With rapid developments in the design of deep architecture models and learning algorithms, methods referred to as deep learning have come to be widely used in a variety of research areas such as pattern recognition, classification, and signal processing. Deep learning methods are being applied in various recognition tasks such as image, speech, and music recognition. Convolutional Neural Networks (CNNs) especially show remarkable recognition performance for computer vision tasks. In addition, Recurrent Neural Networks (RNNs) show considerable success in many sequential data processing tasks. In this study, we investigate the performance of Speech Emotion Recognition (SER) algorithms based on CNNs and RNNs trained using an emotional speech database. The main goal of our work is to propose a SER method based on concatenated CNNs and RNNs without using any traditional hand-crafted features. By applying the proposed methods to an emotional speech database, the classification result was verified to have better accuracy than that achieved using conventional classification methods.
Chapter
State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.
Conference Paper
State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to the emotion recognition problem. Benefiting from adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is well eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method brings large accuracy gains over diverse backbone networks consistently. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelope over all existing records.
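The idea of supervising intermediate and shallow layers can be illustrated with a simple multi-loss setup (a generic deep-supervision sketch under assumed weighting, not the exact Supervised Scoring Ensemble):

import torch
import torch.nn as nn

def deep_supervision_loss(scores_per_layer, fused_scores, labels):
    """Cross-entropy applied to intermediate class-wise scores and to the fused ensemble output."""
    ce = nn.CrossEntropyLoss()
    aux = sum(ce(s, labels) for s in scores_per_layer)   # supervision at shallow/intermediate layers
    return ce(fused_scores, labels) + 0.3 * aux          # 0.3 is an assumed auxiliary weight

# Example with 3 supervised layers, 7 emotion classes, batch of 8
labels = torch.randint(0, 7, (8,))
layer_scores = [torch.randn(8, 7) for _ in range(3)]
loss = deep_supervision_loss(layer_scores, torch.randn(8, 7), labels)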
Article
Regularization is one of the crucial ingredients of deep learning, yet the term regularization has various definitions, and regularization methods are often studied separately from each other. In our work we present a systematic, unifying taxonomy to categorize existing methods. We distinguish methods that affect data, network architectures, error terms, regularization terms, and optimization procedures. We do not provide all details about the listed methods; instead, we present an overview of how the methods can be sorted into meaningful categories and sub-categories. This helps revealing links and fundamental similarities between them. Finally, we include practical recommendations both for users and for developers of new regularization methods.
Article
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabelled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labelled data sets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and at closing the performance gap between neural networks learnt with and without unsupervised pre-training.
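For completeness, the rectifier, logistic sigmoid, and hyperbolic tangent activations compared above are

f(x) = \max(0, x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},

where the rectifier's hard non-linearity at x = 0 produces exact zeros and hence the sparse representations referred to in the abstract.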
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Automated affective computing in the wild setting is a challenging problem in computer vision. Existing annotated databases of facial expressions in the wild are small and mostly cover discrete emotions (aka the categorical model). There are very limited annotated facial databases for affective computing in the continuous dimensional model (e.g., valence and arousal). To meet this need, we collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet). AffectNet contains more than 1,000,000 facial images collected from the Internet by querying three major search engines using 1250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated for the presence of seven discrete facial expressions and the intensity of valence and arousal. AffectNet is by far the largest database of facial expression, valence, and arousal in the wild, enabling research in automated facial expression recognition in two different emotion models. Two baseline deep neural networks are used to classify images in the categorical model and predict the intensity of valence and arousal. Various evaluation metrics show that our deep neural network baselines can perform better than conventional machine learning methods and off-the-shelf facial expression recognition systems.
Article
Emotion recognition is challenging due to the emotional gap between emotions and audio-visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio-visual segment features with Convolutional Neural Networks (CNN) and 3D-CNN, then fuses the audio-visual segment features in a Deep Belief Network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined into a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio-visual segment feature representation. After average-pooling the segment features learned by the DBN to form a fixed-length global video feature, a linear Support Vector Machine (SVM) is used for video emotion classification. Experimental results on three public audio-visual emotional databases, including the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN and DBN for audio-visual emotion recognition.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
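For reference, the attention mechanism at the core of the Transformer is scaled dot-product attention,

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; multi-head attention applies this operation in parallel over several learned projections.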
Article
This paper presents a novel and efficient Deep Fusion Convolutional Neural Network (DF-CNN) for multi-modal 2D+3D Facial Expression Recognition (FER). DF-CNN comprises a feature extraction subnet, a feature fusion subnet and a softmax layer. In particular, each textured 3D face scan is represented as six types of 2D facial attribute maps (i.e., geometry map, three normal maps, curvature map, and texture map), all of which are jointly fed into DF-CNN for feature learning and fusion learning, resulting in a highly concentrated facial representation (32-dimensional). Expression prediction is performed in two ways: 1) learning linear SVM classifiers using the 32-dimensional fused deep features; 2) directly performing softmax prediction using the 6-dimensional expression probability vectors. Different from existing 3D FER methods, DF-CNN combines feature learning and fusion learning into a single end-to-end training framework. To demonstrate the effectiveness of DF-CNN, we conducted comprehensive experiments to compare the performance of DF-CNN with handcrafted features, pre-trained deep features, fine-tuned deep features, and state-of-the-art methods on three 3D face datasets (i.e., BU-3DFE Subset I, BU-3DFE Subset II, and Bosphorus Subset). In all cases, DF-CNN consistently achieved the best results. To the best of our knowledge, this is the first work introducing deep CNNs to 3D FER and deep learning based feature-level fusion for multi-modal 2D+3D FER.
Article
We present a new action recognition deep neural network which adaptively learns the best action velocities in addition to the classification. While deep neural networks have reached maturity for image understanding tasks, we are still exploring network topologies and features to handle the richer environment of video clips. Here, we tackle the problem of multiple velocities in action recognition, and provide state-of-the-art results for facial expression recognition, on known and new collected datasets. We further provide the training steps for our semi-supervised network, suited to learn from huge unlabeled datasets with only a fraction of labeled examples.
Article
Emotion analysis is a crucial problem for endowing machines with real intelligence in many large potential applications. As external manifestations of human emotions, electroencephalogram (EEG) signals and facial video signals are widely used to track and analyze humans' affective information. Based on their common characteristics of spatial-temporal volumes, in this paper we propose a novel deep learning framework named spatial-temporal recurrent neural network (STRNN) to unify the learning of the two different signal sources into a spatial-temporal dependency model. In STRNN, to capture spatially co-occurrent variations of human emotions, a multi-directional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the spatial region of each time slice from multiple angles. A bi-directional temporal RNN layer is then used to learn discriminative temporal dependencies from the sequences concatenating the spatial features of each time slice produced by the spatial RNN layer. To further select salient regions of emotion representation, we impose sparse projection onto the hidden states of the spatial and temporal domains, which also increases the model's discriminant ability because of this global consideration. Consequently, such a two-layer RNN model builds spatial dependencies as well as temporal dependencies of the input signals. Experimental results on public emotion datasets of EEG and facial expression demonstrate that the proposed STRNN method is more competitive than state-of-the-art methods.
Conference Paper
Most existing Speech Emotion Recognition (SER) systems rely on turn-wise processing, which aims at recognizing emotions from complete utterances, and on an overly complicated pipeline marred by many preprocessing steps and hand-engineered features. To overcome both drawbacks, we propose a real-time SER system based on end-to-end deep learning. Namely, a Deep Neural Network (DNN) that recognizes emotions from a one-second frame of raw speech spectrograms is presented and investigated. This is achievable due to a deep hierarchical architecture, data augmentation, and sensible regularization. Promising results are reported on two databases: the eNTERFACE database and the Surrey Audio-Visual Expressed Emotion (SAVEE) database.
Article
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the model’s performance.