Conference Paper

Segment-Based Speech Emotion Recognition Using Recurrent Neural Networks

Authors: Efthymios Tzinis and Alexandros Potamianos

Abstract

Recently, Recurrent Neural Networks (RNNs) have produced state-of-the-art results for Speech Emotion Recognition (SER). The choice of the appropriate timescale for Low Level Descriptors (LLDs) (local features) and statistical functionals (global features) is key for a high performing SER system. In this paper, we investigate both local and global features and evaluate the performance at various timescales (frame, phoneme, word or utterance). We show that for RNN models, extracting statistical functionals over speech segments that roughly correspond to the duration of a couple of words produces optimal accuracy. We report state-of-the-art SER performance on the IEMOCAP corpus at a significantly lower model and computational complexity.
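To make the segment-based recipe in the abstract concrete, here is a minimal sketch (not the authors' code) that computes statistical functionals (mean, standard deviation, min, max) over overlapping fixed-length segments of frame-level LLDs and classifies the segment sequence with an LSTM; the segment length, hop, functionals, and layer sizes are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def segment_functionals(llds, seg_len=100, hop=50):
    """llds: (n_frames, n_lld) frame-level features; returns (n_segments, 4*n_lld)."""
    segments = []
    for start in range(0, max(1, llds.shape[0] - seg_len + 1), hop):
        seg = llds[start:start + seg_len]
        segments.append(np.concatenate(
            [seg.mean(0), seg.std(0), seg.min(0), seg.max(0)]))
    return np.stack(segments)

class SegmentLSTM(nn.Module):
    def __init__(self, n_in, n_hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):               # x: (batch, n_segments, n_in)
        _, (h, _) = self.lstm(x)        # keep the final hidden state
        return self.out(h[-1])          # (batch, n_classes) emotion logits

# Example: 500 frames of 34 LLDs -> segment-level functionals -> 4-class logits
llds = np.random.randn(500, 34).astype(np.float32)
segs = torch.from_numpy(segment_functionals(llds)).unsqueeze(0)
logits = SegmentLSTM(n_in=segs.shape[-1])(segs)
```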


... Examples of widely used acoustic features are mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCC), short-time energy, fundamental frequency (F0), formants [5,6], etc. Traditional classification techniques include probabilistic models such as the Gaussian mixture model (GMM) [6][7][8], the hidden Markov model (HMM) [9], and the support vector machine (SVM) [10][11][12]. Over the years of research, various artificial neural network architectures have also been utilised, from the simplest multilayer perceptron (MLP) [8] through extreme learning machines (ELM) [13] and convolutional neural networks (CNNs) [14,15], to deep architectures of residual neural networks (ResNets) [16] and recurrent neural networks (RNNs) [17,18]. In particular, long short-term memory (LSTM) and gated recurrent unit (GRU)-based neural networks (NNs), as state-of-the-art solutions in time-sequence modelling, have been widely utilised in speech signal modelling. ...
... Tzinis and Potamianos [17] studied the effects of variable sequence lengths for LSTM-based recognition (see Section 4 for the RNN-LSTM description). Recognition on sequences concatenated at the frame level yielded better results at roughly phoneme length (90 ms). ...
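As a concrete illustration of the frame-level LLDs listed in these excerpts, the sketch below extracts MFCCs, short-time energy, and F0 with librosa; the library choice, frame/hop settings, and the placeholder waveform are assumptions for illustration and are not taken from any cited system.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)   # placeholder 2 s waveform

# Frame-level LLDs at a 10 ms hop (160 samples at 16 kHz)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
energy = librosa.feature.rms(y=y, hop_length=160)
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=160)
print(mfcc.shape, energy.shape, f0.shape)        # one value (or vector) per frame
```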
Article
Full-text available
Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules also play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed in time in multimedia content. The attention mechanism has recently been incorporated into DNN architectures to also emphasise emotionally salient information. This paper provides a review of recent developments in SER and examines the impact of various attention mechanisms on SER performance. An overall comparison of system accuracies is performed on the widely used IEMOCAP benchmark database.
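As an illustration of the kind of attention mechanism this review surveys, the sketch below applies soft attention pooling over bidirectional GRU outputs to produce an utterance-level emotion prediction; the scoring layer, hidden size, and four-class output are illustrative assumptions rather than any specific published architecture.

```python
import torch
import torch.nn as nn

class AttentiveGRU(nn.Module):
    def __init__(self, n_in, n_hidden=64, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(n_in, n_hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * n_hidden, 1)       # one scalar score per frame
        self.out = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, x):                             # x: (batch, frames, n_in)
        h, _ = self.gru(x)                            # (batch, frames, 2*hidden)
        alpha = torch.softmax(self.score(h), dim=1)   # attention weights over frames
        context = (alpha * h).sum(dim=1)              # weighted frame average
        return self.out(context)

logits = AttentiveGRU(n_in=40)(torch.randn(2, 300, 40))
```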
... According to the opinions of doctors, the most prevalent acoustic factors to consider are the Teager Energy Operator (TEO), the zero-crossing rate (ZCR), the MFCC, and the filter bank energies (FBE). A method for detecting emotions is made available that is based on Support Vector Machines with 39 features; in this study, the MFCC, HNR, ZCR, and TEO are used [9]. We are going to examine the performance of two different systems. ...
... The present investigation makes use of two kinds of auto-encoders: the basic AE and the stacked autoencoder [15]. The total number of hidden layers varies between the two kinds; the standard AE has only one, whereas stacked autoencoders consist of two or more encoders [9]. ...
Article
Emotional state identification based on the analysis of vocalisations is a challenging subject in the field of Human-Computer Interaction (HCI). In research on speech emotion recognition (SER), a wide range of approaches has been used to extract emotions from a variety of inputs, including a number of well-known methods for speech analysis and categorization. Recent research has suggested deep learning algorithms as potential alternatives to the approaches traditionally used in SER. This article offers a summary of deep learning methodologies and of current research employing them to identify the emotions conveyed by verbal expressions. The analysis considers the emotions recorded in the databases that were utilised, the contributions to both speech and emotion processing, the limitations that were found, and the discoveries that were made. Keywords: Speech emotions, Real-time Speech Classification, Transfer Learning, HCI Bandwidth Reduction, SER, LSTM
... Furthermore, speech emotion recognition is studied in [14] [15] [16] [17], in which the authors focus on proposing SER algorithms based on concatenated convolutional neural networks (CNN) and RNNs without using any hand-crafted feature extraction. The authors in [18] studied speech emotion recognition by extracting statistical features over segments of speech, where the segments roughly correspond to the duration of a couple of words. ...
... The second one uses deep learning algorithms. However, nowadays, most of the classification processes are done using deep neural networks as they can do feature selection and extraction at the same time [18]. Table III presents the common classifiers that are used in recognizing emotions by using speech. ...
Preprint
Full-text available
Recognizing the patient's emotions using deep learning techniques has attracted significant attention recently due to technological advancements. Automatically identifying the emotions can help build smart healthcare centers that can detect depression and stress among the patients in order to start the medication early. Using advanced technology to identify emotions is one of the most exciting topics as it defines the relationships between humans and machines. Machines learned how to predict emotions by adopting various methods. In this survey, we present recent research in the field of using neural networks to recognize emotions. We focus on studying emotions' recognition from speech, facial expressions, and audio-visual input and show the different techniques of deploying these algorithms in the real world. These three emotion recognition techniques can be used as a surveillance system in healthcare centers to monitor patients. We conclude the survey with a presentation of the challenges and the related future work to provide an insight into the applications of using emotion recognition.
... Within this context, the feature representation is used to train different machine learning systems, such as support vector machines [23][24][25], 1D-convolutional neural networks (1D-CNN) [26,27], 2D-CNN [28][29][30][31][32] or recurrent neural networks (RNN) [33][34][35][36]. ...
... The low-level feature descriptors extracted from both the audio and video streams are concatenated and fed to a support vector machine (SVM) classifier in order to obtain the final prediction. Another SER framework, based on RNNs, that incorporates both low-level descriptors (local features) and statistical characteristics (global features), is introduced in [35]. The extracted features are used to determine emotions at various temporal scales, including frame, phoneme, word and utterance. ...
Article
Full-text available
Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method, based on the Squeeze and Excitation ResNet (SE-ResNet) model fed with spectrogram inputs. In order to overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the RAVDESS and CREMA-D publicly available datasets, respectively. When compared with the results provided by human observers, the gains in global accuracy scores are superior to 24%. Finally, the objective comparative evaluation with state-of-the-art techniques demonstrates accuracy gains of more than 3%.
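A minimal sketch of the triplet-loss idea underlying the embedding approach above, using PyTorch's standard triplet margin loss; the paper's emotion-specific constraints on the loss are not reproduced, and the embedding network, margin, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
criterion = nn.TripletMarginLoss(margin=1.0)

anchor   = embed(torch.randn(16, 128))   # utterances of the anchor emotion
positive = embed(torch.randn(16, 128))   # same emotion as the anchor
negative = embed(torch.randn(16, 128))   # a different emotion
loss = criterion(anchor, positive, negative)
loss.backward()
```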
... Moreover, this technique combines the power of the recently established LSM, SNN, and the source-filter model of human speech production. Tzinis and Potamianos [28] used local and global features, i.e., Low-Level Descriptors (LLDs) and statistical functionals, to train LSTM units and to probe the appropriate decision timescale for each unit. The suggested model used a segment-wise learning approach combining global and local features to improve the accuracy of decisions about emotional context. ...
Article
Full-text available
Accurate emotion detection from speech utterances has recently been a challenging and active research area. Speech emotion recognition (SER) systems play an essential role in human-machine interaction, virtual reality, emergency services, and many other real-time systems. It is an open-ended problem, as subjects from different regions and linguistic backgrounds convey emotions altogether differently. Conventional approaches used low-level features from audio samples, such as energy and pitch, for classification, but were neither accurate enough at detecting emotions nor well generalized. With recent advancements in computer vision and neural networks, high-level features can be extracted and more accurate recognition achieved. This study proposes an ensemble deep CNN + Bi-LSTM-based framework for speech emotion recognition and the classification of seven different emotions. The paralinguistic log Mel-frequency spectral coefficients (MFSC) are used as features to train the proposed architecture. The proposed hybrid model is validated on the TESS and SAVEE datasets. Experimental results have indicated a classification accuracy of 96.36%. The proposed model is compared with existing models, demonstrating the superiority of the proposed hybrid deep CNN and Bi-LSTM model.
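A minimal sketch of a CNN feeding a Bi-LSTM over mel-spectral frames, the general hybrid described in this abstract; filter counts, kernel sizes, pooling, and the seven-class head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    def __init__(self, n_mels=40, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                 # pool over frequency only
        )
        self.lstm = nn.LSTM(16 * (n_mels // 2), 64,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, n_classes)

    def forward(self, spec):                      # spec: (batch, n_mels, frames)
        z = self.conv(spec.unsqueeze(1))          # (batch, 16, n_mels//2, frames)
        z = z.permute(0, 3, 1, 2).flatten(2)      # (batch, frames, 16*n_mels//2)
        h, _ = self.lstm(z)
        return self.out(h.mean(dim=1))            # average over time -> logits

logits = CnnBiLstm()(torch.randn(2, 40, 200))
```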
... Traditional classification algorithms for SER tasks consist of probabilistic models such as the Hidden Markov Model, the Gaussian Mixture Model (GMM) [6], or the Support Vector Machine (SVM) [7]. Recently, with the continuous improvement of machine learning and deep learning models, various methods have been implemented to solve the SER challenge, such as ELM [8], CNN [9], [10], ResNets [11], LSTM, RNNs [12], [13], etc. Mel-spectrograms were extracted from speech and singing recordings as input to a multi-task gated residual network, which reached an accuracy of 65.97% [14]. ...
Conference Paper
Full-text available
Speech is the most direct method of human communication with a high level of efficiency, and it contains a lot of information about the speaker's feelings. The ability to recognize and distinguish between different emotions in spoken sentences is a necessary component of intelligent human-computer interaction (HCI) applications. For the purpose of creating a more natural and intuitive way of communicating between humans and automation control systems, emotional expressions conveyed through signal forms need to be recognized and processed accordingly. In this paper, the authors propose a parallel deep learning architecture combining SENet, a CNN block, and a Transformer with multi-head attention to effectively distinguish the features of different emotional states in user voice recordings. Speech samples from the open-source RAVDESS dataset were used to assess the performance of the model. The proposed model achieved a best average accuracy of 82.67% on the test set. Keywords: Speech emotion recognition, parallel deep learning, Mel-Spectrogram, Multi-head attention, Transformer. *Nguyen Gia Minh Thao is the corresponding author.
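A minimal sketch of combining a CNN branch and a Transformer encoder with multi-head attention over mel-spectrogram frames, in the spirit of the parallel design above; the dimensions, number of heads, fusion by concatenation, and eight-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParallelCnnTransformer(nn.Module):
    def __init__(self, n_mels=64, n_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # summarize the CNN branch over time
        )
        self.proj = nn.Linear(n_mels, 64)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=64, nhead=4, batch_first=True)
        self.out = nn.Linear(64 + 64, n_classes)

    def forward(self, spec):                      # spec: (batch, n_mels, frames)
        cnn_feat = self.cnn(spec).squeeze(-1)     # (batch, 64)
        tokens = self.proj(spec.transpose(1, 2))  # (batch, frames, 64)
        att_feat = self.encoder(tokens).mean(dim=1)
        return self.out(torch.cat([cnn_feat, att_feat], dim=-1))

logits = ParallelCnnTransformer()(torch.randn(2, 64, 300))
```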
... In addition, RNNs without any other ANN networks are also used for many different problems, especially sequential ones. RNNs are widely used for speech emotion recognition by different researchers (Lee and Tashev, 2015;Tzinis and Potamianos, 2017;Li et al., 2021). Table 1 shows the full forms of the acronyms used in this survey. ...
Article
Full-text available
A deepfake is content or material that is synthetically generated or manipulated using artificial intelligence (AI) methods, to be passed off as real and can include audio, video, image, and text synthesis. The key difference between manual editing and deepfakes is that deepfakes are AI generated or AI manipulated and closely resemble authentic artifacts. In some cases, deepfakes can be fabricated using AI-generated content in its entirety. Deepfakes have started to have a major impact on society with more generation mechanisms emerging everyday. This article makes a contribution in understanding the landscape of deepfakes, and their detection and generation methods. We evaluate various categories of deepfakes especially in audio. The purpose of this survey is to provide readers with a deeper understanding of (1) different deepfake categories; (2) how they could be created and detected; (3) more specifically, how audio deepfakes are created and detected in more detail, which is the main focus of this paper. We found that generative adversarial networks (GANs), convolutional neural networks (CNNs), and deep neural networks (DNNs) are common ways of creating and detecting deepfakes. In our evaluation of over 150 methods, we found that the majority of the focus is on video deepfakes, and, in particular, the generation of video deepfakes. We found that for text deepfakes, there are more generation methods but very few robust methods for detection, including fake news detection, which has become a controversial area of research because of the potential heavy overlaps with human generation of fake content. Our study reveals a clear need to research audio deepfakes and particularly detection of audio deepfakes. This survey has been conducted with a different perspective, compared to existing survey papers that mostly focus on just video and image deepfakes. This survey mainly focuses on audio deepfakes that are overlooked in most of the existing surveys. This article's most important contribution is to critically analyze and provide a unique source of audio deepfake research, mostly ranging from 2016 to 2021. To the best of our knowledge, this is the first survey focusing on audio deepfakes generation and detection in English.
... For example, Alexandra et al. propose a deep-learning-based end-to-end model for an emotion recognition system. In this work, each modality-specific network (a visual network and a speech CNN) is trained before multi-modal model training [39]. To evaluate their model, they use audio and video data from the RECOLA database. ...
Conference Paper
Emotions, or affects, are a vital part of our daily lives. Human emotions are complex to understand due to their variations, but researchers and psychologists have made a variety of attempts to describe and perceive human emotions from different perspectives. Over the years, psychologists and researchers have proposed several emotion theories to understand human emotions and emotional states in different ways; these theories attempt to evaluate emotion in specific respects. This paper presents a systematic and chronological review of emotion research, including a historical review of emotion theories, an analysis of categorical and dimensional emotion models and their differences, an analysis of different unimodal emotion methodologies, and a review of multimodal emotion methodologies. From this historical review, we conclude that the evidence available to prove or disprove the presence of coherent emotional expressions is insufficient.
... The emotional content of speech changes over time, so it is appropriate to employ temporal modeling approaches for SER. Tzinis and Potamianos [18] investigated the effect of variable-length utterances on emotion recognition using LSTMs. Recognition on sequences concatenated at the frame level gave better results with a total input length of 90 ms (phoneme level). ...
... More specifically, some of these models used the ability of Convolutional Neural Networks (CNN) to learn features from input signals (Bertero and Fung, 2017; Mekruksavanich et al., 2020). Another type of model makes use of the sequential nature of speech signals and utilizes Recurrent Neural Network (RNN) architectures such as long short-term memory (LSTM) (Tzinis and Potamianos, 2017; Fayek et al., 2017). Some models combine both types of architectures, as in ConvLSTM (Kurpukdee et al., 2017). ...
Preprint
Full-text available
In this paper, we propose a new methodology for emotional speech recognition using visual deep neural network models. We employ the transfer learning capabilities of pre-trained computer vision deep models to address the emotion recognition in speech task. In order to achieve that, we propose to use a composite set of acoustic features and a procedure to convert them into images. Besides, we present a training paradigm for these models that takes into consideration the different characteristics of acoustic-based images compared to regular ones. In our experiments, we use the pre-trained VGG-16 model and test the overall methodology on the Berlin EMO-DB dataset for speaker-independent emotion recognition. We evaluate the proposed model on the full list of the seven emotions and the results set a new state of the art.
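A minimal sketch, assuming torchvision is available, of reusing a pre-trained VGG-16 on image-like acoustic-feature inputs as described above; replicating a single channel to three and the seven-class replacement head are illustrative assumptions (the model call downloads ImageNet weights).

```python
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)  # downloads ImageNet weights
vgg.classifier[6] = nn.Linear(4096, 7)             # replace the final layer: 7 emotions

feature_image = torch.randn(2, 1, 224, 224)        # acoustic features rendered as images
logits = vgg(feature_image.repeat(1, 3, 1, 1))     # replicate to 3 input channels
```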
... They achieved a maximum unweighted accuracy of 63.98% by utilizing an HSF-CRNN system. In another experiment involving an RNN, the authors of [26] used an LSTM neural network to achieve an unweighted accuracy of 60.02%. The authors in [27][28][29] used similar RNN and LSTM systems to approach SER, achieving accuracies of 63.5%, 58.7%, and 63.89%, respectively. ...
Article
Full-text available
Recognizing human emotions by machines is a complex task. Deep learning models attempt to automate this process by rendering machines to exhibit learning capabilities. However, identifying human emotions from speech with good performance is still challenging. With the advent of deep learning algorithms, this problem has been addressed recently. However, most research work in the past focused on feature extraction as only one method for training. In this research, we have explored two different methods of extracting features to address effective speech emotion recognition. Initially, two-way feature extraction is proposed by utilizing super convergence to extract two sets of potential features from the speech data. For the first set of features, principal component analysis (PCA) is applied to obtain the first feature set. Thereafter, a deep neural network (DNN) with dense and dropout layers is implemented. In the second approach, mel-spectrogram images are extracted from audio files, and the 2D images are given as input to the pre-trained VGG-16 model. Extensive experiments and an in-depth comparative analysis over both the feature extraction methods with multiple algorithms and over two datasets are performed in this work. The RAVDESS dataset provided significantly better accuracy than using numeric features on a DNN.
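As an illustration of the first feature branch described above (numeric features reduced with PCA, then classified by a dense network with dropout), here is a minimal sketch; the feature dimensionality, layer sizes, and eight-class head are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

features = np.random.randn(200, 180).astype(np.float32)   # utterance-level features
reduced = PCA(n_components=40).fit_transform(features)    # first feature set

dnn = nn.Sequential(
    nn.Linear(40, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 8),                       # e.g. eight emotion classes
)
logits = dnn(torch.from_numpy(reduced.astype(np.float32)))
```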
... Han et al. [4] constructed utterance-level features from segment-level probability distributions using deep neural networks. Tzinis et al. [5] introduced recurrent neural networks to capture emotional temporal information, but only considered the final state of the recurrent layers. Chao et al. [6] compared different pooling methods and highlighted the advantage of mean pooling. ...
... Finally, they estimated statistics over these curves, which were used as the sentence-level feature representation of a static classifier implemented with an extreme learning machine (ELM). Tzinis and Potamianos [17] employed HLDs to represent segment-wise global features from LLDs, which obtained better performance compared to LLDs under an LSTM model. Tarantino et al. [18] built a self-attention model by setting different step sizes for the input data chunks. ...
... RNNs have achieved state-of-the-art performance in a variety of NLP tasks such as language modeling [148][149][150][151], text-to-speech [152][153][154], machine translation [155], speaker diarization [156,157], natural language generation [158,159], natural language understanding [160,161], question answering [162,163], and chatbot applications [164][165][166]. More recently, RNNs have achieved state-of-the-art results for speech emotion recognition [167][168][169], emotion detection [170], and many other tasks. ...
Article
Full-text available
The main objective of this paper is to provide a state-of-the-art survey on deep learning methods and applications. It starts with a short introduction to deep learning and its three main types of learning approaches: supervised learning, unsupervised learning and reinforcement learning. In the following, deep learning is presented along with a review of state-of-the-art methods including feed-forward neural networks, recurrent neural networks, convolutional neural networks and their extended variants. Then a brief overview of the application of deep neural networks in various domains of science and industry is given. Finally, conclusions are drawn in the last section.
... Global features also typically lead to algorithms that are less computationally complex than local features, and that may have fewer trainable parameters, and hence be less susceptible to overfitting. Tzinis and Potamianos (2017) found a mechanism to integrate the best aspects of global and local features: global features (functional statistics of the low-level descriptors) were computed over windows of 3 s duration, overlapping by half, then processed by an LSTM in order to compute a single emotion score for the duration of an 8 s utterance. Applying this algorithm to the IEMOCAP corpus (Busso et al., 2008) resulted in a UAR below the state of the art, but a weighted accuracy above the state of the art (64.16%). ...
Article
Full-text available
Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble) or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh) or whi (whisper). Linear discriminant analysis (LDA) was selected as a baseline classifier, because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN), and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented by extra voice quality, phonetic, and prosodic features, each targeting perceptual features of one or more of the vocalization types. Three web data augmentation and transfer learning methods were tested: pre-training of network weights for a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores and experiments of using weighted and unweighted samplers were also tested. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. In terms of unweighted-average recall of CRIED dataset, the CNSA achieved the best UAR compared with previous studies. In terms of classification accuracy, weighted F1, and macro F1 of our own dataset, the neural networks both significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type of overlapped features were selected, and their spectrograms are presented, and discussed with respect to the type-discriminative acoustic features selected by various algorithms. MFCC, log Mel Frequency Band Energy, LSP frequency, and F1 are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
... From speech, Han et al. [2] extracted 238 Low-Level Descriptors (LLDs) at the speech frame level using openSMILE [3] and had them automatically aligned with emotional labels using a recurrent neural network based Connectionist Temporal Classification (CTC) model. Tzinis et al. [4] compared the impact of the choice of timescale for a variety of LLDs (local features) and the corresponding global features applied to an RNN. As in other fields, there has been an explosion of deep learning techniques applied to emotion recognition to extract high-level features from raw data. ...
Article
Full-text available
Human communication includes rich emotional content, thus the development of multimodal emotion recognition plays an important role in communication between humans and computers. Because of the complex emotional characteristics of a speaker, emotional recognition remains a challenge, particularly in capturing emotional cues across a variety of modalities, such as speech, facial expressions, and language. Audio and visual cues are particularly vital for a human observer in understanding emotions. However, most previous work on emotion recognition has been based solely on linguistic information, which can overlook various forms of nonverbal information. In this paper, we present a new multimodal emotion recognition approach that improves the BERT model for emotion recognition by combining it with heterogeneous features based on language, audio, and visual modalities. Specifically, we improve the BERT model due to the heterogeneous features of the audio and visual modalities.We introduce the Self-Multi-Attention Fusion module, Multi-Attention fusion module, and Video Fusion module, which are attention based multimodal fusion mechanisms using the recently proposed transformer architecture. We explore the optimal ways to combine fine-grained representations of audio and visual features into a common embedding while combining a pre-trained BERT model with modalities for fine-tuning. In our experiment, we evaluate the commonly used CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets for multimodal sentiment analysis. Ablation analysis indicates that the audio and visual components make a significant contribution to the recognition results, suggesting that these modalities contain highly complementary information for sentiment analysis based on video input. Our method shows that we achieve state-of-the-art performance on the CMU-MOSI, CMU-MOSEI, and IEMOCAP dataset.
... They demonstrated the improved performance of segment-level models by comparing their results with the ones obtained by modeling the entire sentence. Tzinis and Potamianos [32] employed HLDs to represent segment-wise global features from LLDs, which obtained better performance compared to LLDs under an LSTM model. Tarantino et al. [33] and Sahoo et al. [34] found that a smaller step size (i.e., more overlap between chunks) can increase the discrimination of the feature representation in the network. ...
Article
Full-text available
A critical issue of current speech-based sequence-to-one learning tasks, such as speech emotion recognition(SER), is the dynamic temporal modeling for speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences with varied length. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques that deal with varied length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks that have the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extractions and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results based on three databases demonstrate that the proposed framework provides: 1) improvement in recognition accuracy, 2) robustness toward different temporal length predictions, and 3) high model computational efficiency advantages.
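To make the dynamic-chunking idea concrete, the sketch below maps feature sequences of different lengths onto a fixed number of equal-length chunks by spacing the chunk start positions evenly, so shorter inputs get more overlap and longer ones less; it is an illustrative reconstruction under these assumptions, not the authors' released code.

```python
import numpy as np

def dynamic_chunks(seq, n_chunks=10, chunk_len=100):
    """seq: (n_frames, n_feat). Returns (n_chunks, chunk_len, n_feat)."""
    n_frames = seq.shape[0]
    if n_frames < chunk_len:                      # pad very short inputs once
        seq = np.pad(seq, ((0, chunk_len - n_frames), (0, 0)))
        n_frames = chunk_len
    # Evenly spaced start positions: the overlap adapts to the sequence length.
    starts = np.linspace(0, n_frames - chunk_len, n_chunks).astype(int)
    return np.stack([seq[s:s + chunk_len] for s in starts])

short = dynamic_chunks(np.random.randn(250, 40))   # heavy overlap
long_ = dynamic_chunks(np.random.randn(2000, 40))  # light overlap
print(short.shape, long_.shape)                    # (10, 100, 40) in both cases
```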
... As discussed in the introduction, several existing studies have applied deep learning (e.g., RNNs and CNNs) to SER tasks [22][23][24]. The advantages of RNNs have been reported in the context of SER [1][2][3][25][26]. ...
Article
Full-text available
Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the extended application of techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we apply the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism, based on a self-attention algorithm, in order to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, speech recognition, and then transfers this knowledge to SER. Evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus demonstrates the effectiveness of the novel model.
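A minimal sketch of the input representation named above (a log-Mel spectrogram stacked with its deltas and delta-deltas), using librosa; the mel-band count, hop length, and placeholder waveform are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)     # placeholder 2 s waveform

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=160)
log_mel = librosa.power_to_db(mel)
d1 = librosa.feature.delta(log_mel)                # first-order deltas
d2 = librosa.feature.delta(log_mel, order=2)       # delta-deltas
x = np.stack([log_mel, d1, d2])                    # (3, n_mels, frames) input tensor
print(x.shape)
```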
... However, different deep learning acoustic models have their own pros and cons. For example, recurrent neural networks (RNN) are good at dealing with time-series information [18][19][20], convolutional neural networks (CNN) do well at capturing spatial information [21][22][23], and deep residual networks (DRN) can tackle the exploding- and vanishing-gradient problems that arise as network layers become deeper [24][25][26]. Some representative deep learning acoustic models are summarized in detail as follows. ...
Article
Full-text available
Speech emotion recognition is a challenging task in natural language processing. It relies heavily on the effectiveness of speech features and acoustic models. However, existing acoustic models may not handle speech emotion recognition efficiently due to their built-in limitations. In this work, a novel deep-learning acoustic model called attention-based skip convolution bi-directional long short-term memory, abbreviated as SCBAMM, is proposed to recognize speech emotion. It has eight hidden layers, namely, two dense layers, a convolutional layer, a skip layer, a mask layer, a Bi-LSTM layer, an attention layer, and a pooling layer. SCBAMM makes better use of spatiotemporal information and captures emotion-related features more effectively. In addition, it mitigates the problems of exploding and vanishing gradients in deep learning to some extent. On the EMO-DB and CASIA databases, the proposed SCBAMM model achieves accuracy rates of 94.58% and 72.50%, respectively. As far as we know, compared with peer models, this is the best accuracy rate.
... Coupled with speech variations among speakers and dynamical features with low saliency, SER is a challenging problem. Previous speech emotion studies have used handcrafted features consisting of low-level descriptors (LLDs), local features such as Mel-frequency cepstral coefficients (MFCCs), energy, or pitch, and have also considered global features by calculating local feature statistics [6,7]. However, extracting handcrafted features requires expensive manual labor and the data quality depends on expert knowledge of labelers. ...
Article
Full-text available
Speech emotion recognition predicts the emotional state of a speaker based on the person’s speech. It brings an additional element for creating more natural human–computer interactions. Earlier studies on emotional recognition have been primarily based on handcrafted features and manual labels. With the advent of deep learning, there have been some efforts in applying the deep-network-based approach to the problem of emotion recognition. As deep learning automatically extracts salient features correlated to speaker emotion, it brings certain advantages over the handcrafted-feature-based methods. There are, however, some challenges in applying them to the emotion recognition problem, because data required for properly training deep networks are often lacking. Therefore, there is a need for a new deep-learning-based approach which can exploit available information from given speech signals to the maximum extent possible. Our proposed method, called “Fusion-ConvBERT”, is a parallel fusion model consisting of bidirectional encoder representations from transformers and convolutional neural networks. Extensive experiments were conducted on the proposed model using the EMO-DB and Interactive Emotional Dyadic Motion Capture Database emotion corpus, and it was shown that the proposed method outperformed state-of-the-art techniques in most of the test configurations.
Chapter
Automatic recognition of human emotions is a relatively new field, and is attracting significant attention in research and development areas because of the major contribution it could make to real applications. The current study focuses on far-field speech emotion recognition using the state-of-the-art spontaneous IEMOCAP emotional data. For classification, a method based on deep convolutional neural networks (DCNN) and extremely randomized trees is proposed. The method is also compared to support vector machines (SVM) and probabilistic linear discriminant analysis (PLDA) classifiers in the i-vector paradigm. When reverberant speech was classified using the proposed method, the classification rates were comparable to those obtained when using clean data. In the case of PLDA and SVM classifiers, the classification rates were significantly decreased. To further improve the performance of far-field speech emotion recognition, a method based on multi-style training is proposed, which results in significant improvements in the classification rates.
Chapter
Artificial intelligence technology has already been applied in educational settings, and automatic detection of learning state has attracted the attention of many researchers. This paper summarizes the main types of learning state that researchers currently pay attention to, including affect, engagement, attention, and cognitive load. Based on four typical learning scenarios: computer-based learning, mobile learning, traditional classroom-based learning, and individual computer-free learning, this paper discusses the shortcomings and development trends of the detection hardware and methods used in this field, and the social problems involved in obtaining large amounts of personal privacy data.
Chapter
Language is an effective way to express human emotions, but emotions are difficult for computers to describe and judge, so analyzing emotions from speech is an important task. We summarize the current state of speech emotion recognition from five aspects: the development of speech emotion recognition, emotion description models, emotional speech databases, feature extraction, and emotion recognition algorithms. By summarizing and analyzing these five aspects, we can anticipate the future development trends of speech emotion recognition and improve recognition accuracy by combining speech with other information for joint analysis. In this paper, we summarize the commonly used emotion databases, compare the attention mechanisms popular in speech emotion recognition, and finally discuss the prospects of speech emotion recognition. Keywords: Speech emotion recognition, Recognition algorithm, Attention mechanism
Article
Speech emotion recognition is a crucial stream in affective computing and also raises several issues owing to its processing complexity. The efficiency of acoustic methods and their speech features has been improved by various existing methods; yet, conventional acoustic methods are not effective at handling speech emotion recognition because of their drawbacks. The main intent of this research is to implement a new speech emotion recognition approach using a hybrid deep learning model. Initially, speech emotion recognition datasets are gathered from public sources and put forward for pre-processing using artifact removal and filtering techniques. Then, feature extraction from the speech signals is performed using Mel-Frequency Cepstral Coefficients (MFCC), mel-scale spectrograms, tonal power, and spectral flux. With the aim of decreasing the feature size and boosting learning performance, optimal features are selected by the Deer Hunting with Adaptive Search (DH-AS) algorithm. These optimal features are used for emotion classification by Hybrid Deep Learning (HDL) with a "Deep Neural Network (DNN) and Recurrent Neural Network (RNN)". These two networks are enhanced by the developed DH-AS and can thus reach high classification accuracy when classifying emotions such as happy, sad, anger, fear, and calm. The suggested DH-AS-HDL achieves 3.15%, 5.37%, 4.25% and 4.81% better accuracy than PSO-HDL, GWO-HDL, WOA-HDL and DHOA-HDL, respectively, when the learning rate is 85. The achieved results prove that the developed model obtains superior performance, as evaluated through various performance metrics.
Article
Full-text available
Automatic Emotion Speech Recognition (ESR) is considered an active research field in the Human-Computer Interface (HCI) area. Typically, an ESR system consists of two main parts: the front-end (feature extraction) and the back-end (classification). However, most previous ESR systems have focused on the feature extraction part only and ignored the classification part, although the classification process is an essential part of ESR systems, since its role is to map the features extracted from audio samples to the corresponding emotion. Moreover, the evaluation of most ESR systems has been conducted based on the Subject Independent (SI) scenario only. Therefore, in this paper, we focus on the back-end (classification), where we adopt our recently developed Extreme Learning Machine (ELM), called Optimized Genetic Algorithm-Extreme Learning Machine (OGA-ELM). In addition, we use the Mel Frequency Cepstral Coefficients (MFCC) method to extract the features from the speech utterances. This work proves the significance of the classification part in ESR systems, where it improves ESR performance in terms of achieving higher accuracy. The performance of the proposed model was evaluated on the Berlin Emotional Speech (BES) dataset, which consists of 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). Four different evaluation scenarios were conducted: Subject Dependent (SD), SI, Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The performance of the OGA-ELM was very impressive in the four different scenarios, achieving accuracies of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. In addition, the proposed ESR system showed a fast execution time for identifying the emotions in all experiments.
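A minimal sketch of a basic extreme learning machine (ELM) classifier of the kind extended above: a fixed random hidden layer followed by a closed-form least-squares readout; the genetic-algorithm optimisation of OGA-ELM is not reproduced, and the feature and layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 39))                # e.g. 39 MFCC-based features
y = rng.integers(0, 7, size=300)                  # 7 emotion labels
T = np.eye(7)[y]                                  # one-hot targets

W = rng.standard_normal((39, 200))                # random input weights (kept fixed)
b = rng.standard_normal(200)
H = np.tanh(X @ W + b)                            # hidden-layer activations
beta = np.linalg.pinv(H) @ T                      # closed-form output weights

pred = np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
print((pred == y).mean())                         # training accuracy
```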
Article
Speech emotion recognition is an important aspect of emotional state recognition in human-machine interaction. Approaches using speech-to-image transforms have become popular in recent years because they can utilise deep neural network models that have proven to be successful in the image processing domain. In this paper, we propose a new speech-to-image transform, CyTex, that maps the raw speech signal directly to a textured image by using calculations based on the fundamental frequency of each speech frame. The textured RGB images resulting from the CyTex transform can then be classified using standard deep neural network models for the recognition of different classes of emotion. Using this approach, we can report an improvement of classification accuracies over the previous state-of-the-art results by 0.81% for the Emo-DB database, and also by 0.5% for the IEMOCAP database.
Conference Paper
Emotional intelligence (EI) is a collection of skills, attitudes, and talents that influence one's ability to respond quickly to environmental changes and stresses. Nevertheless, it is not always possible to monitor the effect of the multitude of variables involved in behavioral phenomena. Emotion perception, or artificial emotional intelligence, is now a $20 billion research area with applications in several industries; emotional readings may, for example, inform AI-supported decisions in advertising campaigns. A powerful learning method is required to extract high-level representations of emotional responses in terms of both temporal dynamics and spatial dispersion. Recurrent neural networks outperform linear models in terms of prediction accuracy. A recurrent model is proposed in this paper to learn the pattern among the variables of age, gender, occupation, marital status, and education in order to predict EI. The resulting recurrent model is capable of predicting EI with significant correlations in most of its dimensions and demonstrates an advantage over regression models in predicting EI from sociological parameters. The model estimates the level of EI in various occupational, professional, gender and age groups and provides a planning basis for addressing possible deficiencies in each group.
Article
For human-machine communication to be as effective as human-to-human communication, research on speech emotion recognition is essential. Among the models and classifiers used to recognize emotions, neural networks appear to be promising due to the network's ability to learn and the diversity of possible configurations. Following the convolutional neural network, a capsule neural network (CapsNet), whose inputs and outputs are not scalar quantities but vectors, allows the network to determine the part-whole relationships that are specific to an object. This paper performs speech emotion recognition based on CapsNet. The corpora for speech emotion recognition have been augmented by adding white noise and changing voices. The feature parameters of the recognition system input are mel spectrum images along with characteristics of the sound source, vocal tract and prosody. For the German emotional corpus EMODB, the average accuracy score for 4 emotions, neutral, boredom, anger and happiness, is 99.69%. For the Vietnamese emotional corpus BKEmo, this score is 94.23% for 4 emotions, neutral, sadness, anger and happiness. The accuracy score is highest when combining all the above feature parameters, and this score increases significantly when combining mel spectrum images with the features directly related to the fundamental frequency.
Article
Full-text available
In today's world, affective computing is very important in the relationship between man and machine. In this paper, a multi-stage system for speech emotion recognition (SER) based on speech signal is proposed, which uses new techniques in different stages of processing. The system consists of three stages: feature extraction, feature selection/dimension reduction, and finally feature classification. In the first stage, a complex set of long-term-statistics features is extracted from both the speech signal and the glottal-waveform signal using a combination of new and diverse features such as prosodic features, spectral features, and spectro-temporal features. One of the challenges of the SER systems is to distinguish correlated emotions. These features are good discriminators for speech emotions and increase the SER's ability to recognize similar and different emotions. The data augmentation technique is also used to increase the number of training samples. This feature vector with a large number of dimensions naturally has redundancy. In the second stage, using classical feature selection techniques as well as a new quantum-inspired technique to reduce the feature vector dimensionality (proposed by the authors), the number of feature vector dimensions is reduced. In the third stage, the optimized feature vector is classified by a weighted deep sparse extreme learning machine (ELM) classifier. The classifier performs classification in three steps: sparse random feature learning, orthogonal random projection using the singular value decomposition (SVD) technique, and discriminative classification in the last step using the generalized Tikhonov regularization technique. Also, many existing emotional datasets suffer from the problem of data imbalanced distribution, which in turn increases the classification error and decreases system performance. In this paper, a new weighting method has also been proposed to deal with class imbalance, which is more efficient than existing weighting methods. The proposed method is evaluated on three standard emotional databases EMODB, SAVEE, and IEMOCAP. According to our latest information, the system proposed in this paper is more accurate in recognizing emotions than the latest state-of-the-art methods.
Article
Full-text available
Echo state network (ESN) is a powerful and efficient tool for displaying dynamic data. However, many existing ESNs have limitations for properly modeling high-dimensional data. The most important limitation of these networks is the high memory consumption due to their reservoir structure, which has prevented the increase of reservoir units and the maximum use of special capabilities of this type of networks. One way to solve this problem is to use quaternion algebra. Because quaternions have four different dimensions, high-dimensional data are easily represented and, using Hamilton's multiplication, with fewer parameters than real numbers, make external relations between the multidimensional features easier. In addition to the memory problem in the ESN network, the linear output of the ESN network poses an indescribable limit to its processing capacity, as it cannot effectively utilize higher-order statistics of features provided by the nonlinear dynamics of reservoir neurons. In this research, a new structure based on ESN is presented, in which quaternion algebra is used to compress the network data with the simple split function, and the output linear combiner is replaced by a multidimensional bilinear filter. This filter will be used for nonlinear calculations of the output layer of the ESN. In addition, the two-dimensional principal component analysis (2dPCA) technique is used to reduce the number of data transferred to the bilinear filter. In this study, the coefficients and the weights of the quaternion nonlinear ESN (QNESN) are optimized using genetic algorithm (GA). In order to prove the effectiveness of the proposed model compared to the previous methods, experiments for speech emotion recognition (SER) have been performed on EMODB, SAVEE and IEMOCAP speech emotional datasets. Comparisons show that the proposed QNESN network performs better than the ESN and most currently SER systems.
Preprint
Full-text available
The echo state network (ESN) is a powerful and efficient tool for displaying dynamic data. However, many existing ESNs have limitations for properly modeling high-dimensional data. The most important limitation of these networks is the high memory consumption due to their reservoir structure, which has prevented the increase of reservoir units and the maximum use of special capabilities of this type of network. One way to solve this problem is to use quaternion algebra. Because quaternions have four different dimensions, high-dimensional data are easily represented and, using Hamilton multiplication, with fewer parameters than real numbers, make external relations between the multidimensional features easier. In addition to the memory problem in the ESN network, the linear output of the ESN network poses an indescribable limit to its processing capacity, as it cannot effectively utilize higher-order statistics of features provided by the nonlinear dynamics of reservoir neurons. In this research, a new structure based on ESN is presented, in which quaternion algebra is used to compress the network data with the simple split function, and the output linear combiner is replaced by a multidimensional bilinear filter. This filter will be used for nonlinear calculations of the output layer of the ESN. In addition, the two-dimensional principal component analysis technique is used to reduce the number of data transferred to the bilinear filter. In this study, the coefficients and the weights of the quaternion nonlinear ESN (QNESN) are optimized using the genetic algorithm. In order to prove the effectiveness of the proposed model compared to the previous methods, experiments for speech emotion recognition have been performed on EMODB, SAVEE, and IEMOCAP speech emotional datasets. Comparisons show that the proposed QNESN network performs better than the ESN and most currently SER systems.
Preprint
Full-text available
Affective computing is very important in the relationship between man and machine. In this paper, a system for speech emotion recognition (SER) based on speech signal is proposed, which uses new techniques in different stages of processing. The system consists of three stages: feature extraction, feature selection, and finally feature classification. In the first stage, a complex set of long-term statistics features is extracted from both the speech signal and the glottal-waveform signal using a combination of new and diverse features such as prosodic, spectral, and spectro-temporal features. One of the challenges of the SER systems is to distinguish correlated emotions. These features are good discriminators for speech emotions and increase the SER's ability to recognize similar and different emotions. This feature vector with a large number of dimensions naturally has redundancy. In the second stage, using classical feature selection techniques as well as a new quantum-inspired technique to reduce the feature vector dimensionality, the number of feature vector dimensions is reduced. In the third stage, the optimized feature vector is classified by a weighted deep sparse extreme learning machine (ELM) classifier. The classifier performs classification in three steps: sparse random feature learning, orthogonal random projection using the singular value decomposition (SVD) technique, and discriminative classification in the last step using the generalized Tikhonov regularization technique. Also, many existing emotional datasets suffer from the problem of data imbalanced distribution, which in turn increases the classification error and decreases system performance. In this paper, a new weighting method has also been proposed to deal with class imbalance, which is more efficient than existing weighting methods. The proposed method is evaluated on three standard emotional databases.
Article
In real-life communication, nonverbal vocalization such as laughter, cries or other emotion interjections, within an utterance play an important role for emotion expression. In previous studies, only few emotion recognition systems consider nonverbal vocalization, which naturally exists in our daily conversation. In this work, both verbal and nonverbal sounds within an utterance are considered for emotion recognition of real-life affective conversations. Firstly, a support vector machine (SVM)-based verbal and nonverbal sound detector is developed. A prosodic phrase auto-tagger is further employed to extract the verbal/nonverbal sound segments. For each segment, the emotion and sound feature embeddings are respectively extracted using the deep residual networks (ResNets). Finally, a sequence of the extracted feature embeddings for the entire dialog turn are fed to an attentive long short-term memory (LSTM)-based sequence-to-sequence model to output an emotional sequence as recognition result. The NNIME corpus (The NTHU-NTUA Chinese interactive multimodal emotion corpus), which consists of verbal and nonverbal sounds, was adopted for system training and testing. 4766 single speaker dialogue turns in the audio data of the NNIME corpus were selected for evaluation. The experimental results showed that nonverbal vocalization was helpful for speech emotion recognition. For comparison, the proposed method based on decision-level fusion achieved an accuracy of 61.92% for speech emotion recognition outperforming the traditional methods as well as the feature-level and model-level fusion approaches.
Article
A challenging issue in the automatic recognition of emotion from speech is the efficient modelling of long temporal contexts, and when long-term temporal dependencies between features are incorporated, recurrent neural network (RNN) architectures are typically employed by default. In this work, we present an efficient deep neural network architecture incorporating Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER), and we demonstrate further opportunities to improve SER performance by exploiting the properties of convolutional neural networks (CNNs) when modelling contextual information. Our proposed model uses parallel convolutional layers (PCN) integrated with a Squeeze-and-Excitation network (SENet), a system herein denoted PCNSE, to extract relationships from 3D spectrograms across timesteps and frequencies; the log-Mel spectrogram with deltas and delta-deltas is used as input. In addition, a self-attention Residual Dilated Network (SADRN) with CTC is employed as the classification block for SER. To the best of the authors' knowledge, this is the first time such a hybrid architecture has been employed for discrete SER. We demonstrate the effectiveness of the proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpus (FAU-AEC). The experimental results reveal that the proposed method is well-suited to discrete SER, achieving a weighted accuracy (WA) of 73.1% and an unweighted accuracy (UA) of 66.3% on IEMOCAP, as well as a UA of 41.1% on FAU-AEC.
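Of the components named above, the Squeeze-and-Excitation block is the most self-contained; the minimal PyTorch sketch below shows the generic squeeze (global average pooling) and excitation (bottleneck gating) steps. The channel count and reduction ratio are illustrative assumptions, not the authors' PCNSE configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight the channels of a conv feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pool
        self.fc = nn.Sequential(                     # excitation: bottleneck gating
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # channel-wise rescaling

# Example: a (batch, channels, time, frequency) spectrogram feature map.
feats = torch.randn(4, 64, 100, 40)
print(SEBlock(64)(feats).shape)                      # torch.Size([4, 64, 100, 40])
```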
Article
As an important branch of affective computing, Speech Emotion Recognition (SER) plays a vital role in human-computer interaction. In order to mine the relevance of signals in audio and increase the diversity of information, Bi-directional Long Short-Term Memory with Directional Self-Attention (BLSTM-DSA) is proposed in this paper. Long Short-Term Memory (LSTM) can learn long-term dependencies from learned local features, and Bi-directional LSTM (BLSTM) makes the structure more robust because directional analysis can better recognize the hidden emotions in a sentence. At the same time, the autocorrelation of speech frames can be used to compensate for missing information, so a Self-Attention mechanism is introduced into SER. The attention weight of each frame is calculated from the outputs of the forward and backward LSTM separately rather than after adding them together. Thus, the algorithm can automatically weight speech frames so as to select the frames carrying emotional information in the temporal network. When evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin database of emotional speech (EMO-DB), BLSTM-DSA demonstrates satisfactory performance on the task of speech emotion recognition. In particular, for recognizing happiness and anger, BLSTM-DSA achieves the highest recognition accuracies.
Article
With the development of both hardware and deep neural network technologies, tremendous improvements have been achieved in the performance of automatic emotion recognition (AER) from video data. However, AER is still a challenging task due to subtle expressions, the abstract concept of emotion, and the representation of multi-modal information. Most proposed approaches focus on multi-modal feature learning and fusion strategies, paying more attention to the characteristics of a single video and ignoring the correlation among videos. To explore this correlation, in this paper we propose a novel correlation-based graph convolutional network (C-GCN) for AER, which comprehensively considers the correlation of intra-class and inter-class videos for feature learning and information fusion. More specifically, we introduce a graph model to represent the correlation among videos; this correlated information helps to improve the discrimination of node features during graph convolution. Meanwhile, a multi-head attention mechanism is applied to predict the hidden relationships among videos, which strengthens the inter-class correlation and improves classifier performance. The C-GCN is evaluated on the AFEW and eNTERFACE'05 datasets. The final experimental results demonstrate the superiority of our proposed method over state-of-the-art methods.
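For readers unfamiliar with graph convolutions, a generic single-layer sketch (Kipf-style normalized propagation) is shown below; the node features, adjacency matrix, and dimensions are random placeholders, and it does not reproduce the correlation modelling or multi-head attention of C-GCN.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):                    # h: (nodes, in_dim), adj: (nodes, nodes)
        a_hat = adj + torch.eye(adj.size(0))      # add self-loops
        d_inv_sqrt = a_hat.sum(1).pow(-0.5)       # symmetric degree normalization
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm @ self.lin(h))

# 6 video nodes with 128-dim features; the adjacency stands in for pairwise correlation.
h = torch.randn(6, 128)
adj = (torch.rand(6, 6) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()               # make the graph symmetric
print(GCNLayer(128, 64)(h, adj).shape)            # torch.Size([6, 64])
```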
Article
In recent years, studies of speech signals have paid increasing attention to emotional information. The most challenging aspect of speech emotion recognition (SER) is choosing the optimal speech feature representation. According to statistical analysis, the role of each speech feature differs across emotions, indicating that different features have different abilities to distinguish emotions. This study proposes an emotion-category-based feature weighting (ECFW) method, which aims to find the prominence of each feature under different emotions and to apply this prominence as prior knowledge. Furthermore, previous studies have paid little attention to matching speech features with models. This study argues that different combinations of models and features result in large differences in SER performance, which is evaluated by several experiments. Features must be modelled with appropriate approaches to extract the most valuable information for emotional representation. The best combinations of features and models are then selected to test our method. The method is applied to three commonly used speech emotion databases, IEMOCAP, MASC, and EMO-DB. The results show that ECFW significantly improves the performance of SER tasks.
Conference Paper
Full-text available
Accurately recognizing speaker emotion and age/gender from speech can provide a better user experience for many spoken dialogue systems. In this study, we propose to use deep neural networks (DNNs) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time. The feature encoding process is designed to be jointly trained with the utterance-level classifier for better classification. A kernel extreme learning machine (ELM) is further trained on the encoded vectors for better utterance-level classification. Experiments on a Mandarin dataset demonstrate the effectiveness of the proposed methods on speech emotion and age/gender recognition tasks.
Conference Paper
Full-text available
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
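A minimal sketch of the general idea of attention-weighted temporal pooling over recurrent frame outputs follows; the GRU dimensions and the single scoring vector are assumptions, not the exact published model.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool frame-level RNN outputs into one utterance vector via learned attention."""
    def __init__(self, input_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden, batch_first=True)
        self.u = nn.Linear(hidden, 1, bias=False)   # frame scoring vector

    def forward(self, x):                           # x: (batch, frames, input_dim)
        h, _ = self.rnn(x)                          # (batch, frames, hidden)
        alpha = torch.softmax(self.u(h), dim=1)     # attention weight per frame
        return (alpha * h).sum(dim=1)               # (batch, hidden) utterance vector

llds = torch.randn(8, 500, 32)                      # 8 utterances, 500 frames of 32-dim LLDs
print(AttentivePooling()(llds).shape)               # torch.Size([8, 128])
```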
Conference Paper
Full-text available
This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained with an efficient learning algorithm. The proposed system takes into account long-range context effects and the uncertainty of emotional labels. To extract a high-level representation of emotional states together with their temporal dynamics, a powerful learning method with a bidirectional long short-term memory (BLSTM) model is adopted. To overcome the uncertainty of emotional labels, in which all frames in the same utterance are mapped to the same emotional label, the label of each frame is regarded as a sequence of random variables, and these sequences are trained with the proposed learning algorithm. The weighted accuracy of the proposed emotion recognition system improves by up to 12% compared to the DNN-ELM based emotion recognition system used as a baseline.
Article
Full-text available
Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis and image caption generation. We extend the attention mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER on single utterances and 20% on 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to 17.6%.
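To make the location-awareness idea concrete, the sketch below conditions the attention scores on the previous alignment through a small convolution; all dimensions, the convolution width, and layer names are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    """Score encoder frames from decoder state, frame content, and previous alignment."""
    def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128, conv_ch=10, conv_k=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)   # decoder-state term
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)   # encoder-content term
        self.U = nn.Linear(conv_ch, attn_dim, bias=False)   # location term
        self.loc_conv = nn.Conv1d(1, conv_ch, conv_k, padding=conv_k // 2)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_out, prev_align):
        # dec_state: (batch, dec_dim), enc_out: (batch, T, enc_dim), prev_align: (batch, T)
        f = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)        # (batch, T, conv_ch)
        e = self.w(torch.tanh(self.W(dec_state).unsqueeze(1) + self.V(enc_out) + self.U(f)))
        align = torch.softmax(e.squeeze(-1), dim=1)                       # new alignment
        context = (align.unsqueeze(-1) * enc_out).sum(dim=1)              # attended summary
        return context, align

enc = torch.randn(2, 200, 256)
dec = torch.randn(2, 256)
prev = torch.full((2, 200), 1.0 / 200)                                    # uniform start
ctx, al = LocationAwareAttention()(dec, enc, prev)
print(ctx.shape, al.shape)        # torch.Size([2, 256]) torch.Size([2, 200])
```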
Conference Paper
Full-text available
Speech emotion recognition is a challenging problem, partly because it is unclear what features are effective for the task. In this paper we propose to utilize deep neural networks (DNNs) to extract high-level features from raw data and show that they are effective for speech emotion recognition. We first produce an emotion state probability distribution for each speech segment using DNNs. We then construct utterance-level features from the segment-level probability distributions. These utterance-level features are fed into an extreme learning machine (ELM), a simple and efficient single-hidden-layer neural network, to identify utterance-level emotions. The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to a 20% relative accuracy improvement compared to state-of-the-art approaches.
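A small numpy sketch of the intermediate step, turning segment-level emotion posteriors into a fixed-length utterance-level feature vector, is shown below; the particular statistics chosen (mean, max, min, and per-class win rate) are illustrative assumptions.

```python
import numpy as np

def utterance_features(segment_probs: np.ndarray) -> np.ndarray:
    """Turn (n_segments, n_classes) posteriors into one fixed-length utterance vector."""
    mean = segment_probs.mean(axis=0)
    mx = segment_probs.max(axis=0)
    mn = segment_probs.min(axis=0)
    # Fraction of segments on which each class has the highest posterior.
    wins = np.bincount(segment_probs.argmax(axis=1), minlength=segment_probs.shape[1])
    win_rate = wins / segment_probs.shape[0]
    return np.concatenate([mean, mx, mn, win_rate])

# Example: 25 segments scored over 4 emotion classes by some segment-level model.
probs = np.random.dirichlet(np.ones(4), size=25)
print(utterance_features(probs).shape)   # (16,)
```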
Conference Paper
Full-text available
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
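For a quick start from Python, the snippet below uses the opensmile Python wrapper package (assumed to be installed via pip); the chosen feature set, feature level, and file name are placeholders, and the native toolkit is normally driven through its SMILExtract command-line binary and configuration files.

```python
# A minimal sketch assuming `pip install opensmile`; "utterance.wav" is a placeholder path.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # a standard LLD/functional set
    feature_level=opensmile.FeatureLevel.Functionals,   # utterance-level statistics
)
features = smile.process_file("utterance.wav")          # returns a pandas DataFrame
print(features.shape, list(features.columns)[:5])
```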
Article
Full-text available
During expressive speech, the voice is enriched to convey not only the intended semantic message but also the emotional state of the speaker. The pitch contour is one of the important properties of speech that is affected by this emotional modulation. Although pitch features have been commonly used to recognize emotions, it is not clear what aspects of the pitch contour are the most emotionally salient. This paper presents an analysis of the statistics derived from the pitch contour. First, pitch features derived from emotional speech samples are compared with the ones derived from neutral speech, by using symmetric Kullback-Leibler distance. Then, the emotionally discriminative power of the pitch features is quantified by comparing nested logistic regression models. The results indicate that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape. Also, analyzing the pitch statistics at the utterance level is found to be more accurate and robust than analyzing the pitch statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected to build a binary emotion detection system for distinguishing between emotional versus neutral speech. A new two-step approach is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral model. In the second step, a fitness measure is used to assess whether the input speech is similar to, in the case of neutral speech, or different from, in the case of emotional speech, the reference models. The proposed approach is tested with four acted emotional databases spanning different emotional categories, recording settings, speakers and languages. The results show that the recognition accuracy of the system is over 77% just with the pitch features (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of both accuracy and robustness.
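The following numpy sketch computes the gross pitch-contour statistics discussed above and a symmetric Kullback-Leibler distance between two samples of a pitch feature; the histogram binning and the synthetic F0 values are arbitrary illustration choices.

```python
import numpy as np

def gross_pitch_stats(f0: np.ndarray) -> dict:
    """Gross pitch-contour statistics over voiced frames (f0 > 0, in Hz)."""
    voiced = f0[f0 > 0]
    return {"mean": voiced.mean(), "max": voiced.max(),
            "min": voiced.min(), "range": voiced.max() - voiced.min()}

def symmetric_kl(x: np.ndarray, y: np.ndarray, bins: int = 30, eps: float = 1e-8) -> float:
    """Symmetric KL distance between histogram estimates of two feature samples."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Synthetic F0 means standing in for "neutral" and "emotional" utterances.
neutral = np.random.normal(180, 20, 500)
emotional = np.random.normal(220, 40, 500)
print(gross_pitch_stats(neutral))
print(symmetric_kl(neutral, emotional))
```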
Article
Full-text available
Automatic recognition of emotion is becoming an increasingly important component in the design process for affect-sensitive human-machine interaction (HMI) systems. Well-designed emotion recognition systems have the potential to augment HMI systems by providing additional user state details and by informing the design of emotionally relevant and emotionally targeted synthetic behavior. This paper describes an emotion classification paradigm, based on emotion profiles (EPs). This paradigm is an approach to interpret the emotional content of naturalistic human expression by providing multiple probabilistic class labels, rather than a single hard label. EPs provide an assessment of the emotion content of an utterance in terms of a set of simple categorical emotions: anger; happiness; neutrality; and sadness. This method can accurately capture the general emotional label (attaining an accuracy of 68.2% in our experiment on the IEMOCAP data) in addition to identifying underlying emotional properties of highly emotionally ambiguous utterances. This capability is beneficial when dealing with naturalistic human emotional expressions, which are often not well described by a single semantic label.
Conference Paper
Full-text available
Most paralinguistic analysis tasks lack agreed-upon evaluation procedures and comparability, in contrast to more 'traditional' disciplines in speech analysis. The INTERSPEECH 2010 Paralinguistic Challenge aims to help overcome the usually low comparability of results by addressing three selected sub-challenges. In the Age Sub-Challenge, the age of speakers has to be determined in four groups. In the Gender Sub-Challenge, a three-class classification task has to be solved, and finally, the Affect Sub-Challenge asks for speakers' interest in an ordinal representation. This paper introduces the conditions, the Challenge corpora "aGender" and "TUM AVIC", and the standard feature sets that may be used. Further, baseline results are given.
Article
Full-text available
Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 h of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
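To make the gating mechanism concrete, here is a minimal single-step LSTM cell in numpy, showing how the input, forget, and output gates modulate the cell state; weights are random, dimensions are arbitrary, and training is omitted entirely.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: gates decide what to write, what to keep, what to expose."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # cell state (the "error carousel")
    h = o * np.tanh(c)                             # exposed hidden state
    return h, c

n_in, n_hidden = 13, 32
W = np.random.randn(4 * n_hidden, n_in + n_hidden) * 0.1
b = np.zeros(4 * n_hidden)
h = c = np.zeros(n_hidden)
for frame in np.random.randn(5, n_in):             # run 5 feature frames through the cell
    h, c = lstm_step(frame, h, c, W, b)
print(h.shape)                                      # (32,)
```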
Article
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the model’s performance.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
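As a quick usage illustration, the example below calls scikit-learn's SVC, whose solver is built on LIBSVM, with an RBF kernel and probability estimates enabled; the data are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps the LIBSVM solver

# Random placeholder features: 200 utterances, 384-dim functionals, 4 emotion classes.
X = np.random.randn(200, 384)
y = np.random.randint(0, 4, size=200)

clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
clf.fit(X, y)
print(clf.predict(X[:3]))          # hard class labels
print(clf.predict_proba(X[:3]))    # per-class probability estimates
```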
Conference Paper
The denoising autoencoder (DAE) has recently been successfully applied to acoustic emotion recognition. In this paper, we adopt the framework of a modified DAE that projects the input signal onto two different hidden representations, for neutral and emotional speech respectively, and uses the emotional representation for the classification task. We propose to model gender information for a more robust emotional representation in this work. For the neutral representation, male- and female-dependent DAEs are built using non-emotional speech with the aim of capturing distinct information between the two genders. The emotional hidden representation is shared between the two genders in order to model more emotion-specific characteristics, and is used as features in a back-end classifier for emotion recognition. We propose different optimization objectives for training the DAEs. Our experimental results show an improvement in unweighted accuracy compared with previous work using the modified DAE method and with classifiers using standard static features. Further performance gains can be achieved by structural-level system combination.
Conference Paper
Emotion recognition is the process of identifying the affective characteristics of an utterance given either static or dynamic descriptions of its signal content. This requires the use of units, windows over which the emotion variation is quantified. However, the appropriate time scale for these units is still an open question. Traditionally, emotion recognition systems have relied upon units of fixed length, whose variation is then modeled over time. This paper takes the view that emotion is expressed over units of variable length. In this paper, variable-length units are introduced and used to capture the local dynamics of emotion at the sub-utterance scale. The results demonstrate that subsets of these local dynamics are salient with respect to emotion class. These salient units provide insight into the natural variation in emotional speech and can be used in a classification framework to achieve performance comparable to the state-of-the-art. This hints at the existence of building blocks that may underlie natural human emotional communication.
Article
Emotion recognition from speech has emerged as an important research area in the recent past. In this regard, a review of existing work on emotional speech processing is useful for carrying out further research. In this paper, the recent literature on speech emotion recognition is presented, considering the issues related to emotional speech corpora, the different types of speech features, and the models used for recognition of emotions from speech. Thirty-two representative speech databases are reviewed from the point of view of their language, number of speakers, number of emotions, and purpose of collection. The issues related to emotional speech databases used in emotional speech recognition are also briefly discussed. Literature on the different features used in the task of emotion recognition from speech is presented. The importance of choosing different classification models is discussed along with the review. The important issues to be considered for further emotion recognition research, in general and specifically in the Indian context, are highlighted where necessary.
Article
Human emotional expression tends to evolve in a structured manner in the sense that certain emotional evolution patterns, e.g., anger to anger, are more probable than others, e.g., anger to happiness. Furthermore, the perception of an emotional display can be affected by recent emotional displays. Therefore, the emotional content of past and future observations could offer relevant temporal context when classifying the emotional content of an observation. In this work, we focus on audio-visual recognition of the emotional content of improvised emotional interactions at the utterance level. We examine context-sensitive schemes for emotion recognition within a multimodal, hierarchical approach: bidirectional Long Short-Term Memory (BLSTM) neural networks, hierarchical Hidden Markov Model classifiers (HMMs), and hybrid HMM/BLSTM classifiers are considered for modeling emotion evolution within an utterance and between utterances over the course of a dialog. Overall, our experimental results indicate that incorporating long-term temporal context is beneficial for emotion recognition systems that encounter a variety of emotional manifestations. Context-sensitive approaches outperform those without context for classification tasks such as discrimination between valence levels or between clusters in the valence-activation space. The analysis of emotional transitions in our database sheds light on the flow of affective expressions, revealing potentially useful patterns.
Article
Recently, increasing attention has been directed to the study of the emotional content of speech signals, and hence, many systems have been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first one is the choice of suitable features for speech representation. The second issue is the design of an appropriate classification scheme and the third issue is the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey. This section also suggests possible ways of improving speech emotion recognition systems.
Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks
  • Z.-Q Wang
  • I Tashev
Wang, Z.-Q. and Tashev, I., "Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks," ICASSP, 2017.
Automatic Speech Emotion Recognition Using Recurrent Neural Networks With Local Attention
  • S Mirsamadi
  • E Barsoum
  • C Zhang
Mirsamadi, S., Barsoum, E. and Zhang, C., "Automatic Speech Emotion Recognition Using Recurrent Neural Networks With Local Attention," ICASSP, 2017.
  • B Schuller
  • S Steidl
  • A Batliner
  • F Burkhardt
  • L Devillers
  • C Müller
  • S Narayanan
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S., "The INTERSPEECH 2010 Paralinguistic Challenge," INTERSPEECH, pp. 2794-2797, 2010.
Attention-based models for speech recognition
  • J Chorowski
  • D Bahdanau
  • D Serdyuk
  • K Cho
  • Y Bengio
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K. and Bengio, Y., "Attention-based models for speech recognition," Advances in Neural Information Processing Systems, pp. 577-585, 2015.
Representation Learning for Speech Emotion Recognition
  • S Ghosh
  • E Laksana
  • L.-P Morency
  • S Scherer
Ghosh, S., Laksana, E., Morency, L.-P. and Scherer, S., "Representation Learning for Speech Emotion Recognition," INTERSPEECH, pp. 3603-3607, 2016.
Hierarchical modeling of temporal course in emotional expression for speech emotion recognition
  • C.-H Wu
  • W.-B Liang
  • K.-C Cheng
  • J.-C Lin
Wu, C.-H., Liang, W.-B., Cheng, K.-C. and Lin, J.-C., "Hierarchical modeling of temporal course in emotional expression for speech emotion recognition," ACII, pp. 810-814, 2015.
  • J Bergstra
  • F Bastien
  • O Breuleux
  • P Lamblin
  • R Pascanu
  • O Delalleau
  • G Desjardins
  • D Warde-Farley
  • I Goodfellow
  • A Bergeron
Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I., Bergeron, A., et al., "Theano: Deep learning on GPUs with Python," NIPS BigLearning Workshop, Granada, Spain, 2011.
On fast dropout and its applicability to recurrent networks
  • J Bayer
  • C Osendorfer
  • D Korhammer
  • N Chen
  • S Urban
  • P Van Der Smagt
Bayer, J., Osendorfer, C., Korhammer, D., Chen, N., Urban, S. and van der Smagt, P., "On fast dropout and its applicability to recurrent networks," arXiv preprint arXiv:1311.0701, 2013.
Incorporating Nesterov momentum into Adam
  • T Dozat
Theano: Deep learning on gpus with python
  • J Bergstra