Conference Paper

Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation

... Most current studies propose training deep learning models to extract those feature sets from the data [8,9]. Although these approaches have yielded satisfactory results, two problems remain. ...
... Following previous works [3, 8,9], we consider only the utterances which are given the same label by at least two annotators, and we merge the utterances labelled as "Happy" and "Excited" into the "Happy" category. We further select only the utterances with the labels "Angry", "Neutral", "Sad" and "Happy", resulting in 5,531 utterances, which is approximately 7 hours of data. ...
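As an illustration of the selection procedure quoted above, the following is a minimal Python sketch, assuming the annotations are available as a mapping from utterance id to the list of labels given by the individual annotators; the variable and label names are illustrative.

# Minimal sketch of the 4-class IEMOCAP filtering described above.
# `annotations` is assumed to map each utterance id to the list of
# categorical labels assigned by the individual annotators.
from collections import Counter

LABEL_MERGE = {"Excited": "Happy"}           # merge "Excited" into "Happy"
KEEP = {"Angry", "Neutral", "Sad", "Happy"}  # 4-class setup

def select_utterances(annotations):
    """Return {utt_id: label} for utterances whose majority label is kept."""
    selected = {}
    for utt_id, labels in annotations.items():
        merged = [LABEL_MERGE.get(lab, lab) for lab in labels]
        label, votes = Counter(merged).most_common(1)[0]
        # keep only utterances on which at least two annotators agree
        if votes >= 2 and label in KEEP:
            selected[utt_id] = label
    return selected

if __name__ == "__main__":
    demo = {"Ses01F_impro01_F000": ["Excited", "Happy", "Neutral"],
            "Ses01F_impro01_M001": ["Angry", "Sad", "Neutral"]}
    print(select_utterances(demo))  # only the first utterance survives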
Preprint
Full-text available
Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.
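As a rough illustration of the score-fusion step mentioned in the abstract, the following Python sketch combines the per-class posteriors of a speech-based and a text-based classifier; the equal fusion weight, the softmax normalisation, and the class order are assumptions, not the authors' exact recipe.

# Score-level fusion of a speech-based and a text-based emotion classifier.
# The fusion weight and class order below are illustrative assumptions.
import numpy as np

CLASSES = ["Angry", "Happy", "Neutral", "Sad"]

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(speech_logits, text_logits, alpha=0.5):
    """Weighted sum of per-class posteriors from the two modalities."""
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    fused = alpha * p_speech + (1.0 - alpha) * p_text
    return [CLASSES[i] for i in fused.argmax(axis=-1)]

# toy usage: one utterance, 4-class logits from each branch
print(fuse_scores(np.array([[2.0, 0.1, 0.3, -1.0]]),
                  np.array([[0.2, 1.5, 0.1, 0.0]])))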
... Pre-trained input features, i.e., embeddings extracted with neural models trained on speech processing tasks other than SER, are now used extensively. The advantage of such an approach is to benefit from a large amount of data from a different task, such as automatic speech recognition (ASR) [17] or speaker recognition [18]. ...
Preprint
Full-text available
Speech data carries a range of personal information, such as the speaker's identity and emotional state. These attributes can be used for malicious purposes. With the development of virtual assistants, a new generation of privacy threats has emerged. Current studies have addressed the topic of preserving speech privacy. One of them, the VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology. The task selected for the VoicePrivacy 2020 Challenge (VPC) is about speaker anonymization. The goal is to hide the source speaker's identity while preserving the linguistic information. The baseline of the VPC makes use of a voice conversion. This paper studies the impact of the speaker anonymization baseline system of the VPC on emotional information present in speech utterances. Evaluation is performed following the VPC rules regarding the attackers' knowledge about the anonymization system. Our results show that the VPC baseline system does not suppress speakers' emotions against informed attackers. When comparing anonymized speech to original speech, the emotion recognition performance is degraded by 15\% relative to IEMOCAP data, similar to the degradation observed for automatic speech recognition used to evaluate the preservation of the linguistic information.
... However, despite significant progress, SER also still faces challenges, such as dealing with individual differences in emotional expression and identifying emotions in noisy environments. Several studies [11][12][13] have explored the integration of ASR and SER to improve the performance of both tasks. One approach [14] is to use the latent features extracted from a pre-trained ASR model to perform SER. ...
... This method is an end-to-end deep neural network model for speech emotion recognition based on wav2vec [22]. In 2020, Yeh used end-to-end ASR to extract ASR-based representations for speech emotion recognition and designed a factorized domain adaptation method on a pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus [23]. In 2020, Bakhshi used raw time-domain speech signals and frequency-domain information as inputs to a deep Conv-RNN network, effectively extracting emotional representations of speech signals and achieving end-to-end dimensional emotion recognition [24]. ...
Article
Full-text available
Speech emotion recognition (SER) technology is significant for human–computer interaction, and this paper studies the features and modeling of SER. The mel-spectrogram is introduced and utilized as the speech feature, and the theory and extraction process of the mel-spectrogram are presented in detail. A deep residual shrinkage network with a bi-directional gated recurrent unit (DRSN-BiGRU) is proposed in this paper, composed of a convolutional network, a residual shrinkage network, a bi-directional recurrent unit, and a fully-connected network. Through the self-attention mechanism, DRSN-BiGRU can automatically ignore noisy information and improve its ability to learn effective features. Network optimization and verification experiments are carried out on three emotional datasets (CASIA, IEMOCAP, and MELD), and the accuracies of DRSN-BiGRU are 86.03%, 86.07%, and 70.57%, respectively. The results are also analyzed and compared with DCNN-LSTM, CNN-BiLSTM, and DRN-BiGRU, which verifies the superior performance of DRSN-BiGRU.
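For orientation, the following Python sketch shows a log-mel-spectrogram front end (via librosa) feeding a plain BiGRU classifier; it deliberately omits the residual shrinkage blocks and self-attention of the actual DRSN-BiGRU, and all sizes are illustrative.

# Mel-spectrogram front end plus a plain BiGRU classifier; this is an
# orientation example only, not the full DRSN-BiGRU architecture.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T           # (frames, n_mels)

class BiGRUClassifier(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_mels)
        h, _ = self.gru(x)
        return self.out(h.mean(dim=1))     # mean-pool over time

model = BiGRUClassifier()
dummy = torch.randn(2, 300, 64)            # 2 utterances, 300 frames each
print(model(dummy).shape)                  # torch.Size([2, 4])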
... For instance, the previous review on SER (Akçay and Oguz, 2020) highlighted accuracies of 54% (2014) and 63.5% (2017). In a recent study, Yeh et al. (2020) achieved 66% accuracy using the listen, attend, and spell (LAS) model for multitask ASR and SER. In contrast, the fusion of acoustic and linguistic information performed by Lian et al. (2020) reached 82% weighted accuracy (WA). ...
Article
Full-text available
Speech emotion recognition (SER) is traditionally performed using merely acoustic information. Acoustic features, commonly extracted per frame, are mapped to emotion labels using classifiers such as support vector machines for machine learning or multi-layer perceptrons for deep learning. Previous research has shown that acoustic-only SER suffers from many issues, most notably low performance. On the other hand, speech carries not only acoustic information but also linguistic information. The linguistic features can be extracted from text transcribed by an automatic speech recognition system. The fusion of acoustic and linguistic information could improve SER performance. This paper presents a survey of works on bimodal emotion recognition fusing acoustic and linguistic information. Five components of bimodal SER are reviewed: emotion models, datasets, features, classifiers, and fusion methods. Some major findings, including state-of-the-art results and their methods on the commonly used datasets, are also presented to give insights into current research and to help surpass these results. Finally, this survey lays out the remaining issues in bimodal SER research as future research directions.
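A minimal sketch of the classical acoustic-only pipeline the survey describes, with frame-level features summarised by simple functionals and mapped to labels by an SVM; the choice of MFCCs and of mean/standard-deviation functionals is an illustrative assumption.

# Classical acoustic-only SER baseline: frame-level features summarised by
# simple functionals and classified with an SVM. Feature and functional
# choices here are illustrative assumptions.
import librosa
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_svm(paths, labels):
    X = np.stack([utterance_features(p) for p in paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, labels)
    return clf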
Conference Paper
Full-text available
During the last decade, applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and the last layers (closer to text) of the ASR and valence/arousal.
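In the spirit of the approach above, the following sketch pools intermediate activations of a pretrained ASR encoder and fits a linear regressor for valence and arousal; as a stand-in for the authors' own ASR model it uses a wav2vec 2.0 ASR bundle from torchaudio, and the layer index, mean pooling, and ridge regressor are assumptions.

# Intermediate ASR-encoder activations used as emotion features. The
# torchaudio wav2vec 2.0 ASR bundle is a substitute for the paper's own ASR.
import numpy as np
import torch
import torchaudio
from sklearn.linear_model import Ridge

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr = bundle.get_model().eval()

def utterance_embedding(path, layer=6):
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        layers, _ = asr.extract_features(wav)   # list of (1, frames, dim)
    return layers[layer].mean(dim=1).squeeze(0).numpy()

def fit_valence_arousal(paths, targets):
    """targets: (n_utterances, 2) array with valence and arousal values."""
    X = np.stack([utterance_embedding(p) for p in paths])
    return Ridge(alpha=1.0).fit(X, targets)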
Conference Paper
Full-text available
In this paper we present our findings on how representation learning on large unlabeled speech corpora can be beneficially utilized for speech emotion recognition (SER). Prior work on representation learning for SER mostly focused on relatively small emotional speech datasets without making use of additional unlabeled speech data. We show that integrating representations learnt by an unsupervised autoencoder into a CNN-based emotion classifier improves recognition accuracy. To gain insights about what those models learn, we analyze visualizations of the different representations using t-distributed stochastic neighbor embedding (t-SNE). We evaluate our approach on IEMOCAP and MSP-IMPROV by means of within- and cross-corpus testing.
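A minimal sketch of the idea: an autoencoder is trained on unlabeled spectrogram frames and its bottleneck code is reused as an input representation for the emotion classifier; layer sizes and the way the code is consumed downstream are simplifications, not the paper's exact setup.

# Unsupervised autoencoder whose bottleneck code is later reused as an
# input representation for an emotion classifier. Sizes are illustrative.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_mels=64, code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                     nn.Linear(128, code))
        self.decoder = nn.Sequential(nn.Linear(code, 128), nn.ReLU(),
                                     nn.Linear(128, n_mels))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

ae = FrameAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
frames = torch.randn(256, 64)                  # stand-in unlabeled frames
recon, _ = ae(frames)
loss = nn.functional.mse_loss(recon, frames)   # unsupervised reconstruction
loss.backward()
opt.step()

# the bottleneck code is what gets handed to the CNN emotion classifier
with torch.no_grad():
    codes = ae.encoder(frames)                 # (256, 32) learned features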
Conference Paper
Full-text available
Deep Neural Networks trained on large datasets can be easily transferred to new domains with far fewer labeled examples by a process called fine-tuning. This has the advantage that representations learned in the large source domain can be exploited on smaller target domains. However, networks designed to be optimal for the source task are often prohibitively large for the target task. In this work we address the compression of networks after domain transfer. We focus on compression algorithms based on low-rank matrix decomposition. Existing methods base compression solely on learned network weights and ignore the statistics of network activations. We show that domain transfer leads to large shifts in network activations and that it is desirable to take this into account when compressing. We demonstrate that considering activation statistics when compressing weights leads to a rank-constrained regression problem with a closed-form solution. Because our method takes into account the target domain, it can more optimally remove the redundancy in the weights. Experiments show that our Domain Adaptive Low Rank (DALR) method significantly outperforms existing low-rank compression techniques. With our approach, the fc6 layer of VGG19 can be compressed more than 4x more than using truncated SVD alone – with only a minor or no loss in accuracy. When applied to domain-transferred networks it allows for compression down to only 5-20% of the original number of parameters with only a minor drop in performance.
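A minimal NumPy sketch of the rank-constrained regression idea, under the simplifying assumption of a bias-free linear layer: rather than truncating W itself, the layer output X @ W computed on target-domain activations X is approximated at rank k and pulled back through X.

# Activation-aware low-rank compression: approximate the layer *outputs*
# X @ W rather than the weights W. Bias handling and any regularisation
# used in the actual DALR method are omitted here.
import numpy as np

def activation_aware_lowrank(X, W, k):
    """X: (n, d_in) target-domain activations, W: (d_in, d_out), rank k.
    Returns A (d_in, k), B (k, d_out) minimising ||X @ W - X @ A @ B||_F."""
    Z = X @ W
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    A = np.linalg.pinv(X) @ (U[:, :k] * s[:k])  # pull rank-k output back through X
    B = Vt[:k]
    return A, B

# compare against the uncompressed layer on random data
rng = np.random.default_rng(0)
X, W = rng.normal(size=(500, 256)), rng.normal(size=(256, 128))
A, B = activation_aware_lowrank(X, W, k=32)
print(np.linalg.norm(X @ W - X @ A @ B) / np.linalg.norm(X @ W))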
Conference Paper
Full-text available
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
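A small PyTorch sketch of attention-weighted pooling over frame-level recurrent outputs, which is the core mechanism described above; the dimensions and the single-layer scorer are illustrative choices, not the authors' exact configuration.

# Attention-based pooling over frame-level RNN outputs for utterance-level
# emotion classification. Sizes and the scorer are illustrative.
import torch
import torch.nn as nn

class AttentivePoolingSER(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_classes=4):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden, 1)    # one attention score per frame
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                         # x: (batch, frames, n_feats)
        h, _ = self.rnn(x)                        # (batch, frames, 2*hidden)
        w = torch.softmax(self.scorer(h), dim=1)  # (batch, frames, 1)
        utt = (w * h).sum(dim=1)                  # attention-weighted pooling
        return self.out(utt)

print(AttentivePoolingSER()(torch.randn(2, 300, 40)).shape)  # torch.Size([2, 4])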
Article
Full-text available
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
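For reference, the eGeMAPS functionals can be extracted with the openSMILE Python wrapper roughly as follows; the feature-set name eGeMAPSv02 reflects the current opensmile package and may differ from the exact version described in the paper.

# Extracting eGeMAPS functionals with the openSMILE Python wrapper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")   # one row of 88 functionals
print(features.shape)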
Conference Paper
Full-text available
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of overall twelve enacted emotional states. In this paper, we describe these four Sub-Challenges, their conditions, baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
Conference Paper
Full-text available
Most paralinguistic analysis tasks are lacking agreed-upon evaluation procedures and comparability, in contrast to more 'traditional' disciplines in speech analysis. The INTERSPEECH 2010 Paralinguistic Challenge shall help overcome the usually low compatibility of results, by addressing three selected sub-challenges. In the Age Sub-Challenge, the age of speakers has to be determined in four groups. In the Gender Sub-Challenge, a three-class classification task has to be solved and finally, the Affect Sub-Challenge asks for speakers' interest in ordinal representation. This paper introduces the conditions, the Challenge corpora "aGender" and "TUM AVIC" and standard feature sets that may be used. Further, baseline results are given.
Article
Full-text available
Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 h of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
Article
Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
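The probing protocol can be sketched as follows: frame-level activations from a given layer are fed to a simple phone classifier and layers are compared by accuracy. Here layer_features and phone_labels are assumed to be precomputed arrays (e.g., from forced alignment), and the logistic-regression probe is an illustrative stand-in for the paper's classifier.

# Layer-wise probing of a pretrained end-to-end ASR model: train a simple
# frame-level phone classifier on each layer's activations and compare.
# `layer_features` and `phone_labels` are assumed precomputed arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(layer_features, phone_labels):
    """layer_features: (n_frames, dim), phone_labels: (n_frames,)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        layer_features, phone_labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)          # frame-level phone accuracy

# compare representation quality across layers, e.g.:
# accuracies = {name: probe_layer(feats, phones) for name, feats in layers.items()}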
Article
Formally, the problem that we present is that of identifying the hidden attributes of the system that modulates the body's signals, uncovered through novel signal processing and machine learning on large-scale multimodal data (Figure 1). Signal processing is the keystone that supports this mapping from data to representations of behaviors and mental states. The pipeline first begins with raw signals, such as from visual, auditory, and physiological sensors. Then, we need to localize information coming from corresponding behavioral channels, such as the face, body, and voice. Next, the signals are denoised and modeled to extract meaningful information like the words that are said and patterns of how they are spoken. The coordination of channels can also be assessed via time-series modeling techniques. Moreover, since an individual's behavior is not isolated, but influenced by a communicative partner's actions and the environment (e.g., interview versus casual discussion, home versus clinic), temporal modeling must account for these contextual effects. Finally, having achieved a representation of behavior derived from the signals, machine learning is used to make inferences on mental states to support human or autonomous decision making.
Article
The recently proposed deep neural network (DNN) obtains significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, a DNN requires many more parameters than traditional systems, which brings huge cost during online evaluation and also limits the application of DNNs in many scenarios. In this paper we present our new effort on DNNs aimed at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices in the DNN and then restructure the model based on the inherent sparseness of the original matrices. After restructuring, we can reduce the DNN model size significantly with negligible accuracy loss. We also fine-tune the restructured model using regular back-propagation to recover accuracy when reducing the DNN model size heavily. The proposed method has been evaluated on two LVCSR tasks with context-dependent DNN hidden Markov models (CD-DNN-HMM). Experimental results show that the proposed approach dramatically reduces the DNN model size by more than 80% without losing any accuracy.
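A minimal PyTorch sketch of the SVD restructuring: a large linear layer is replaced by two smaller ones obtained from a truncated SVD of its weight matrix, after which the whole model would normally be fine-tuned to recover accuracy; the rank here is an illustrative choice.

# SVD restructuring of a linear layer into two smaller layers.
import torch
import torch.nn as nn

def svd_restructure(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                        # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=True)
    first.weight.data = torch.diag(S[:rank]) @ Vh[:rank]   # (rank, in)
    second.weight.data = U[:, :rank]                        # (out, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)

layer = nn.Linear(2048, 2048)
small = svd_restructure(layer, rank=256)          # roughly 4x fewer parameters
x = torch.randn(4, 2048)
print((layer(x) - small(x)).abs().max())          # approximation error before fine-tuning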
Conference Paper
In this paper, we present improvements made to the TED-LIUM corpus we released in 2012. These enhancements fall into two categories. First, we describe how we filtered publicly available monolingual data and used it to estimate well-suited language models (LMs), using open-source tools. Then, we describe the process of selection we applied to new acoustic data from TED talks, providing additions to our previously released corpus. Finally, we report some experiments we made around these improvements.
Conference Paper
The large number of parameters in deep neural networks (DNNs) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN by the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between the two low-rank matrices. In the second method, we adapt the full weight matrix but only store the delta matrix, the difference between the original and adapted weight matrices. We decrease the footprint of the adapted model by storing a reduced-rank version of the delta matrix via an SVD. The proposed methods were evaluated on a short message dictation task. Experimental results show that we can obtain accuracy improvements similar to the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters, requiring only 0.89% of the original model storage.
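A sketch of the first method, under the stated SVD factorization: the weight matrix is replaced by two low-rank factors with a small square matrix, initialised to the identity, inserted between them, and only that square matrix is updated per speaker; sizes are illustrative.

# SVD bottleneck adaptation: only the inserted rank x rank matrix is trained.
import torch
import torch.nn as nn

class SVDBottleneckAdapter(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        # frozen speaker-independent factors
        self.register_buffer("left", U[:, :rank] * S[:rank])   # (out, rank)
        self.register_buffer("right", Vh[:rank])                # (rank, in)
        self.register_buffer("bias", linear.bias.data.clone())
        # the only speaker-dependent parameters: a rank x rank matrix
        self.adapt = nn.Parameter(torch.eye(rank))

    def forward(self, x):
        return x @ (self.left @ self.adapt @ self.right).T + self.bias

layer = nn.Linear(1024, 1024)
adapted = SVDBottleneckAdapter(layer, rank=128)
n_adapt = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(n_adapt)   # 128*128 trainable values per speaker instead of ~1M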
Unsupervised learning approach to feature analysis for automatic speech emotion recognition
S. E. Eskimez, Z. Duan, and W. Heinzelman, "Unsupervised learning approach to feature analysis for automatic speech emotion recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5099-5103.
Reusing neural speech representations for auditory emotion recognition
E. Lakomkin, C. Weber, S. Magg, and S. Wermter, "Reusing neural speech representations for auditory emotion recognition," arXiv preprint arXiv:1803.11508, 2018.
WaveNet: A generative model for raw audio
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
Superseded - CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit
C. Veaux, J. Yamagishi, K. MacDonald et al., "Superseded - CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
Deep Speech: Scaling up end-to-end speech recognition
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
Neural machine translation by jointly learning to align and translate
D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
Attention-based sequence-to-sequence model for speech recognition: Development of a state-of-the-art system on LibriSpeech and its application to non-native English
Y. Yin, R. Prieto, B. Wang, J. Zhou, Y. Gu, Y. Liu, and H. Lin, "Attention-based sequence-to-sequence model for speech recognition: Development of a state-of-the-art system on LibriSpeech and its application to non-native English," arXiv preprint arXiv:1810.13088, 2018.
Domain adaptation using factorized hidden layer for robust automatic speech recognition
K. C. Sim, A. Narayanan, A. Misra, A. Tripathi, G. Pundak, T. N. Sainath, P. Haghani, B. Li, and M. Bacchiani, "Domain adaptation using factorized hidden layer for robust automatic speech recognition," in Interspeech, 2018, pp. 892-896.