Conference Paper

Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-Trained Representations

Authors: S. Shen, F. Liu, A. Zhou

Abstract

Fueled by recent advances of self-supervised models, pre-trained speech representations have proved effective for the downstream speech emotion recognition (SER) task. Most prior works mainly focus on exploiting pre-trained representations and just adopt a linear head on top of the pre-trained model, neglecting the design of the downstream network. In this paper, we propose a temporal shift module to mingle channel-wise information without introducing any parameter or FLOP. With the temporal shift module, three designed baseline building blocks evolve into corresponding shift variants, i.e. ShiftCNN, ShiftLSTM, and Shiftformer. Moreover, to balance the trade-off between mingling and misalignment, we propose two technical strategies, placement of shift and proportion of shift. The family of temporal shift models all outperform the state-of-the-art methods on the benchmark IEMOCAP dataset under both fine-tuning and feature extraction settings. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ShiftSER. [cite] S. Shen, F. Liu and A. Zhou, "Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-Trained Representations," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095193.
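For intuition, here is a minimal PyTorch sketch of the parameter-free temporal shift operation described above; the shift ratio and zero-padding choice are illustrative assumptions rather than the authors' exact placement and proportion of shift (see the linked repository for those).

```python
import torch


def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Parameter-free temporal shift over a (batch, time, channels) tensor.

    A fraction of the channels is shifted one step forward in time and an
    equal fraction one step backward; vacated positions are zero-padded.
    """
    B, T, C = x.shape
    fold = int(C * shift_ratio / 2)                        # channels shifted per direction
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out


# Example: shift frame-level features from a pre-trained speech encoder.
features = torch.randn(4, 100, 768)                        # (batch, frames, hidden size)
shifted = temporal_shift(features, shift_ratio=0.25)
print(shifted.shape)                                       # torch.Size([4, 100, 768])
```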


... Prior studies on SER were devoted to exploring the emotionally discriminative acoustic features in a hand-crafted way [2][3]. In recent years, deep learning algorithms [4][5][6][7] have become the mainstream in SER, owing to the capacity of automatic feature abstraction from raw speech spectrograms. Despite the advancements, SER models still suffer from severe performance degradation when tested on unseen data. ...
Article
Full-text available
One persistent challenge in deep learning based speech emotion recognition (SER) is the unconscious encoding of emotion-irrelevant factors (e.g., speaker or phonetic variability), which limits the generalization of SER in practical use. In this paper, we propose DSNet, a disentangled Siamese network with neutral calibration, to meet the demand for a more robust and explainable SER model. Specifically, we introduce an orthogonal feature disentanglement module to explicitly project the high-level representation into two distinct subspaces. Later, we propose a novel neutral calibration mechanism to encourage one subspace to capture sufficient emotion-irrelevant information. In this way, the other one can better isolate and emphasize the emotion-relevant information within speech signals. Experimental results on two popular benchmark datasets demonstrate the superiority of DSNet over various state-of-the-art methods for speaker-independent SER.
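A hedged sketch of the kind of orthogonality constraint such a disentanglement module can impose: two linear projections split an utterance representation into emotion-relevant and emotion-irrelevant codes, and a cosine penalty discourages overlap. The dimensions, module names, and loss form below are illustrative assumptions, not the DSNet authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrthogonalDisentangler(nn.Module):
    """Project a speech representation into two subspaces and penalize overlap."""

    def __init__(self, dim: int = 768, sub_dim: int = 128):
        super().__init__()
        self.emotion_proj = nn.Linear(dim, sub_dim)    # emotion-relevant subspace
        self.residual_proj = nn.Linear(dim, sub_dim)   # emotion-irrelevant subspace

    def forward(self, h: torch.Tensor):
        e = self.emotion_proj(h)                       # (batch, sub_dim)
        r = self.residual_proj(h)
        # Orthogonality penalty: squared cosine similarity between the two codes.
        ortho_loss = (F.cosine_similarity(e, r, dim=-1) ** 2).mean()
        return e, r, ortho_loss


h = torch.randn(8, 768)                                # utterance-level representation
e, r, loss = OrthogonalDisentangler()(h)
print(e.shape, r.shape, loss.item())
```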
... This facilitates the capturing of long-distance dependencies within audio signals, thereby improving the extraction of both local and high-level features. Table 1 (the variation in computational efficiency and performance of TFC-SpeechFormer and Transformer using HCE across different datasets; "(+)" indicates improvement, "(-)" a decrease) reports, among other methods:
[31]: 0.647, 0.655, -
CA-MSER [32]: 0.698, 0.711, -
SpeechFormer [24]: 0.629, 0.645, -
ISNet [33]: 0.704, 0.65, -
DST [22]: 0.718, 0.736, -
ShiftSER [34]: 0.721, 0.727, -
TFC-SpeechFormer (Ours): 0.746, 0.751, 0.743 ...
Preprint
Full-text available
Speech emotion recognition aims to automatically identify emotions in human speech, enabling machines to understand and engage emotional communication. In recent years, Transformers have demonstrated strong adaptability and significant effectiveness in speech recognition. However, Transformer models are proficient at capturing local features, struggle to extract fine-grained details, and often lead to computational redundancy, increasing time complexity. This paper proposes a Speech Emotion Recognition model named Temporal Fusion Convolution SpeechFormer (TFCS).The model comprises a Hybrid Convolutional Extractor (HCE) and multiple Temporal Fusion Convolution SpeechFormer Blocks (TFCSBs). HCE, consisting of an encoder and convolutional modules, enhances speech signals to capture local features and texture information, extracting frame-level features. TFCSBs utilize a newly proposed Temporal Fusion Convolution Module and Speech Multi-Head Attention Module to capture correlations between adjacent elements in speech sequences. TFCSBs utilize feature information captured by HCE to sequentially form frame, phoneme, word, and sentence structures, and integrates them to establish global and local relationships, enhancing local data capture and computational efficiency. Performance evaluation on the IEMOCAP and DAIC-WOZ datasets demonstrates the effectiveness of HCE in extracting fine-grained local speech features, with TFCS outperforming Transformer and other advanced models overall.
... Prior studies on SER were devoted to exploring the emotionally discriminative acoustic features in a hand-crafted way [2,3]. In recent years, deep learning algorithms [4][5][6][7] have become the mainstream in SER, owing to the capacity of automatic feature abstraction from raw speech spectrograms. Despite the advancements, SER models still suffer from severe performance degradation when tested on unseen data. ...
... In [65], natural language processing is used to analyze posts from the Internet to assess group-level mental health. For the detection of loneliness, deep learning is applied to detect and analyze dynamic facial expression emotion data [66], speech emotion data [67,68], and multi-modal emotion data [69,70], and even to predict personality through computational affection [71], enabling early warnings for those who have suicidal tendencies [64,65]. ...
Article
Full-text available
The COVID-19 pandemic has had a profound impact on public mental health, leading to a surge in loneliness, depression, and anxiety. These public psychological issues have increasingly become a factor affecting social order. As researchers explore ways to address these issues, artificial intelligence (AI) has emerged as a powerful tool for understanding and supporting mental health. In this paper, we provide a thorough literature review on the emotions (EMO) of loneliness, depression, and anxiety (EMO-LDA) before and during the COVID-19 pandemic. Additionally, we evaluate the application of AI in EMO-LDA research from 2018 to 2023 (AI-LDA) using Latent Dirichlet Allocation (LDA) topic modeling. Our analysis reveals a significant increase in the proportion of literature on EMO-LDA and AI-LDA before and during the COVID-19 pandemic. We also observe changes in research hotspots and trends in both fields. Moreover, our results suggest that collaborative research on EMO-LDA and AI-LDA is a promising direction for future work. In conclusion, our review highlights the urgent need for effective interventions to address the mental health challenges posed by the COVID-19 pandemic. Our findings suggest that the integration of AI in EMO-LDA research has the potential to provide new insights and solutions to support individuals facing loneliness, depression, and anxiety. We hope that our study will inspire further research in this vital and relevant domain.
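A small, self-contained sketch of the LDA topic-modeling step such a review relies on, using scikit-learn; the toy abstracts and the topic count are placeholders, not the paper's corpus or settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for paper abstracts collected for a literature review.
abstracts = [
    "loneliness and depression increased during the covid pandemic",
    "speech emotion recognition with deep learning for mental health",
    "anxiety detection from social media posts using topic models",
    "facial expression analysis supports early warning of depression",
]

counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Document-topic mixtures, one row per abstract.
print(lda.transform(counts))
```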
... We are also interested in the performance of utilizing well-designed downstream models that have been optimized specifically for the speech emotion recognition task. In this subsection, we replace the simple classifier used in the SUPERB benchmark with two state-of-the-art approaches, namely, Shiftformer [42] and SpeechFormer [65], to further evaluate the effectiveness of Vesper. The experimental results are shown in Table IV. ...
Preprint
This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. Above limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.
... Future studies should consider relevant factors to gain more accurate results. In addition, this research could be further optimized with the deep learning techniques used in this study, such as improved DFER [89] and speech emotion recognition [90] techniques, to increase the accuracy of recognition from a multi-modal perspective. ...
Article
Full-text available
How to cultivate innovative talents has become an important educational issue nowadays. In China's long-term mentorship education environment, the supervisor-student relationship often affects students' creativity. From the perspective of students' psychology, we explore the mechanism through which the supervisor-student relationship influences creativity, using machine learning and a questionnaire survey. In Study 1, based on video interviews with 16 postgraduate students, we use machine learning to analyze the emotional states exhibited by the students in the videos when associating them with the supervisor-student interaction scenario, finding that students show negative emotions in bad supervisor-student relationships. Subsequently, we further explore the impact of the supervisor-student relationship on postgraduate students' development in supervisor-student interaction scenarios at the affective level. In Study 2, a questionnaire survey is conducted to explore the relationships between the relevant variables, finding that a good supervisor-student relationship can significantly reduce power stereotype threat, decrease emotional labor surface behaviors, and promote creativity expression. These results theoretically reveal the internal psychological processes by which the supervisor-student relationship affects creativity, and have important implications for reducing emotional labor and enhancing creativity expression of postgraduate students. ------------------------------- Cite: J. Hu, Y. Liu, Q. Zheng and F. Liu*, "Emotional Mechanisms in Supervisor-Student Relationship: Evidence from Machine Learning and Investigation," in Journal of Social Computing, vol. 4, no. 1, pp. 30-45, March 2023, doi: 10.23919/JSC.2023.0005.
... Understanding the emotions of others through their facial expressions is critical during conversations. Thus, automated recognition of facial expressions is a significant challenge in various fields, such as human-computer interaction (HCI) [25,34], mental health diagnosis [12], driver fatigue monitoring [24], and metahuman [6]. While significant progress has been made in Static Facial Expression Recognition (SFER) [23,43,44,55], there is increasing attention on Dynamic Facial Expression Recognition. ...
Conference Paper
Full-text available
Dynamic Facial Expression Recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video format. Previous research has considered non-target frames as noisy frames, but we propose that the task should be treated as a weakly supervised problem. We also identify the imbalance of short- and long-term temporal relationships in DFER. Therefore, we introduce the Multi-3D Dynamic Facial Expression Learning (M3DFEL) framework, which utilizes Multi-Instance Learning (MIL) to handle inexact labels. M3DFEL generates 3D-instances to model the strong short-term temporal relationship and utilizes 3DCNNs for feature extraction. The Dynamic Long-term Instance Aggregation Module (DLIAM) is then utilized to learn the long-term temporal relationships and dynamically aggregate the instances. Our experiments on DFEW and FERV39K datasets show that M3DFEL outperforms existing state-of-the-art approaches with a vanilla R3D18 backbone. The source code is available at https://github.com/faceeyes/M3DFEL. [cite] Hanyang Wang, Bo Li, Shuang Wu, Siyuan Shen, Feng Liu, Shouhong Ding, Aimin Zhou. Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 17958-17968.
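A hedged PyTorch sketch of the multi-instance setup described above: a clip is split into short 3D instances, each encoded with an R3D-18 backbone, and the instance features are aggregated into a clip-level prediction. The simple attention pooling below stands in for the authors' DLIAM, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class MILVideoClassifier(nn.Module):
    """Split a clip into 3D instances, encode each, then aggregate."""

    def __init__(self, num_classes: int = 7, instance_len: int = 4):
        super().__init__()
        self.instance_len = instance_len
        backbone = r3d_18(weights=None)
        backbone.fc = nn.Identity()                    # keep 512-d instance features
        self.backbone = backbone
        self.attn = nn.Linear(512, 1)                  # simple attention pooling
        self.head = nn.Linear(512, num_classes)

    def forward(self, video: torch.Tensor):
        # video: (batch, channels, frames, height, width)
        B, C, T, H, W = video.shape
        N = T // self.instance_len
        instances = (
            video[:, :, : N * self.instance_len]
            .reshape(B, C, N, self.instance_len, H, W)
            .permute(0, 2, 1, 3, 4, 5)
            .reshape(B * N, C, self.instance_len, H, W)
        )
        feats = self.backbone(instances).reshape(B, N, -1)   # (B, N, 512)
        weights = torch.softmax(self.attn(feats), dim=1)     # (B, N, 1)
        clip_feat = (weights * feats).sum(dim=1)             # (B, 512)
        return self.head(clip_feat)


logits = MILVideoClassifier()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)                                          # torch.Size([2, 7])
```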
Conference Paper
Transformers have been used successfully in a variety of settings, including Speech Emotion Recognition (SER). However, use of the latest transformer base models in domain generalization (DG) settings has mostly been unexplored or only weakly explored. We present here our state-of-the-art results in discrete emotion recognition across a variety of datasets, including acted and non-acted datasets, showing that Whisper is a powerful base Transformer model for this task. We show that our approach to DG with Whisper results in accuracy surpassing all previously published results, with an Unweighted Average Recall (UAR) of 74.5% averaged across the 6 distinct datasets used. We discuss some of the possible reasons behind Whisper’s superior performance to other Transformer models, though all 3 Transformer models evaluated here (HuBERT, WavLM, Whisper) show an ability to generalize as well as learn paralinguistic information successfully through fine-tuning with relatively few examples.
Article
Full-text available
Recent advances in self-supervised models have led to effective pretrained speech representations in downstream speech emotion recognition tasks. However, previous research has primarily focused on exploiting pretrained representations by simply adding a linear head on top of the pretrained model, while overlooking the design of the downstream network. In this paper, we propose a temporal shift module with pretrained representations to integrate channel-wise information without introducing additional parameters or floating-point operations (FLOPs). By incorporating the temporal shift module, we developed corresponding shift variants for 3 baseline building blocks: ShiftCNN, ShiftLSTM, and Shiftformer. Furthermore, we propose 2 technical strategies, placement and proportion of shift, to balance the trade-off between mingling and misalignment. Our family of temporal shift models outperforms state-of-the-art methods on the benchmark Interactive Emotional Dyadic Motion Capture dataset in fine-tuning and feature-extraction scenarios. In addition, through comprehensive experiments using wav2vec 2.0 and Hidden-Unit Bidirectional Encoder Representations from Transformers representations, we identified the behavior of the temporal shift module in downstream models, which may serve as an empirical guideline for future exploration of channel-wise shift and downstream network design.
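To make the fine-tuning versus feature-extraction settings concrete, here is a hedged Hugging Face Transformers sketch that freezes a wav2vec 2.0 encoder and feeds its hidden states to a downstream head; the checkpoint name and the mean-pooled linear head are illustrative stand-ins for the paper's downstream shift models.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"                 # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
encoder = Wav2Vec2Model.from_pretrained(model_name)

# Feature-extraction setting: freeze the pre-trained encoder.
for p in encoder.parameters():
    p.requires_grad = False

waveform = torch.randn(16000)                         # placeholder 1 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, frames, 768)

# A downstream block (e.g., a shift-based model) would consume `hidden`;
# here a mean-pooled linear head stands in for it.
head = torch.nn.Linear(hidden.size(-1), 4)            # 4 emotion classes, IEMOCAP-style
logits = head(hidden.mean(dim=1))
print(logits.shape)                                   # torch.Size([1, 4])
```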
Article
Full-text available
In this work, we introduce a novel interpretation of residual networks showing they are exponential ensembles. This observation is supported by a large-scale lesion study that demonstrates they behave just like ensembles at test time. Subsequently, we perform an analysis showing these ensembles mostly consist of networks that are each relatively shallow. For example, contrary to our expectations, most of the gradient in a residual network with 110 layers comes from an ensemble of very short networks, i.e., only 10-34 layers deep. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble. Ultimately, residual networks do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network - rather, they avoid the problem simply by ensembling many short networks together. This insight reveals that depth is still an open research question and invites the exploration of the related notion of multiplicity.
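A toy sketch of the lesioning idea behind this interpretation: because every residual block computes x + f(x), individual blocks can be dropped at test time and the network still produces an output of the same form. The block definition and sizes below are illustrative, not the original study's ResNet-110.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)              # identity path keeps a direct route for signal


blocks = nn.ModuleList(ResidualBlock() for _ in range(10))
x = torch.randn(1, 64)


def run(x, skip=frozenset()):
    """Forward pass that optionally deletes ('lesions') some residual blocks."""
    for i, block in enumerate(blocks):
        if i not in skip:
            x = block(x)
    return x


full = run(x)
lesioned = run(x, skip={3, 7})            # remove two blocks at test time
print(torch.dist(full, lesioned))         # compare the full and lesioned outputs
```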
Article
Full-text available
Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 h of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
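For concreteness, a minimal PyTorch sketch of the standard LSTM cell update, with multiplicative gates controlling an additively updated memory cell (the mechanism behind the constant error carousel described above); the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn


class MinimalLSTMCell(nn.Module):
    """Standard LSTM cell: gates control a memory cell updated additively."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x, state):
        h, c = state
        gates = self.lin(torch.cat([x, h], dim=-1))
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)      # additive cell update preserves error flow
        h = o * torch.tanh(c)
        return h, c


cell = MinimalLSTMCell(input_size=40, hidden_size=64)
h = c = torch.zeros(1, 64)
for t in range(100):                       # e.g., 100 frames of acoustic features
    h, c = cell(torch.randn(1, 40), (h, c))
print(h.shape)                             # torch.Size([1, 64])
```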
Article
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM not only keeps the speech content modeling capability through masked speech prediction, but also improves its potential for non-ASR tasks through speech denoising. In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.
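Pre-trained WavLM checkpoints can be used through Hugging Face Transformers; below is a hedged usage sketch that extracts layer-wise hidden states and combines them with a SUPERB-style weighted sum. The checkpoint name and the uniform initial layer weights are assumptions.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

name = "microsoft/wavlm-base-plus"                     # assumed public checkpoint
extractor = AutoFeatureExtractor.from_pretrained(name)
model = WavLMModel.from_pretrained(name, output_hidden_states=True)

speech = torch.randn(16000).numpy()                    # placeholder 1 s, 16 kHz waveform
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).hidden_states      # tuple: embeddings + 12 layers

# SUPERB-style weighted sum over layers as input to a downstream task.
stack = torch.stack(hidden_states, dim=0)              # (13, 1, frames, 768)
layer_weights = torch.softmax(torch.zeros(stack.size(0)), dim=0)
features = (layer_weights[:, None, None, None] * stack).sum(dim=0)
print(features.shape)                                  # (1, frames, 768)
```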
Article
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960 h) and Libri-light (60,000 h) benchmarks with 10 min, 1 h, 10 h, 100 h, and 960 h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
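A hedged sketch of the offline clustering step that produces first-iteration HuBERT-style targets: k-means over MFCC frames assigns each frame a discrete pseudo-label for masked prediction. The toy waveform, MFCC settings, and MiniBatchKMeans choice are illustrative.

```python
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans

# Toy waveform standing in for unlabeled pre-training audio.
wave = np.random.randn(16000 * 10).astype(np.float32)        # 10 s at 16 kHz

# Frame-level MFCC features; first-iteration HuBERT targets are built this way.
mfcc = librosa.feature.mfcc(y=wave, sr=16000, n_mfcc=13).T   # (frames, 13)

# Offline k-means: each frame gets a discrete cluster id used as its pseudo-label.
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(mfcc)
pseudo_labels = kmeans.predict(mfcc)
print(pseudo_labels[:20])        # target units for BERT-style masked prediction
```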
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
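The core of the attention mechanism described here is the scaled dot-product; a compact sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V follows, with arbitrary tensor sizes.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(2, 10, 64)         # (batch, sequence length, d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                           # torch.Size([2, 10, 64])
```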
Article
Many machine learning tasks can be expressed as the transformation, or transduction, of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.
  • Alex Graves
Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
TSM: Temporal shift module for efficient video understanding
  • J. Lin
MetaFormer is actually what you need for vision
  • W. Yu
A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition
  • Srividya Tirunellai Rajamani
  • Kumar T. Rajamani
  • Adria Mallol-Ragolta
  • Shuo Liu
  • Björn Schuller
wav2vec 2.0: A framework for self-supervised learning of speech representations
  • Alexei Baevski
  • Yuhao Zhou
  • Abdelrahman Mohamed
  • Michael Auli
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449-12460, 2020.
A ConvNet for the 2020s
  • Zhuang Liu
  • Hanzi Mao
  • Chao-Yuan Wu
  • Christoph Feichtenhofer
  • Trevor Darrell
  • Saining Xie