Table 1
A summary of the experiment results for the various model structures. Up-Samp means up-sampling the minority-class samples; Aug. means general data augmentation. The accuracy presented is evaluated on the development set with the unweighted average recall (UAR) metric.
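For reference, UAR is simply the mean of the per-class recalls, so each class counts equally regardless of its sample count. A minimal sketch of the computation (class names Low/Medium/High taken from the sub-challenge; the helper name is illustrative):

```python
# Minimal sketch: unweighted average recall (UAR), the metric used in Table 1.
# UAR is the mean of the per-class recalls, so each class contributes equally
# regardless of how many samples it has (important for the imbalanced "low" class).
import numpy as np

def unweighted_average_recall(y_true, y_pred, classes=("low", "medium", "high")):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in classes:
        mask = (y_true == c)
        if mask.sum() == 0:          # skip classes absent from the reference labels
            continue
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))

# Equivalent to sklearn.metrics.recall_score(y_true, y_pred, average="macro").
```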
Contexts in source publication
Context 1
... conduct and compare our recognition results with the following list of models; all of the evaluation results are computed on the development set using the metric of unweighted average recall (UAR). The Input Fc variant means that the input low-level descriptors are passed through a fully-connected layer before being fed into the BLSTM training. Table 1 summarizes the performance of each model. In short, two classification models are used in our work: SVM and BLSTM. ...
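The Input Fc variant described above can be pictured as a per-frame projection placed in front of the bidirectional LSTM. The following is an illustrative PyTorch sketch only; all layer sizes and the mean-pooling readout are assumptions, not the authors' configuration:

```python
# Illustrative sketch of the "Input Fc" variant: frame-level LLDs are projected by a
# fully connected layer before being fed to a bidirectional LSTM. Layer sizes and the
# temporal pooling choice are assumptions for illustration, not the paper's settings.
import torch
import torch.nn as nn

class InputFcBLSTM(nn.Module):
    def __init__(self, n_lld=130, fc_dim=64, hidden=60, n_classes=3):
        super().__init__()
        self.input_fc = nn.Sequential(nn.Linear(n_lld, fc_dim), nn.ReLU())
        self.blstm = nn.LSTM(fc_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):               # x: (batch, time, n_lld)
        h = self.input_fc(x)            # per-frame fully connected projection
        out, _ = self.blstm(h)          # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)        # simple temporal average (one possible choice)
        return self.classifier(pooled)  # class logits for Low / Medium / High

logits = InputFcBLSTM()(torch.randn(4, 200, 130))
```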
Context 2
... the imbalanced class distribution in this sub-challenge leads to worse classification of the minority class (Low). From Table 1, we observe a clear improvement in the UAR of the Low class (from 24.05% to 54.43%) with the up-sampling method. However, the UAR scores of the High and Medium classes drop slightly compared with the original method without up-sampling. ...
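The up-sampling referred to here is, in its simplest form, random over-sampling of the under-represented class with replacement. A hedged sketch of that idea (function and variable names are hypothetical):

```python
# Hedged sketch of minority-class up-sampling of the kind labelled "Up-Samp":
# samples of under-represented classes (e.g., "low") are randomly repeated with
# replacement until every class matches the size of the largest class.
import numpy as np

def upsample_minority(features, labels, rng=np.random.default_rng(0)):
    """features: (n_samples, ...) NumPy array; labels: (n_samples,) class labels."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        class_idx = np.where(labels == c)[0]
        extra = rng.choice(class_idx, size=target - n, replace=True) if n < target else []
        idx.extend(class_idx.tolist() + list(extra))
    idx = np.asarray(idx)
    return features[idx], labels[idx]
```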
Citations
... However, these models are complex and difficult to train, posing significant challenges in the field of speech emotion recognition. With the rapid advancement of deep learning, convolutional neural networks (CNN) [7,8], recurrent neural networks (RNN) [9,10], Long Short-Term Memory networks (LSTM) [11,12], and Bidirectional LSTM networks (Bi-LSTM) [13,14] have been proposed successively, garnering more attention in the study of speech emotion recognition. Wang et al. [15] proposed the Dual Sequence LSTM (DS-LSTM) model, which utilizes two Mel-spectrograms with different time-frequency resolutions to predict emotions, achieving good results. ...
Speech emotion recognition aims to automatically identify emotions in human speech, enabling machines to understand and engage in emotional communication. In recent years, Transformers have demonstrated strong adaptability and significant effectiveness in speech recognition. However, Transformer models are less proficient at capturing local features, struggle to extract fine-grained details, and often introduce computational redundancy, increasing time complexity. This paper proposes a Speech Emotion Recognition model named Temporal Fusion Convolution SpeechFormer (TFCS). The model comprises a Hybrid Convolutional Extractor (HCE) and multiple Temporal Fusion Convolution SpeechFormer Blocks (TFCSBs). HCE, consisting of an encoder and convolutional modules, enhances speech signals to capture local features and texture information, extracting frame-level features. TFCSBs use a newly proposed Temporal Fusion Convolution Module and a Speech Multi-Head Attention Module to capture correlations between adjacent elements in speech sequences. TFCSBs use the feature information captured by HCE to sequentially form frame, phoneme, word, and sentence structures and integrate them to establish global and local relationships, enhancing local feature capture and computational efficiency. Performance evaluation on the IEMOCAP and DAIC-WOZ datasets demonstrates the effectiveness of HCE in extracting fine-grained local speech features, with TFCS outperforming Transformer and other advanced models overall.
... Due to the limitations of traditional acoustic features, the representational ability of the trained networks is insufficient, and it is difficult to achieve high recognition accuracy. Su et al. [30] trained the BiLSTM network and SVM models separately and then fused the classification results using a decision-level score fusion scheme to integrate all developed models. Fang et al. [31] proposed a speech deception detection strategy combining semi-supervised and fully supervised methods and constructed a hybrid model combining a semi-supervised DAE and a fully supervised LSTM network, effectively improving the accuracy of semi-supervised speech deception detection. ...
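The decision-level score fusion mentioned for Su et al. [30] amounts to combining the class posteriors of the two classifiers and deciding from the fused score. A minimal sketch, assuming both models expose class probabilities and using equal weights (the actual weighting may differ):

```python
# Minimal sketch of decision-level score fusion: the class posteriors of the BLSTM
# and the probability estimates of the SVM are combined (here by a weighted average)
# and the fused score decides the final label. Equal weights are an assumption.
import numpy as np

def fuse_scores(blstm_probs, svm_probs, weights=(0.5, 0.5)):
    """Both inputs: (n_samples, n_classes) arrays of class probabilities."""
    fused = weights[0] * np.asarray(blstm_probs) + weights[1] * np.asarray(svm_probs)
    return fused.argmax(axis=1)       # index of the winning class per sample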
Human lying is influenced by cognitive neural mechanisms in the brain, and research on lie detection in speech can help reveal the cognitive mechanisms of the human brain. Inappropriate deception-detection features can easily lead to the curse of dimensionality and worsen the generalization ability of the widely used semi-supervised speech deception detection models. Therefore, this paper proposes a semi-supervised speech deception detection algorithm combining acoustic statistical features and time-frequency two-dimensional features. First, a hybrid semi-supervised neural network based on a semi-supervised autoencoder network (AE) and a mean-teacher network is established. Second, the static artificial statistical features are input into the semi-supervised AE to extract more robust advanced features, and the three-dimensional (3D) mel-spectrum features are input into the mean-teacher network to obtain features rich in time-frequency two-dimensional information. Finally, a consistency regularization method is introduced after feature fusion, effectively reducing over-fitting and improving the generalization ability of the model. Experiments are carried out on a self-built corpus for deception detection. The experimental results show that the highest recognition accuracy of the proposed algorithm is 68.62%, which is 1.2% higher than the baseline system and effectively improves detection accuracy.
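For readers unfamiliar with the mean-teacher setup mentioned in this abstract, the core mechanism is an exponential moving average (EMA) of the student's weights plus a consistency loss that penalizes disagreement between the two networks' predictions. The sketch below illustrates that general mechanism only; the model definitions, EMA decay, and loss form are assumptions, not the cited paper's exact recipe:

```python
# Hedged sketch of mean-teacher consistency regularization: the teacher's weights are
# an exponential moving average (EMA) of the student's, and an MSE consistency term
# penalizes disagreement between their predictions on (possibly unlabeled) inputs.
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.99):
    # Teacher <- decay * teacher + (1 - decay) * student, applied parameter-wise.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def consistency_loss(student_logits, teacher_logits):
    # MSE between softened predictions; the teacher output is treated as a fixed target.
    return F.mse_loss(F.softmax(student_logits, dim=-1),
                      F.softmax(teacher_logits.detach(), dim=-1))
```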
... a) OLR-LID tasks • Augmentation: Augmentation (e.g., velocity and volume perturbation) is widely used. Some systems use background noise extracted from the training data and mp3/mp4a codecs [20], as well as GRU, BLSTM [21], attention structures, attentive pooling, Global Context Network (GC-Net) [22], SpecAugment [23], NetVLAD [24], and Vector of Locally Aggregated Descriptors (VLAD) [25]. • Auxiliary information: The introduction of ASR to help language recognition is investigated by top teams. ...
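Of the augmentation techniques listed, SpecAugment is the simplest to illustrate: random blocks of frequency bins and time frames of the spectrogram are masked. A hedged sketch of that style of masking (mask widths are illustrative, not the values used by the cited systems):

```python
# Illustrative sketch of SpecAugment-style masking on a log-Mel spectrogram:
# one random block of frequency bins and one random block of time frames are zeroed.
import numpy as np

def spec_augment(spec, max_f=8, max_t=20, rng=np.random.default_rng()):
    """spec: (n_mels, n_frames) array; returns a masked copy."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, max_f + 1)                # frequency-mask width
    f0 = rng.integers(0, max(1, n_mels - f))
    spec[f0:f0 + f, :] = 0.0                      # frequency mask
    t = rng.integers(0, max_t + 1)                # time-mask width
    t0 = rng.integers(0, max(1, n_frames - t))
    spec[:, t0:t0 + t] = 0.0                      # time mask
    return spec
```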
... In 2019, Jalal et al. [60] implemented a hybrid model based on BLSTM, a 1D Conv-Cap, and capsule routing layers for SER. Ng and Liu [61] used a capsule-network-based model to encode spatial information from speech spectrograms and analyze the performance under various loss functions on several datasets. ...
Recognizing the speaker’s emotional state from speech signals plays a very crucial role in human–computer interaction (HCI). Nowadays, numerous linguistic resources are available, but most of them contain samples of a discrete length. In this article, we address the leading challenge in Speech Emotion Recognition (SER), which is how to extract the essential emotional features from utterances of a variable length. To obtain better emotional information from the speech signals and increase the diversity of the information, we present an advanced fusion-based dual-channel self-attention mechanism using convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) networks. We extracted six spectral features (Mel-spectrograms, Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square). The Conv-Cap module was used to obtain Mel-spectrograms, while the Bi-GRU was used to obtain the rest of the spectral features from the input tensor. The self-attention layer was employed in each module to selectively focus on optimal cues and determine the attention weight to yield high-level features. Finally, we utilized a confidence-based fusion method to fuse all high-level features and pass them through the fully connected layers to classify the emotional states. The proposed model was evaluated on the Berlin (EMO-DB), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and Odia (SITB-OSED) datasets to improve the recognition rate. During experiments, we found that our proposed model achieved high weighted accuracy (WA) and unweighted accuracy (UA) values, i.e., 90.31% and 87.61%, 76.84% and 70.34%, and 87.52% and 86.19%, respectively, demonstrating that the proposed model outperformed the state-of-the-art models using the same datasets.
... Some systems used background noise extracted from the training data, white noise, and random artificial band-pass filters. Most systems applied the SpecAugment strategy [23] [25], GRU, BLSTM [26], attention structures, attentive pooling, Global Context Network (GC-Net) [27], NetVLAD [28], or Vector of Locally Aggregated Descriptors (VLAD) [29]. • Auxiliary information: The introduction of ASR to help language recognition was investigated by top teams (two out of five top teams used E2E ASR technologies). ...
The fifth Oriental Language Recognition (OLR) Challenge focuses on language recognition in a variety of complex environments to promote its development. The OLR 2020 Challenge includes three tasks: (1) cross-channel language identification, (2) dialect identification, and (3) noisy language identification. We choose Cavg as the principal evaluation metric and the Equal Error Rate (EER) as the secondary metric. There were 58 teams participating in this challenge, and one third of the teams submitted valid results. Compared with the best baseline, the Cavg values of the Top-1 systems for the three tasks were relatively reduced by 82%, 62% and 48%, respectively. This paper describes the three tasks, the database profile, and the final results. We also outline the novel approaches that improve the performance of language recognition systems most significantly, such as the utilization of auxiliary information.
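As a quick reference for the secondary metric, EER is the operating point at which the false-acceptance and false-rejection rates are equal. A hedged sketch of one way to estimate it from detection scores (Cavg is omitted here because it depends on challenge-specific target priors and cost parameters):

```python
# Hedged sketch of the Equal Error Rate (EER): sweep a decision threshold over the
# detection scores and report the point where the false-acceptance rate (FAR) and
# false-rejection rate (FRR) are closest, taking their average as the EER.
import numpy as np

def equal_error_rate(scores, labels):
    """scores: detection scores; labels: 1 for target trials, 0 for non-target."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    best_gap, eer = np.inf, 1.0
    for th in np.unique(scores):
        far = np.mean(scores[labels == 0] >= th)   # false acceptances
        frr = np.mean(scores[labels == 1] < th)    # false rejections
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```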
... Recurrent Stage: Gated recurrent units (GRU) and long short-term memory units (LSTM) [54] are the two most common recurrent types in paralinguistics [9], [18], [19], [55], [56]. Unidirectional [9], [18] as well as bidirectional [19], [56] networks are popular. We used the CuDNN implementations to reduce the training time of the recurrent units. ...
... Temporal Integration Stage: The temporal integration operations were adapted from Mirsamadi et al. [59]. In particular, attention pooling was widely employed in the INTERSPEECH 2018 ComParE challenge [55], [56]. All pooling types incorporate outputs at every time step, while "Last Step" means that only the output of the last time step is forwarded to the next layer (alias "many-to-one" prediction). Since last-step integration is not applicable to convolutional stages, whose units do not carry internal states, we used flattening, corresponding to the vertical integration operation of VGGNet [41]. ...
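Two of the integration operations contrasted above are easy to sketch side by side: attention pooling learns a weighted average over time, while "Last Step" simply forwards the final output. An illustrative PyTorch sketch (dimensions and the single-layer scoring function are assumptions):

```python
# Illustrative sketch of two temporal integration operations over recurrent outputs
# of shape (batch, time, dim): learned attention pooling vs. last-step forwarding.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # one scalar score per time step

    def forward(self, h):                              # h: (batch, time, dim)
        weights = torch.softmax(self.score(h), dim=1)  # attention weights over time
        return (weights * h).sum(dim=1)                # weighted average over time

def last_step(h):                                      # "many-to-one" integration
    return h[:, -1, :]
```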
In this study we compared various neural network types for the task of automatic infant vocalization classification, i.e., convolutional, recurrent and fully-connected networks as well as combinations thereof. The goal was to first determine the optimal configuration for each network type and then identify the type with the highest overall performance. This investigation helps to employ neural networks more effectively in infant vocalization classification tasks, which typically offer low amounts of training data. To this end, we defined a unified neural network architecture scheme for audio classification from which we derived various network types. For each type we performed a semi-random hyperparameter search which employed regression trees both to focus the search space and to derive insights on the most influential parameters. We finally compared the test performances of the best performing configurations in a contest-like setup. Our key findings are: (1) Networks with convolutional stages reached the highest performance, regardless of being combined with fully-connected or recurrent layers. (2) The most influential architectural hyperparameter for all types was the integration operation for reducing tensor dimensionality between network stages. The best performing configurations reached test performances of 75% unweighted average recall, surpassing previously published benchmarks.
... Rather than using only a single type of information, we implemented a fusion strategy over the various sources of information to make the classification decision, so that a more sensible conclusion could be reached. The fusion methods for speech emotion features can be roughly separated into two categories: feature-level fusion [28,29] and decision-level fusion [30,31]. In feature-level fusion, emotion features extracted by different models are combined to generate a more informative representation for classification. ...
... Both the RNN and the SVM generated probabilities for the concerned classes, which were used as the confidence scores of these two models. The confidence scores of the RNN and the SVM were averaged for each concerned class to form the fusion confidence score of the integrated model [30]. Considering the insufficiency of speech emotion corpora, in this study we adopted decision-level fusion to achieve high performance on the emotion recognition task. ...
... After we obtained the outputs from the three different classifiers, each of which used a different form of feature for a given speech utterance as input, we combined the three models to improve the final recognition performance. Specifically, we developed confidence-based decision-level fusion using the sum of confidence scores, following the study [30]. The confidence scores were separately generated from the softmax layers of the three individual classifiers. ...
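The fusion rule described here reduces to summing the per-class softmax confidences of the three classifiers and picking the class with the highest fused score. A minimal sketch under that reading (argument names reflect the three models in the abstract below and are otherwise illustrative):

```python
# Minimal sketch of confidence-based decision-level fusion: the per-class softmax
# confidences of the three classifiers (e.g., HSF-DNN, MS-CNN, LLD-RNN) are summed
# and the class with the highest fused confidence is chosen.
import numpy as np

def confidence_fusion(dnn_probs, cnn_probs, rnn_probs):
    """Each argument: (n_samples, n_classes) softmax outputs of one classifier."""
    fused = np.asarray(dnn_probs) + np.asarray(cnn_probs) + np.asarray(rnn_probs)
    return fused.argmax(axis=1)
```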
Speech emotion recognition plays an increasingly important role in affective computing and is still a challenging task due to its complexity. In this study, we developed a framework integrating three distinctive classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (i.e., angry, happy, neutral and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) on LLDs were passed to the RNN, CNN, and DNN, respectively. Three individual models, LLD-RNN, MS-CNN, and HSF-DNN, were obtained. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was utilized to aggregate the CNN and RNN outputs. To effectively utilize the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in these three models to acquire generalized features by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to integrate the strengths of the different classifiers in recognizing different emotional states. Three experiments on emotion recognition based on the IEMOCAP corpus were conducted. Our experimental results show that the weighted-pooling method based on the attention mechanism endowed the neural networks with the capability to focus on emotionally salient parts. The generalized features learned through multi-task learning helped the neural networks achieve higher accuracies in the emotion classification tasks. Furthermore, our proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, which were significantly higher than those of each individual classifier. The effectiveness of the proposed approach based on classifier fusion was thus validated.
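The multi-task learning strategy mentioned in this abstract combines a classification loss over discrete categories with a regression loss over continuous attributes. A hedged sketch of such a joint objective (the weighting factor and loss forms are illustrative assumptions, not the paper's exact choices):

```python
# Hedged sketch of a joint multi-task objective: cross-entropy over discrete emotion
# categories plus MSE over continuous emotion attributes, with a weighting factor.
import torch
import torch.nn.functional as F

def multitask_loss(class_logits, class_targets, attr_preds, attr_targets, alpha=0.5):
    cls_loss = F.cross_entropy(class_logits, class_targets)   # discrete categories
    reg_loss = F.mse_loss(attr_preds, attr_targets)           # continuous attributes
    return cls_loss + alpha * reg_loss                        # alpha is an assumed weight
```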
Research on speech recognition (SR) for the Kashmiri language using deep learning (DL) remains less explored due to the complexities inherent in its grammatical structure, contextual ambiguity, and the scarcity of standardized databases. Developing SR for the Kashmiri language broadens technological applications and explores additional means of communication. However, the development of such systems poses significant challenges owing to the Kashmiri language's complexity compared to other languages and the lack of widely available databases. In this study, we collected and employed a large new Kashmiri spoken numeral dataset comprising 7,200 samples of spoken words, each representing one of twenty distinct classes, from one to twenty. Mel-spectrograms were extracted from each speech signal, and the resulting feature matrix was reshaped into a suitable format for input into the proposed model. We implemented a Bi-directional Long Short-Term Memory (LSTM) network on the Kashmiri spoken numeral dataset and obtained an average cross-validation accuracy of 85.28%. The model's performance underscores the importance of optimizing various parameters to achieve the best accuracy. Among the features tested (spectrogram, Mel-spectrogram, MFCC, and delta-MFCC), the Mel-spectrogram yields the highest accuracy of 85.28%. In Mel-spectrogram feature extraction, testing different frame sizes (15, 20, 25, 30) and overlap sizes (10, 15, 20, 30) revealed that a frame size of 25 milliseconds and an overlap size of 20 milliseconds provided optimal results. Further refinement involved experimenting with various combinations of Bi-directional LSTM layers and fully connected (FC) layers, such as (4, 2), (3, 2), (3, 1), (2, 2), and (2, 1), to determine the most effective architecture. After thorough experimentation, the optimal performance was achieved with a configuration consisting of two Bi-directional LSTM layers and one FC layer. Testing four optimizers (SGD, Adamax, RMSprop, Adam) during training revealed that Adam provided the best results. All configurations were systematically tested to optimize model performance. Such recognition systems are applicable in voice interfaces for keyword detection, offering significant advantages for embedded and mobile devices.
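To make the reported front end concrete: a 25 ms frame with a 20 ms overlap corresponds to a 5 ms hop. The sketch below shows one way to compute such Mel-spectrograms with librosa and a matching two-layer Bi-LSTM plus one FC layer; the number of Mel bands, hidden size, and last-step readout are illustrative assumptions, not the paper's exact configuration:

```python
# Illustrative sketch: Mel-spectrograms with a 25 ms frame and 20 ms overlap (5 ms hop)
# feeding two bidirectional LSTM layers and one fully connected layer over 20 classes.
import librosa
import torch
import torch.nn as nn

def mel_features(path, sr=16000, frame_ms=25, overlap_ms=20, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)                 # 25 ms analysis window
    hop = int(sr * (frame_ms - overlap_ms) / 1000)    # 5 ms hop = 20 ms overlap
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T                 # (n_frames, n_mels)

class NumeralBiLSTM(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=20):
        super().__init__()
        self.blstm = nn.LSTM(n_mels, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)  # two Bi-LSTM layers
        self.fc = nn.Linear(2 * hidden, n_classes)                  # one FC layer

    def forward(self, x):                  # x: (batch, n_frames, n_mels)
        out, _ = self.blstm(x)
        return self.fc(out[:, -1, :])      # classify from the last time step
```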