Self-Assessed Affect Recognition Using Fusion of Attentional BLSTM and Static Acoustic Features

Bo-Hao Su 1,2, Sung-Lin Yeh 1,2, Ming-Ya Ko 1,2, Huan-Yu Chen 1,2, Shun-Chang Zhong 1,2, Jeng-Lin Li 1,2, Chi-Chun Lee 1,2
1 Department of Electrical Engineering, National Tsing Hua University, Taiwan
2 MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan
Abstract
In this study, we present a computational framework for the Self-Assessed Affect Sub-Challenge of the INTERSPEECH 2018 Computational Paralinguistics Challenge. The goal of this sub-challenge is to classify the valence scores given by the speakers themselves into three levels: low, medium, and high. We explore the fusion of a bi-directional LSTM (BLSTM) with baseline SVM models to improve recognition accuracy. Specifically, we extract frame-level acoustic low-level descriptors (LLDs) as input to a BLSTM with a modified attention mechanism, and we train separate SVMs on the standard ComParE 16 baseline feature set, with and without minority-class upsampling. These diverse predictions are then combined using a decision-level score fusion scheme that integrates all of the developed models. Our proposed approach achieves 62.94% and 67.04% unweighted average recall (UAR) on the development and test sets, respectively, which is a 6.24% and 1.04% absolute improvement over the best baseline provided by the challenge organizers. We further provide a detailed comparative analysis of the different models.
Index Terms: computational paralinguistics, BLSTM, affect
recognition, attention mechanism
1. Introduction
Computing paralinguistic attributes from speech is becoming more prevalent across a variety of tasks. Beyond automatic speech recognition alone, modeling speech signals to extract other relevant attributes of human states and traits (e.g., cold and snoring [1], Alzheimer’s disease [2, 3], and autism diagnoses [4]) has sparked considerable technical research effort, and many of these works show that such higher-level human attributes can indeed be estimated from speech signals. The range of potential applications is vast; in fact, a series of challenges has been proposed to tackle robust recognition of different human states and traits. The ComParE 2018 Challenge consists of four sub-challenges: the Atypical Affect Sub-Challenge, the Self-Assessed Affect Sub-Challenge, the Crying Sub-Challenge, and the Heart Beats Sub-Challenge. In this work, we present our algorithm for the Self-Assessed Affect Sub-Challenge.
Many of these real-life tasks naturally suffer from limited data samples. Moreover, the exact mechanism by which these attributes manifest in the speech signal is often complex and intertwined with other unwanted factors, e.g., individual idiosyncrasies, environmental noise, and other human attributes and traits. In this work, we focus on affect recognition, and in particular on self-assessed affect states (rather than the conventional perception-based affect states). In mental illnesses such as depression, the patient’s emotion influences the outcome of the therapy process and the morbidity of the illness. If the patient’s emotional well-being continues to degrade, the patient may even lose the ability to carry out daily activities. Research has indicated that an improvement in a patient’s self-assessed affect states during therapy can have a substantial impact on his/her quality of life [5, 6].
Although self-assessed affect is an important health indicator, in practice most of these people tend not to self-assess or disclose their own affective states. The ability to automatically sense and detect an individual’s self-assessed valence state from unobtrusive and easily obtainable behavior signals, such as speech and facial expressions, is therefore becoming increasingly important, especially in health-related applications. In this work, we present a technical framework that fuses various approaches to achieve robust recognition of self-assessed valence from speech. Specifically, we utilize two types of model: a time-series model and a static model. The static model is obtained by training an SVM classifier on the ComParE 16 feature set with functional encoding (6373-dimensional features). To further improve recall on the minority class (low), we also train a static SVM on the ComParE 16 feature set with minority-class upsampling.
For the time-series model, we first compute the low-level descriptor part of the ComParE 16 feature set (130 dimensions per frame). We use a bi-directional LSTM (BLSTM), which captures forward and backward time-dependent acoustic information, to perform affect recognition. We also combine an attention mechanism with the BLSTM, and we modify the conventional structure of the attention weights by inserting a dense (fully-connected) layer in the computation of the attention weight for each time step of the BLSTM. This additional non-linear transformation helps improve the recognition rates. Finally, the probability outputs obtained from each of these models are averaged to perform the final fused recognition. Overall, our proposed approach achieves 62.94% and 67.04% unweighted average recall (UAR) in this three-class recognition task, which is a 6.24% and 1.04% absolute improvement over the best baseline provided by the challenge organizers on the development set and the blind test set, respectively.
The rest of this paper is organized as follows. Section 2 elaborates on the methods used in this work. Section 3 presents the experimental results and discussion. The last section concludes and outlines future work.
2. Methodology
Our proposed approach has multiple components, each of which is described in the following subsections.
Interspeech 2018, 2-6 September 2018, Hyderabad. DOI: 10.21437/Interspeech.2018-2261
Figure 1: The complete schematic of our framework: upsampling minority class in our database, training both time-series model
(BLSTM with modified attention mechanism) and a static model (SVM with ComParE 16 features), and finally integrating diverse
models in a decision-level fusion scheme
2.1. BLSTM with Modified Attention Mechanism
2.1.1. Bi-directional LSTM
Long Short-Term Memory (LSTM) neural networks were first proposed by Hochreiter and Schmidhuber [7]. An LSTM preserves long-term contextual information from its inputs in its hidden state. It improves on the recurrent neural network (RNN) by introducing three control gates: the input gate, output gate, and forget gate, which control the write, read, and reset operations of the hidden cells. This helps mitigate the exploding and vanishing gradient problems of RNNs. A conventional forward LSTM is uni-directional, i.e., information can only flow from the past to the future because of the forward propagation of the network structure. The bidirectional LSTM (BLSTM) network improves on the standard forward LSTM by processing a sequence of features in both the forward and backward directions.
The original LSTM state:
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are the input gate, forget gate, output gate and cell state.
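For context, the output-gate expression above has the same form as the widely used peephole LSTM formulation (as in Graves and Schmidhuber [8]). Assuming that formulation, which the text here does not spell out in full, the complete set of updates could be written as follows, where $\odot$ (element-wise product) and the weight and bias names for the other gates are notation introduced for this sketch:

```latex
\begin{align}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Note that the output gate sees the updated cell state $c_t$, while the input and forget gates see $c_{t-1}$.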
The Bidirectional LSTM state:
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
Using the combined hidden states allows us to preserve information from both the past and the future at any given time step. This approach has been shown to be useful for sequence modeling tasks [8]. Another modification of the LSTM is the Gated Recurrent Unit (GRU) [9]. Like the LSTM, the GRU aims to track long-term dependencies effectively while avoiding the vanishing/exploding gradient problems. The key difference is that the GRU uses only two gates (reset and update gates). The relatively simpler structure of the GRU helps achieve faster training; the trade-off, however, is that the GRU remembers only shorter sequences in tasks that require modeling long-distance relations.
2.1.2. Modified Attention Mechanism
The attention mechanism is widely used in sequence-based encoder-decoder models. Because the encoder-decoder architecture relies on a fixed-length vector, it performs well on short sequences but not on long ones. As the sequence grows longer, the information it contains often becomes too complex for a fixed-length vector to represent. A simple encoder therefore learns an unreliable representation of such long sequences, leading to poor decoder output. The attention mechanism helps mitigate this issue by applying weights to the intermediate outputs from each step [10]; in other words, the outputs are generated by a selection mechanism over the inputs.
In this work, we also apply an attention mechanism in building our time-series BLSTM model. Specifically, the temporal pooling in our BLSTM model is performed by computing a weighted sum over time [11]. The standard way to use an attention mechanism with a BLSTM is to choose a simple logistic-regression-like weighted sum as the pooling layer. The weights are derived from the inner product between the frame-wise outputs of the BLSTM, $y_t$, and a parameter vector $u$, as in an attention model. To keep the weights summing to unity, we apply the softmax function to the inner product.
After obtaining the weights, we calculate the weighted sum over time to get the hidden representation, which integrates the attention mechanism into our BLSTM.
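Written out, and assuming the weighted-pooling formulation of Mirsamadi et al. [11], the two steps just described would read as follows. Here $T$ (the number of frames in the utterance) and the summation index $\tau$ are symbols introduced for this sketch:

```latex
\begin{align}
\alpha_t &= \frac{\exp(u^{\top} y_t)}{\sum_{\tau=1}^{T} \exp(u^{\top} y_\tau)} \\
z &= \sum_{t=1}^{T} \alpha_t \, y_t
\end{align}
```

The softmax guarantees $\sum_t \alpha_t = 1$, so $z$ is a convex combination of the frame-wise BLSTM outputs.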
In our approach, we modify this attention mechanism by adding a fully-connected layer to the computation of attention, i.e., instead of directly computing the dot product between the feature output and the parameter vector, we enhance the modeling power of the attention weights with a more sophisticated non-linear transformation (see Figure 1 for the network structure). Finally, the newly weighted hidden representation $z'$ (computed with the modified attention weights $\alpha'_t$) is fed into another softmax dense layer to compute the final probability of each class. The entire network is jointly optimized over these modules.
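The difference between the standard and the modified attention pooling can be illustrated with a small standard-library Python sketch. It is not the authors' implementation: the exact form of the dense layer is not given in the text, so the `tanh` activation, the layout of `W` (one weight vector per hidden unit), and all function names below are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(alpha, frames):
    """Pool frame vectors over time using attention weights alpha."""
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(dim)]

def standard_attention_pool(frames, u):
    """Standard pooling: score each frame y_t directly with u^T y_t."""
    alpha = softmax([dot(f, u) for f in frames])
    return weighted_sum(alpha, frames), alpha

def modified_attention_pool(frames, W, b, u):
    """Assumed modified pooling: a dense layer transforms each frame
    before it is scored. W holds one weight vector per hidden unit;
    tanh is a hypothetical choice of non-linearity."""
    hidden = [[math.tanh(dot(f, w_j) + b_j) for w_j, b_j in zip(W, b)]
              for f in frames]
    alpha = softmax([dot(h, u) for h in hidden])
    return weighted_sum(alpha, frames), alpha

# Toy frame-wise outputs: 3 time steps, 2 dimensions each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, alpha = standard_attention_pool(frames, u=[0.5, -0.5])
z_mod, alpha_mod = modified_attention_pool(
    frames, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], u=[1.0, -1.0])
```

In both variants the weights are applied to the original frame vectors; only the way each frame is scored changes.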
Table 1: Summary of the experimental results for the various model structures. Up-Samp denotes up-sampling of the minority-class samples, and Data Aug. denotes general data augmentation. All values are recalls (%) evaluated on the development set; the last row is the UAR.
                 Baseline  Model 1  Up-Samp  Data Aug.  Model 2  Model 3  Model 4  Model 5  Model 6
Low Recall          37.97    24.05    54.43      18.98    29.11    18.98    53.16    19.98    29.11
Medium Recall       60.32    74.19    51.29      67.09    59.67    65.80    72.58    51.93    45.16
High Recall         71.10    89.51    69.97      77.05    64.87    62.88    58.38    53.54    73.37
Average Recall      56.50    64.24    57.48      54.77    51.22    49.22    61.50    48.27    49.21
where $w$, $b$, and $G$ denote the weight, the bias, and the softmax activation function, respectively.
2.2. Up-Sampling
In the Self-Assessed Affect database, the imbalanced class distribution negatively impacts recognition accuracy. Resampling alleviates this problem by balancing the class distribution [12]. There are usually two resampling methods: up-sampling and down-sampling. Since the database includes only a limited number of utterances, down-sampling, while efficient, would reduce the modeling power of our models. In our approach, we therefore choose to directly up-sample (duplicate data samples of) the minority class in the
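Up-sampling by duplication can be sketched in a few lines of standard-library Python. The text does not state how many copies were made, so balancing the minority class up to the size of the largest class is an assumption here, as are the function name `upsample_minority` and the toy utterance identifiers.

```python
import random
from collections import Counter

def upsample_minority(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority-class samples until that class
    is as large as the largest class (assumed balancing target)."""
    counts = Counter(labels)
    target = max(counts.values())
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority_label]
    n_extra = target - len(minority_idx)
    if not minority_idx or n_extra <= 0:
        return list(samples), list(labels)
    rng = random.Random(seed)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    new_samples = list(samples) + [samples[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_samples, new_labels

# Toy imbalanced set: 2 low, 5 medium, 6 high.
labels = ["low"] * 2 + ["medium"] * 5 + ["high"] * 6
samples = [f"utt_{i}" for i in range(len(labels))]
up_samples, up_labels = upsample_minority(samples, labels, "low")
```

Duplicates should be added only to the data used for fitting; copying samples into an evaluation set would distort the reported recall.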
2.3. Decision Score Fusion
To combine the various models into a better prediction, we use a confidence-based decision-level method, similar to decision score fusion, to generate our final results [13]. The confidence scores of the time-series model are obtained from its softmax layer, and the estimated probabilities from the SVM classifiers of the static model are used as their confidence scores. These confidence scores, one per class, predicted by the multiple models are then summed. The class with the highest summed confidence is our final prediction for each instance.
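The fusion rule described above amounts to a per-class sum followed by an argmax. The following standard-library Python sketch illustrates it for a single instance; the function name `fuse_scores` and the example scores are made up for illustration and are not taken from the experiments.

```python
CLASSES = ("low", "medium", "high")

def fuse_scores(model_scores):
    """Sum per-class confidence scores across models and return the index
    of the class with the highest total, together with the totals.
    model_scores: one [p_low, p_medium, p_high] list per model."""
    summed = [sum(values) for values in zip(*model_scores)]
    best = max(range(len(summed)), key=summed.__getitem__)
    return best, summed

# Hypothetical confidence scores for one utterance.
svm = [0.2, 0.5, 0.3]       # plain SVM favours "medium"
svm_up = [0.5, 0.3, 0.2]    # up-sampled SVM favours "low"
blstm = [0.6, 0.3, 0.1]     # attention BLSTM favours "low"
best, summed = fuse_scores([svm, svm_up, blstm])
prediction = CLASSES[best]
```

Because every model contributes one score per class, summing and averaging the scores lead to the same predicted class.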
3. Experimental Results and Discussions
3.1. Experimental Setup
We extract the standard ComParE feature set as our low-level descriptors (LLDs) every 10 ms. These LLDs, which consist of 130 dimensions, are used in the BLSTM model. The feature set includes voicing, energy, and spectral features and their derivatives [14]. The functionals of these LLDs serve as the static acoustic representation for the SVM model, and the output learned by the attention BLSTM with the LLDs as input constitutes the time-series model.
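To make the distinction between frame-level LLDs and utterance-level functionals concrete, the sketch below maps a variable-length sequence of frames to a fixed-length vector using four simple statistics per dimension. This is only an illustration of the idea: the actual ComParE 16 set applies a much larger bank of functionals to reach 6373 dimensions, and the function name `simple_functionals` and the toy values are assumptions.

```python
import statistics

def simple_functionals(frames):
    """Map a (T x D) list of frame-level descriptors to a fixed-length
    vector of 4 * D values: mean, standard deviation, min and max of
    each dimension over time."""
    features = []
    for dim_values in zip(*frames):  # iterate over the D dimensions
        features.extend([
            statistics.fmean(dim_values),
            statistics.pstdev(dim_values),
            min(dim_values),
            max(dim_values),
        ])
    return features

# Two toy utterances of different length, 2 dimensions per frame.
short_utt = [[0.0, 1.0], [2.0, 3.0]]
long_utt = [[1.0, 1.0]] * 5
static_short = simple_functionals(short_utt)
static_long = simple_functionals(long_utt)
```

Both utterances yield vectors of the same length regardless of duration, which is what allows a static classifier such as an SVM to consume them, whereas the BLSTM consumes the frame sequence directly.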
The architecture of our BLSTM models is a bidirectional LSTM layer with 64 cells (32 for each direction) followed by a fully-connected layer with 64 nodes. The activation function is ReLU, and 50% dropout [15] is applied to the fully-connected layer to prevent over-fitting. The parameters of the BLSTM models are optimized with a learning rate of 0.0005, a batch size of 256, and gradient clipping at 1 to limit the magnitude of the gradient during training. We compare our recognition results across the following list of models, and all evaluation results are computed on the development set using the metric of unweighted average recall (UAR):
Model 1 : SVM
Model 2 : BLSTM method with Attention
Model 3 : B-GRU method with Attention
Model 4 : BLSTM + Modified Attention
Model 5 : Input Fc + BLSTM + Attention
Model 6 : Input Fc + BLSTM + Modified Attention
The Input Fc means that the input low-level descriptors are passed through a fully-connected layer before being fed into the BLSTM.
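Since all models are compared by unweighted average recall, a short standard-library Python sketch of the metric may help. UAR is the mean of the per-class recalls, so each class counts equally regardless of its size. The function name and the toy labels below are illustrative only.

```python
def unweighted_average_recall(y_true, y_pred):
    """Return (UAR, per-class recall dict). Each class present in y_true
    contributes equally to the average, whatever its sample count."""
    total = {}
    correct = {}
    for truth, pred in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == pred else 0)
    recalls = {c: correct[c] / total[c] for c in total}
    uar = sum(recalls.values()) / len(recalls)
    return uar, recalls

# Toy example: 2 low, 4 medium, 4 high utterances.
y_true = ["low"] * 2 + ["medium"] * 4 + ["high"] * 4
y_pred = ["low", "medium"] + ["medium"] * 3 + ["high"] + ["high"] * 4
uar, recalls = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the plain accuracy is 0.8 while the UAR is 0.75, because the single error on the small low class costs half of that class's recall.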
3.2. Experimental Results
Table 1 summarizes the performance of each model. In short, two classification models are used in our work: SVM and BLSTM. We observe that the SVM is better on medium and high class recall but performs poorly on the low class. Up-sampling the data for the SVM helps improve the recall rate on the low class. The BLSTM method, on the other hand, performs well on the low and medium classes but not on the high class.
Due to the difference in these modeling characteristics, we propose fusing the static and time-series models. The final fusion model, determined empirically, is:
Fusion : Model 1 + Up-sampled Model 1 + Model 4
After fusing these three models (SVM, SVM-with-Upsample, BLSTM-Modified-Attention), we obtain the best recognition rates. The confusion matrix of this model on the development set is shown in Figure 2. In summary, our best fused model obtains 62.94% and 67.04% unweighted average recall (UAR) in the three-class recognition task of the Self-Assessed Affect Sub-Challenge (as shown in Table 2). We obtain a 6.24% and 1.04% absolute improvement over the best baseline provided by the
Table 2: Comparison between the baseline model and our best fused model (Model 1 + Up-sampled Model 1 + Model 4)
            Baseline   Our best fused model
Dev UAR     56.7%      62.94%
Test UAR    66.0%      67.04%
Figure 2: Confusion Matrix of the Best Fused Model on the Development Set
3.3. Model Comparison and Analysis
In this section, we provide various comparisons between the different models used in our work.
3.3.1. Model 1 vs. Baseline
The Baseline model uses ComParE 2016 functional features to train a linear SVM model. The imbalanced class distribution in this sub-challenge leads to worse classification of the minority class (low). From Table 1, we observe an improvement in the recall of class low (24.05% to 54.43%) with the up-sampling method. However, the recall of classes high and medium drops compared with the original method without up-sampling; there is a trade-off between low and medium/high performance. Finally, data augmentation here means generating data samples (not specific to a particular class) by corrupting the original data samples with Gaussian noise. This method introduces more noise into our dataset and effectively decreases the recognition accuracy.
3.3.2. Model 2 vs. Model 3
We further compare the performance of the bi-directional GRU and the bi-directional LSTM, each with a standard attention mechanism. While the GRU cells show a faster convergence rate during training, the model with BLSTM cells obtains a 2% to 3% higher UAR on average than the bi-directional GRU. The bidirectional LSTM with an attention layer not only achieves a high UAR of 61.5% but also shows better performance in both low and medium class recall rates.
3.3.3. Extension of Fully-Connected Layer
We also analyze the effect of using an additional fully-connected layer in our recognition architecture.
Model 2 vs. Model 4
Using a dense layer in the computation of the attention weights brings about a 5% to 8% improvement in UAR when comparing the BLSTM with modified attention against the BLSTM with the standard attention mechanism.
Model 2 vs. Model 5
In this comparison, we examine the difference in recognition rates obtained by placing the fully-connected layer in the attention weight computation versus right after the input LLDs, before feeding them into the BLSTM. Model 5 shows a decrease of around 10% in the recall rate of the low class, which indicates that the fully-connected layer should be placed in the attention mechanism, not directly at the input space.
Model 5 vs. Model 6
Comparing Model 5 and Model 6, we see that adding the additional dense layer is indeed beneficial in obtaining higher recognition rates, although, in general, these two models do not perform well because of the initial dense layer applied to the inputs before they are fed into the BLSTM.
4. Conclusions and Future Works
In this work, we present our recognition framework for the Self-Assessed Affect Sub-Challenge. Our framework is composed of two parts: standard utterance-level baseline ComParE 16 features with SVMs trained on the original and up-sampled databases, and a BLSTM model with a novel modified attention mechanism. To alleviate the issue of data imbalance, we employ a straightforward up-sampling technique. This framework achieves improved recognition rates on both the development set and the blind test set. The introduction of the modified attention mechanism, i.e., adding a fully-connected layer to the computation of the attention weights, is beneficial for using sequence-based models in affect recognition from speech.
In our future work, we will continue to investigate advanced methods for integrating static and dynamic acoustic representations and learning models for complex human state and trait recognition. Since many of these higher-level internal states and traits are often manifested in complex ways in the recorded behavior signals, additional technical endeavor is required to develop automatic systems that consistently and reliably track these attributes. Continued advances in computational paralinguistics (e.g., recognition of complex affective phenomena) will further help create a tangible impact, especially in application domains of affective disorders and other related mental health conditions.
5. References
[1] B. Schuller, S. Steidl, A. Batliner, E. Bergelson, J. Krajewski, C. Janott, A. Amatuni, M. Casillas, A. Seidl, M. Soderstrom et al., “The Interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring,” in Computational Paralinguistics Challenge (ComParE), Interspeech 2017, 2017, pp. 3442–3446.
[2] J. Drapeau, N. Gosselin, L. Gagnon, I. Peretz, and D. Lorrain, “Emotional recognition from face, voice, and music in dementia of the Alzheimer type,” Annals of the New York Academy of Sciences, vol. 1169, no. 1, pp. 342–345, 2009.
[3] G. A. Gates, A. Beiser, T. S. Rees, R. B. D’Agostino, and P. A. Wolf, “Central auditory dysfunction may precede the onset of clinical dementia in people with probable Alzheimer’s disease,” Journal of the American Geriatrics Society, vol. 50, no. 3, pp. 482–488, 2002.
[4] J. I. Alcántara, E. J. Weisblatt, B. C. Moore, and P. F. Bolton, “Speech-in-noise perception in high-functioning individuals with autism or Asperger’s syndrome,” Journal of Child Psychology and Psychiatry, vol. 45, no. 6, pp. 1107–1114, 2004.
[5] S. K. Mittal, L. Ahern, E. Flaster, J. K. Maesaka, and S. Fishbane, “Self-assessed physical and mental function of haemodialysis patients,” Nephrology Dialysis Transplantation, vol. 16, no. 7, pp. 1387–1394, 2001.
[6] Y. Benyamini, E. L. Idler, H. Leventhal, and E. A. Leventhal, “Positive affect and function as influences on self-assessments of health: Expanding our view beyond illness and disability,” The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, vol. 55, no. 2, pp. P107–P116, 2000.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[11] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2227–
[12] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[13] A. Sinha, H. Chen, D. Danu, T. Kirubarajan, and M. Farooq, “Estimation and decision fusion: A survey,” Neurocomputing, vol. 71, no. 13-15, pp. 2650–2656, 2008.
[14] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer, “On the acoustics of emotion in audio: what speech, music, and sound have in common,” Frontiers in Psychology, vol. 4, p. 292, 2013.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
... Recurrent Stage: Gated recurrent units (GRU) and long short-term memory units (LSTM) [54] are the two most common recurrent types in paralinguistics [9], [18], [19], [55], [56]. Unidirectional [9], [18] as well as bidirectional [19], [56] networks are popular. ...
... Recurrent Stage: Gated recurrent units (GRU) and long short-term memory units (LSTM) [54] are the two most common recurrent types in paralinguistics [9], [18], [19], [55], [56]. Unidirectional [9], [18] as well as bidirectional [19], [56] networks are popular. We used the CuDNN implementations 3 to reduce the training time of the recurrent units. ...
... Temporal Integration Stage: The temporal integration operations were adapted from Mirsamadi et al [59]. Particularly attention pooling was widely employed in the INTERSPECCH 2018 ComParE challenge [55], [56]. All pooling types incorporate outputs at every time step, while "Last Step" means that only the output of the last time step is forwarded to the next layer (alias "many-to-one"-prediction). As for convolutional stages last-step-integration is not applicable as units do not carry internal states, we used flattening corresponding to the vertical integration operation of VGGNet [41]. ...
Full-text available
In this study we compared various neural network types for the task of automatic infant vocalization classification, i.e convolutional, recurrent and fully-connected networks as well as combinations of thereof. The goal was to first determine the optimal configuration for each network type to then identify the type with the highest overall performance. This investigation helps to employ neural networks more effectively to infant vocalization classification tasks, which typically offer low amounts of training data. To this end, we defined a unified neural network architecture scheme for audio classification from which we derived various network types. For each type we performed a semi-random hyperparameter search which employed regression trees to both focus the search space as well as derive insights on the most influential parameters. We finally compared the test performances of the best performing configurations in an contest-like setup. Our key findings are: (1) Networks with convolutional stages reached the highest performance, regardless of being combined with fully-connected or recurrent layers. (2) The most influential architectural hyperparameter for all types were the integration operations for reducing tensor dimensionality between network stages. The best performing configurations reached test performances of 75% unweighted average recall, surpassing previously published benchmarks.
... Rather than using only a single type of information, we implemented a fusion strategy on the various sources of information to make a decision in classification, and a more sensible conclusion could be reached. The fusion methods of speech emotion features can be roughly separated into two categories: feature-level fusion [28,29] and decision-level fusion [30,31]. In feature-level fusion, emotion features extracted by different models are combined to generate a more informative representation for classification. ...
... Both RNN and SVM generated the probabilities of concerned classes, which were used as the confidence score for these two models. The confidence scores of RNN and SVM were averaged for each concerned class as the fusion confidence score for the integrated model [30]. Considering the insufficiency of speech emotion corpus, in this study, we adopted decision-level fusion to achieve high performance on the emotion recognition task. ...
... After we obtained the outputs from the three different classifiers -each used a different form of feature for certain speech utterances as input, we incorporated the three models to improve the ultimate recognition performance. Specifically, we developed confidencebased decision-level fusion using the sum of confidence scores referred to the study [30]. The confidence scores were separately generated from the sof tmax layer in three individual classifiers. ...
Full-text available
Speech emotion recognition plays an increasingly important role in emotional computing and is still a challenging task due to its complexity. In this study, we developed a framework integrating three distinctive classifiers: a deep neural network (DNN), a convolution neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (i.e., angry, happy, neutral and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) on LLDs were passed to RNN, CNN, and DNN, separately. Three individual models of LLD-RNN, MS-CNN, and HSF-DNN were obtained. In the models of MS-CNN and LLD-RNN, the attention mechanism based weighted-pooling method was utilized to aggregate the CNN and RNN outputs. To effectively utilize the interdependencies between the two approaches of emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in these three models to acquire generalized features by simultaneously operating classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to integrate the power of different classifiers in recognizing different emotional states. Three experiments on emotion recognition based on the IEMOCAP corpus were conducted. Our experimental results show that the weighted pooling method based on attention mechanism endowed the neural networks with the capability to focus on emotionally salient parts. The generalized features learned in the multi-task learning helped the neural networks to achieve higher accuracies in the tasks of emotion classification. Furthermore, our proposed fusion system achieved weighted accuracy of 57.1% and unweighted accuracy of 58.3%, which were significantly higher than those of each individual classifier. 
The effectiveness of the proposed approach based on classifier fusion was thus validated.
... a) OLR-LID tasks • Augmentation: Augmentation (e.g., velocity, volume perturbations) is widely used. Some systems use background noise extracted from the training data, mp3/mp4a [20], GRU, BLSTM [21], attention structure, attentive pooling, Global Context Network (GC-Net) [22] and SpecAugment [23], NetVLAD [24], Vector of Locally Aggregated Descriptors (VLAD) [25]. • Auxiliary information: The introduction of ASR to help language recognition is investigated by top teams. ...
... In 2019, Jalal et al. [60] implemented a hybrid model based on BLSTM, a 1D Conv-Cap, and capsule routing layers for SER. Ng and Liu [61] used a capsule-network-based model to encode spatial information from speech spectrograms and analyze the performance under various loss functions on several datasets. ...
Full-text available
Recognizing the speaker’s emotional state from speech signals plays a very crucial role in human-computer interaction (HCI). Nowadays, numerous linguistic resources are available, but most of them contain samples of a discrete length. In this article, we address the leading challenge in Speech Emotion Recognition (SER), which is how to extract the essential emotional features from utterances of a variable length. To obtain better emotional information from the speech signals and increase the diversity of the information, we present an advanced fusion-based dual-channel self-attention mechanism using convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) networks. We extracted six spectral features (Mel-spectrograms, Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square). The Conv-Cap module was used to obtain Mel-spectrograms, while the Bi-GRU was used to obtain the rest of the spectral features from the input tensor. The self-attention layer was employed in each module to selectively focus on optimal cues and determine the attention weight to yield high-level features. Finally, we utilized a confidence-based fusion method to fuse all high-level features and pass them through the fully connected layers to classify the emotional states. The proposed model was evaluated on the Berlin (EMO-DB), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and Odia (SITB-OSED) datasets to improve the recognition rate. During experiments, we found that our proposed model achieved high weighted accuracy (WA) and unweighted accuracy (UA) values, i.e., 90.31% and 87.61%, 76.84%, and 70.34%, and 87.52% and 86.19%, respectively, demonstrating that the proposed model outperformed the state-of-the-art models using the same datasets
... Some systems used background noise extracted from the training data, white noise and random artificial band-pass filters. Most systems applied the SpecAugment strategy [23] [25], GRU, BLSTM [26], attention structure, attentive pooling, Global Context Network (GC-Net) [27], NetVLAD [28] or inspired Vector of Locally Aggregated Descriptors (VLAD) [29]. • Auxiliary information: The introduction of ASR to help language recognition was investigated by top teams (two out of five top teams used E2E ASR technologies). ...
... Some systems used background noise extracted from the training data, white noise and random artificial band-pass filters. Most systems applied the SpecAugment strategy [24] [26], GRU, BLSTM [27], attention structure, attentive pooling, Global Context Network (GC-Net) [28], NetVLAD [29] or inspired Vector of Locally Aggregated Descriptors (VLAD) [30]. • Auxiliary information: The introduction of ASR to help language recognition was investigated by top teams (two out of five top teams used E2E ASR technologies). ...
The fifth Oriental Language Recognition (OLR) Challenge focuses on language recognition in a variety of complex environments to promote its development. The OLR 2020 Challenge includes three tasks: (1) cross-channel language identification, (2) dialect identification, and (3) noisy language identification. We choose Cavg as the principal evaluation metric, and the Equal Error Rate (EER) as the secondary metric. There were 58 teams participating in this challenge, and one third of the teams submitted valid results. Compared with the best baseline, the Cavg values of the top-ranked system for the three tasks were relatively reduced by 82%, 62% and 48%, respectively. This paper describes the three tasks, the database profile, and the final results. We also outline the novel approaches that improve the performance of language recognition systems most significantly, such as the utilization of auxiliary information.
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability in speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal SER using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn the context-sensitive dependencies from speech. Specifically, the BLSTM layer is applied to learn long-term dependencies and utterance-level contextual information, and the multi-head self-attention layer makes the model focus on the features that are most related to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied for learning general and thematic features from text. Finally, a multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains an absolute improvement of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
Speech-based deception detection using deep learning is one of the technologies that may enable a deception detection system with a high recognition rate in the future. Multi-network feature extraction technology can effectively improve the recognition performance of the system, but due to the limited labeled data and the lack of effective feature fusion methods, the performance of the network is limited. Based on this, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. Firstly, the static features of large amounts of unlabeled speech data are input into DAE for unsupervised training. Secondly, the frame-level features and static features of a small amount of labeled speech data are simultaneously input into the LSTM network and the encoded output part of DAE for joint supervised training. Finally, a feature fusion algorithm based on an attention mechanism is proposed, which can obtain the optimal feature set in the training process. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods, and the model can achieve advanced performance with only a small amount of labeled data.
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
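The attention-based pooling over time described in this abstract can be illustrated with a minimal sketch. It assumes a simple dot-product scoring between each frame vector and a parameter vector `u`; the names `attention_pool`, `frames`, and `u` are hypothetical, and in a real model `u` and the frame vectors would be learned by the recurrent network rather than fixed by hand.

```python
import math

def attention_pool(frames, u):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors, each a list of D floats
            (stand-ins for recurrent-layer outputs; an assumption here).
    u:      attention parameter vector of length D (learned in practice).
    """
    # One scalar relevance score per frame: dot product with u.
    scores = [sum(f_d * u_d for f_d, u_d in zip(f, u)) for f in frames]
    # Numerically stable softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Utterance-level representation: attention-weighted sum of frames.
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With `u` set to all zeros every frame receives the same weight, so the result reduces to plain mean pooling; non-zero `u` shifts weight toward frames that score higher.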
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning either of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow's pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that by selection of appropriate descriptors, cross-domain arousal and valence regression is feasible, achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
Longitudinal data from 851 elderly residents of a retirement community (mean age = 73 years) were used to examine the correlates of self-assessments of health (SAH) and the predictors of changes in SAH over several follow-up periods ranging from 1 to 5 years. The authors hypothesized that indicators of positive health, including feelings of energy and positive mood, social support, and active functioning, are as important in determining current and future SAH as negative indicators such as disease history, disability, medication, and negative mood. Results of cross-sectional and longitudinal analyses showed that functional ability, medication use, and negative affect were salient to people judging their health, but positive indicators of activity and mood had an even stronger, independent effect. These findings show the importance of attending to the full illness-wellness continuum in studying people's perceptions of health.
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
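The train-time and test-time behaviour summarized in this abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the function names and the `keep_prob` parameter are assumptions, and the test-time scaling is applied to activations, which has the same effect on the next layer's input as shrinking the outgoing weights.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training: keep each unit with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test: use all units, scaled by keep_prob to match the expected
    training-time activation (approximates averaging thinned networks)."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)  # fixed seed so the sampled mask is repeatable
hidden = [0.5, 1.0, 2.0, 4.0]
thinned = dropout_train(hidden, keep_prob=0.5, rng=rng)
averaged = dropout_test(hidden, keep_prob=0.5)
```

Many current libraries instead scale the kept units by `1 / keep_prob` during training and leave test-time activations untouched; the two conventions give the same expected activations.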
Data fusion has been applied to a large number of fields and the corresponding applications utilize numerous mathematical tools. This survey focuses on some aspects of estimation and decision fusion. In estimation fusion, we discuss the development of fusion architectures and algorithms with emphasis on the cross-correlation between local estimates from different sources. On the other hand, the techniques for decision fusion are discussed with emphasis on the classifier combining techniques. In addition, methods using neural networks for data fusion are briefly discussed.
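One of the simplest classifier-combining schemes in decision fusion is a weighted average of per-class scores from several models. The sketch below is only an illustration under that assumption; `fuse_scores`, the example scores, and the weights are hypothetical and are not taken from the survey.

```python
def fuse_scores(score_lists, weights):
    """Weighted-average fusion of class scores from several classifiers.

    score_lists: one list of per-class scores per classifier
                 (all lists must cover the same classes in the same order).
    weights:     one non-negative weight per classifier.
    Returns the fused scores and the index of the winning class.
    """
    total = sum(weights)
    n_classes = len(score_lists[0])
    fused = [
        sum(w * s[c] for w, s in zip(weights, score_lists)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: fused[c])
    return fused, label

# Two hypothetical classifiers scoring three classes (low, medium, high).
fused, label = fuse_scores([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]], [1.0, 3.0])
```

Here the second classifier is trusted three times as much as the first, so the fused decision follows its preference for the middle class. In practice the weights would typically be tuned on a development set.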
With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
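Among the sampling methods used for imbalanced learning, random oversampling of the minority classes is perhaps the most basic. A minimal sketch, assuming in-memory lists of samples and labels (the name `random_oversample` and the toy data are hypothetical):

```python
import random
from collections import Counter

def random_oversample(samples, labels, rng):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        # Candidates to duplicate: the original samples of this class.
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

rng = random.Random(0)  # fixed seed for repeatable duplicates
x, y = random_oversample([10, 11, 12, 20], ["a", "a", "a", "b"], rng)
```

Oversampling should be applied to the training partition only; duplicating samples before splitting would leak copies of training items into the evaluation data.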
Persons with dementia of the Alzheimer type (DAT) are impaired in recognizing emotions from face and voice. Yet clinical practitioners use these mediums to communicate with DAT patients. Music is also used in clinical practice, but little is known about emotional processing from music in DAT. This study aims to assess emotional recognition in mild DAT. Seven patients with DAT and 16 healthy elderly adults were given three tasks of emotional recognition for face, prosody, and music. DAT participants were only impaired in the emotional recognition from the face. These preliminary results suggest that dynamic auditory emotions are preserved in DAT.