Enforcing Semantic Consistency for Cross Corpus Valence Regression from Speech Using Adversarial Discrepancy Learning
Gao-Yi Chao1,2, Yun-Shao Lin1,2, Chun-Min Chang1,2, Chi-Chun Lee1,2
1Department of Electrical Engineering, National Tsing Hua University, Taiwan
2MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan
cclee@ee.nthu.edu.tw
Abstract
Issues of mismatch between databases remain a major challenge in performing emotion recognition on an unlabeled target corpus from labeled source data. While studies have shown that aligning the source and target data distributions to learn a common feature space can partially mitigate these issues, they neglect the effect of distortion in emotion semantics across different databases. This distortion is especially crucial when regressing higher level emotion attributes such as valence. In this work, we propose a maximum regression discrepancy (MRD) network, which enforces cross corpus semantic consistency by learning a common acoustic feature space that minimizes the discrepancy on maximally-distorted samples through adversarial training. We evaluate our framework on two large emotion corpora, the USC IEMOCAP and the MSP-IMPROV, for the task of cross corpus valence regression from speech. Our MRD demonstrates a significant 10% and 5% improvement in concordance correlation coefficients (CCC) compared to baseline source-only methods, and we also show that it outperforms two state-of-the-art domain adaptation techniques. Further analysis reveals that our model is more effective in reducing semantic distortion on low valence than on high valence samples.
Index Terms: valence, domain adaptation, adversarial learning,
cross corpus, semantic consistency
1. Introduction
Deep learning algorithms, with their complex and non-linear learning mechanisms, have brought impressive advancement to speech emotion recognition (SER) technology in recent years. While powerful, such data-driven learning methodologies can suffer from poor generalizability due to the phenomenon
known as dataset bias or domain shift [1]. This issue of non-
robustness is especially evident when learning to perform cross
corpus emotion recognition. Corpus-dependent idiosyncratic factors, such as gender distribution, languages used, recording environments, and even interaction contexts, create a large mismatch between testing data (target domain) and training data (source domain). Instead of painstakingly collecting labeled data to train a predictor for each possible target scenario, domain adaptation methods have been proposed to com-
pensate for the degradation in SER performances when trans-
ferring the learned knowledge from labeled source domain to
unlabeled target domain [2, 3].
Conventional SER domain adaptation approaches are based
on aligning data distributions between target and source domain
[4, 5]. For example, Song et al. introduced the use of maximum
mean discrepancy (MMD) proposed by Borgwardt et al. in the
optimization procedure of non-negative matrix factorization to
address the SER domain adaptation problem [6]. Other approaches have pointed in the direction that deliberately learning an indifferentiable common feature space between source and target could mitigate domain-specific idiosyncratic factors when performing source-to-target emotion recognition. For example, Abdelwahab et al. used a gradient reversal layer in a three-database scenario to predict the emotion attributes of arousal, valence, and dominance [7]. Adversarial learning mechanisms have
also been utilized in general domain adaptation. For example,
Tzeng et al. exploited GAN-based loss and untied weight shar-
ing to reduce the difference between the source and the target
[8], and Laradji et al. extended the idea by adding triplet loss
and metric learning to improve the state-of-the-art unsupervised
adaptation results for computer vision task [9].
The major drawback of these algorithms is the assumption that, by aligning target and source emotion data distributions, the learned target feature representation can directly be used to transfer the source emotion labels to the correct target labels. However, mapping the target and source data to an indifferentiable common space does not enforce any emotion semantic consistency, i.e., source features of high valence data may be mapped to target features of low valence data. The reason for this semantic distortion may be that the data distributions of the databases are completely different, as in the visualization presented in Figure 7 of [7]. In this work, our goal is to mitigate this particular issue of
emotional semantic distortion in cross corpus valence regres-
sion from speech. The idea is inspired by an image classifi-
cation work by Saito et al. [10]; they showed that the severity of
distortion can be estimated with quantified target discrepancy,
and by incorporating this discrepancy in the procedure of learn-
ing domain-indifferentiable feature space, it can improve the
overall recognition performance.
Specifically, in this work, we propose a maximum regres-
sion discrepancy (MRD) network to perform cross corpus va-
lence regression from speech. Our MRD enforces semantic
consistency when learning the common acoustic feature space
with an adversarial discrepancy mechanism, i.e., minimizing the
maximum cross corpus discrepancy. We evaluate our frame-
work on two databases, i.e., the IEMOCAP [11] and the MSP-
IMPROV [12], to perform source to target valence regression.
Our method obtains an absolute gain of 10% and 5% in concordance correlation coefficients (CCC) over the source-only baseline. We compare MRD with two other domain adaptation methods without a consistency constraint, i.e., Deep Coral (correlation alignment for deep domain adaptation) [13] and DANN (unsupervised domain adaptation by backpropagation) [14]. MRD demonstrates superior emotion regression results over these two methods. Finally, analysis shows that MRD reduces the semantic distortion more on low valence samples than on high valence samples. The rest of the paper is organized as follows: section 2 describes the databases and our MRD network, section 3 presents the experimental setup and results, and section 4 concludes with future work.
[Figure 1: (a) regression pipeline: raw wave → feature extraction → MRD → prediction; (b) adversarial training steps — Step 1: train both regressors and the encoder to fit the source samples; Step 2: maximize the discrepancy on the target (fix E); Step 3: minimize the discrepancy on the target (fix R1, R2).]
Figure 1: Adversarial training steps of our MRD. Step1 learns two diverse valence regressors on the source data. Step2 maximizes the
discrepancy by changing the regressors to detect those highly-distorted target representations. Step3 learns the encoder to minimize
the discrepancy through adjusting the projected common space to reduce emotional semantic distortion. After MRD training, we finally
regress the valence value of target domain sample as the average of the two regressors.
2. Database and Method
2.1. Emotion Databases
We use two different emotion databases in our study, the USC-IEMOCAP [11] and the MSP-IMPROV [12]. These two databases are among the most commonly-used English databases in cross corpus emotion recognition research (e.g.,
[15, 16]). Both databases were collected in a similar setting,
i.e., simulated natural dyadic interactions between actors and
were labeled using similar schemes. We evaluate our valence
regression experiments in a cross corpus setting, i.e., using the
USC-IEMOCAP database as our source domain (training) data
with the MSP-IMPROV database as our target domain (testing)
data, and vice versa. Brief descriptions of the databases are given below.
2.1.1. The USC-IEMOCAP Corpus
The USC-IEMOCAP has about 12 hours of audiovisual data
recorded from ten actors grouped into five sessions. The record-
ings were collected using scripted and improvised settings,
which allowed the actors to express spontaneous emotional ex-
pressions driven by the context. The database was segmented
into utterances (a total of 10039 utterances). Each utterance was annotated with categorical emotion labels as well as dimensional (valence and activation) attributes with scores ranging from 1 to 5 by at least two evaluators. The valence label used in this
work is the average of the values given by the annotators.
2.1.2. The MSP-IMPROV Corpus
The MSP-IMPROV has over 9 hours of audiovisual emotion
data. It consists of six dyadic sessions. In every session, two
actors improvise scenarios in which one of them would utter
pre-defined targeted sentences. For each of these targeted sentences, four emotional scenarios were created to contextualize the sentence to elicit happy, angry, sad, and neutral emotions. The MSP-
IMPROV corpus includes the target sentences, other sentences
during the improvisations, and the natural interactions between
actors during the breaks. Similarly, the MSP-IMPROV was
segmented into utterances (8386 utterances in total). Each ut-
terance was annotated with categorical emotion labels as well
as dimensional (valence and activation) attributes with scores ranging from 1 to 5 using a crowdsourced evaluation scheme [17].
The valence label used in this work is the average of the values
given by the annotators.
2.2. Acoustic Features
In this work, we use the IS10-paraling feature set from the INTERSPEECH 2010 Paralinguistic Challenge, extracted with the openSMILE toolkit [18]. This acoustic feature set consists of spectral, prosodic, energy, and voicing-related low-level descriptors (LLDs) further processed by computing various statistical functionals (a total dimension of 1582); a more detailed description can be found in the previous work [19]. We also z-normalize this feature set separately for each corpus.
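As a concrete illustration of this preprocessing step, the sketch below z-normalizes a pre-extracted 1582-dimensional IS10 feature matrix separately for each corpus. The array names and file paths are hypothetical, and we assume the functionals have already been exported by openSMILE; this is a sketch of the normalization only, not the paper's exact pipeline.

```python
import numpy as np

def znorm_per_corpus(features: np.ndarray) -> np.ndarray:
    """Z-normalize each feature dimension within a single corpus.

    `features` is a (num_utterances x 1582) matrix of IS10-paraling
    functionals, assumed to be pre-extracted with openSMILE.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + 1e-8)  # epsilon avoids division by zero

# Hypothetical usage: each corpus is normalized with its own statistics.
# iemocap_feats = znorm_per_corpus(np.load("iemocap_is10.npy"))
# msp_feats = znorm_per_corpus(np.load("msp_improv_is10.npy"))
```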
2.3. Maximum Regression Discrepancy (MRD) Network
In this study, the main task is defined as a regression prob-
lem to estimate valence from speech in a cross corpus setting,
i.e., source to target regression. Figure 1 shows our entire re-
gression framework and illustrates the adversarial discrepancy
learning procedure. The MRD network is learned using labeled
data, denoted as {Xs, Ys}, from the source domain and unlabeled data, denoted as {Xt}, from the target domain. We train a feature encoder network E that maps inputs x_s and x_t onto a common space, and two different regressors R1 and R2 that are trained to predict the valence labels from the E-encoded feature output using labeled data. f1(x) and f2(x) are used to denote the regressed outputs for input x obtained by R1 and R2, respectively.
The key idea of our framework is based on adversarial discrepancy learning, i.e., learning an encoder E that would minimize the maximal semantic distortion between corpora.
            Source: IEM → Target: MSP                              Source: MSP → Target: IEM
      Source-Only  Train-on-Target  DANN   Deep Coral  MRD    Source-Only  Train-on-Target  DANN   Deep Coral  MRD
PR       21.15          42.62       22.09     25.30   29.11      22.86          43.57       22.88     24.20   24.83
CCC      17.77          38.50       21.28     23.09   28.18      19.22          39.56       21.61     14.25   24.33
Table 1: Summary of our Experiment I. It lists both the Pearson correlation (PR) and the concordance correlation (CCC) of the Source-only, Train-on-Target, DANN, Deep Coral, and our MRD models. [IEM: USC-IEMOCAP corpus, MSP: MSP-IMPROV corpus]
The two regressors, R1 and R2, are learned from the labeled source data and can predict source samples reliably; however, the problem of domain shift will degrade the performance due to semantic distortion. The discrepancy distortion can be estimated using the inconsistency loss [20] obtained from the regressor outputs, f1(x) and f2(x), defined below:
$$L_{dis}(X_t) = \frac{1}{K}\sum_{k=1}^{K} \left| f_1(x_{t_k}) - f_2(x_{t_k}) \right| \qquad (1)$$
K denotes the number of batches. Note that R1 and R2 are initialized differently, i.e., using a different number of layers, to obtain diverse regressors. We then train the regressors to maximize the inconsistency loss in order to identify those highly distorted samples under a fixed E. This particular learning step is important to detect those hidden distorted representations; otherwise, the two regressors tend to converge to similar outputs. Then, we finally optimize the MRD network to minimize the inconsistency loss under the same regressors, which is equivalent to narrowing the distance between similar target samples and source domain samples in the E-encoded space to ensure the encoded target features preserve the least-distorted emotion information.
After the learning of MRD converges, the regressed value, r_t, for a test sample, x_t, is obtained using the following:

$$r_t = \frac{f_1(x_t) + f_2(x_t)}{2} \qquad (2)$$
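The two quantities above can be expressed directly in code. The following PyTorch sketch (our own naming; the paper does not provide code) implements the inconsistency loss of Eqs. (1)/(5) and the averaged inference of Eq. (2).

```python
import torch

def discrepancy_loss(out1: torch.Tensor, out2: torch.Tensor) -> torch.Tensor:
    """L_dis: mean absolute difference between the two regressors' outputs (Eqs. 1/5)."""
    return torch.mean(torch.abs(out1 - out2))

def predict_valence(encoder, regressor1, regressor2, x_t: torch.Tensor) -> torch.Tensor:
    """Inference on a target sample: average the two regressors' outputs (Eq. 2)."""
    with torch.no_grad():
        z = encoder(x_t)
        return 0.5 * (regressor1(z) + regressor2(z))
```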
2.3.1. Adversarial Discrepancy Learning Procedure
The training of our cross corpus MRD network requires two
regressors and one encoder to carry out the iterative adversarial
discrepancy learning. We summarize the training procedure for
each epoch as three major steps below (Figure 1 right):
Step 1: Both regressors and encoder are trained on the labeled
source samples. The loss function used in this step is the
mean square error (MSE) loss defined below:
$$\min_{E, R_1, R_2} L_{mse}(X_s, Y_s) \qquad (3)$$
Step 2: We update both regressors (R1, R2) and fix the encoder E. The inverse discrepancy loss is used to increase the discrepancy to detect distorted target samples. Additionally, the source's regression loss is added to the objective function in this step. Step 2 is learned using the objective function defined below:

$$\min_{R_1, R_2} L_{mse}(X_s, Y_s) - L_{dis}(X_t) \qquad (4)$$

$$L_{dis} = \mathbb{E}_{x_t \sim X_t}\left[ \left| f_1(x_t) - f_2(x_t) \right| \right] \qquad (5)$$
Step 3: We update the encoder, E, m times to minimize the discrepancy while fixing the regressors. The objective function is as follows:

$$\min_{E} L_{dis}(X_t) \qquad (6)$$
The hyperparameter m plays an important role in balancing the alternating min-max adversarial learning procedure between the encoder and the regressors. Adversarial training is often unstable; strategies based on using different learning rates [21] or applying a different number of updates on the generator (our encoder) and the discriminator (our regressors) [22] have been used to stabilize the training. In this work, we update the encoder three times (m = 3) in each epoch instead of just once. This parameter is determined experimentally.
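A compact sketch of one epoch of this three-step procedure is given below. The `encoder`, `r1`, `r2` modules, the optimizer over the encoder (`opt_enc`), the optimizer over both regressors (`opt_reg`), and the data loaders are all hypothetical names we introduce for illustration; this is our reading of the procedure described above, not the authors' released code.

```python
import torch

def train_mrd_epoch(encoder, r1, r2, opt_enc, opt_reg,
                    source_loader, target_loader, m: int = 3):
    """One epoch of the three-step adversarial discrepancy procedure (a sketch).

    Assumes source_loader yields (x_s, y_s) with y_s shaped (batch, 1), and
    target_loader yields unlabeled feature batches x_t.
    """
    mse = torch.nn.MSELoss()
    for (x_s, y_s), x_t in zip(source_loader, target_loader):
        # Step 1: train encoder and both regressors on labeled source data (Eq. 3).
        opt_enc.zero_grad(); opt_reg.zero_grad()
        z_s = encoder(x_s)
        loss_step1 = mse(r1(z_s), y_s) + mse(r2(z_s), y_s)
        loss_step1.backward()
        opt_enc.step(); opt_reg.step()

        # Step 2: update the regressors to maximize the target discrepancy
        # while keeping source regression accurate, with E fixed (Eq. 4).
        opt_reg.zero_grad()
        with torch.no_grad():  # encoder is frozen in this step
            z_s = encoder(x_s)
            z_t = encoder(x_t)
        loss_step2 = (mse(r1(z_s), y_s) + mse(r2(z_s), y_s)
                      - torch.mean(torch.abs(r1(z_t) - r2(z_t))))
        loss_step2.backward()
        opt_reg.step()

        # Step 3: update the encoder m times to minimize the target
        # discrepancy with the regressors fixed (Eq. 6).
        for _ in range(m):
            opt_enc.zero_grad()
            z_t = encoder(x_t)
            loss_step3 = torch.mean(torch.abs(r1(z_t) - r2(z_t)))
            loss_step3.backward()
            opt_enc.step()  # only encoder parameters are stepped here
```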
3. Experimental Setup and Results
We set up two different experiments in this work. Experiment 1 provides a comparison of the following models on the unsupervised domain-adapted speech valence regression tasks between the MSP-IMPROV and the IEMOCAP:
Source-only: This is the baseline. The regressor network is trained only on the source domain and regresses directly on the target domain without any adaptation.
Train-on-Target: This is the upper bound of the model. The regressor network is trained on the target domain and tested on the target domain (no adaptation is needed) in a speaker-independent cross-validation setting.
DANN: This is an unsupervised domain adaptation method through backpropagation, based on the method proposed by Ganin et al. [14].
Deep Coral: This is a method based on correlation
alignment for deep unsupervised domain adaptation pro-
posed by Sun et al. [13].
MRD: Our proposed maximum regression discrepancy
network.
The parameters of our MRD network are listed below: the numbers of layers for the encoder and the two regressors are 4, 4, and 5, respectively, and all hidden layers have a width of 1024. We use batch normalization, dropout (p = 0.5), and the SELU activation function for all layers. The number of epochs and the learning rate are determined according to the task. In this study, the maximum number of epochs is 100 and the learning rate is chosen in the range of 5e-4 to 5e-5.
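A sketch of these modules under the listed hyperparameters is shown below; the ordering of batch normalization, SELU, and dropout inside each block, and whether the scalar output layer counts toward the stated depth, are our assumptions rather than details given in the paper.

```python
import torch.nn as nn

def _block(in_dim: int, out_dim: int, p: float = 0.5) -> nn.Sequential:
    """One hidden block: linear layer, batch norm, SELU, and dropout (ordering assumed)."""
    return nn.Sequential(nn.Linear(in_dim, out_dim),
                         nn.BatchNorm1d(out_dim),
                         nn.SELU(),
                         nn.Dropout(p))

def build_mrd(feat_dim: int = 1582, width: int = 1024):
    """Encoder with 4 hidden blocks; regressors R1 and R2 with 4 and 5 blocks
    followed by a scalar valence output."""
    encoder = nn.Sequential(_block(feat_dim, width),
                            *[_block(width, width) for _ in range(3)])
    r1 = nn.Sequential(*[_block(width, width) for _ in range(4)], nn.Linear(width, 1))
    r2 = nn.Sequential(*[_block(width, width) for _ in range(5)], nn.Linear(width, 1))
    return encoder, r1, r2
```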
The other comparison models are implemented using similar architectures to serve as a fair comparison. Each experiment is repeated 10 times and the average of the results is reported in this work. All results are evaluated in terms of the Pearson correlation coefficient (PR) and the concordance correlation coefficient (CCC) between the ground-truth labels and the regressed values. CCC is defined as:

$$\rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \qquad (7)$$
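For reference, Eq. (7) can be computed from the ground-truth and regressed values as in the short sketch below (function name ours); note that the numerator 2ρσxσy equals twice the covariance.

```python
import numpy as np

def concordance_cc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between ground truth and predictions (Eq. 7)."""
    mu_x, mu_y = y_true.mean(), y_pred.mean()
    var_x, var_y = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_x) * (y_pred - mu_y))  # rho * sigma_x * sigma_y
    return float(2 * cov / (var_x + var_y + (mu_x - mu_y) ** 2))
```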
Experiment 2 provides a visualization on the distribution
of the encoded space to analyze the generalization ability of
MRD and its effectiveness as a function of the valence score.
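A minimal sketch of this visualization, assuming the encoded source and target features are available as numpy arrays and using scikit-learn's t-SNE with matplotlib (our choice of tooling; the paper only names t-SNE), is given below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_encoded_space(z_source: np.ndarray, z_target: np.ndarray, out_path: str = "tsne.png"):
    """Project source and target encoded features to 2D with t-SNE and plot them together."""
    z_all = np.vstack([z_source, z_target])
    z_2d = TSNE(n_components=2, random_state=0).fit_transform(z_all)
    n_src = len(z_source)
    plt.scatter(z_2d[:n_src, 0], z_2d[:n_src, 1], s=5, label="source (IEM)")
    plt.scatter(z_2d[n_src:, 0], z_2d[n_src:, 1], s=5, label="target (MSP)")
    plt.legend()
    plt.savefig(out_path)
```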
[Figure 2 panels: Source-only vs. MRD encoded spaces, plotted separately for low and high valence samples of IEM and MSP; MSP Session 5 (larger domain mismatch, CCC 12.79 vs. 27.53) and MSP Session 6 (smaller domain mismatch, CCC 20.72 vs. 27.96).]
Figure 2: Target domain: MSP-IMPROV and source domain: IEMOCAP. We plot the 2D-projected acoustic representations in the Source-only and MRD encoded spaces using t-SNE. It demonstrates a reduced semantic distortion, which is especially effective on the low valence samples.
3.1. Experiment 1 Results
Table 1 summarizes our Experiment 1 results. The Train-on-
Target can be seen as an upper bound of supervised regression
performances on both the IEMOCAP and the MSP-IMPROV; it
achieves 38.5% and 39.5% CCC on the two databases. Without
any domain adaptation, Source-only baseline achieves 17.7%
and 19.2% CCC, which indicates a severe domain mismatch
that degrades the regression performances significantly. DANN
improves 3.51% and 2.39% absolute over the baseline model
on the MSP-IMPROV and the IEMOCAP; Deep Coral method
shows an 5.32% improvement only when transferring from the
IEMOCAP to the MSP-IMPROV but not vice versa. Our pro-
posed MRD networks obtains an improvement of 10.4% and
5.01% absolute on the MSP-IMPROV and the IEMOCAP, re-
spectively.
MRD outperforms DANN and Deep Coral because these two methods do not explicitly constrain the learning of the encoder to maintain emotion semantic consistency when aligning the distributions of different databases. Without such a consistency constraint, the encoder may generate ineffective acoustic representations, because it only learns to make the two distributions indistinguishable while being guided to predict well only in the source domain. Valence has been stated as
being a higher level affective attribute, which requires substan-
tial contextual and cognitive appraisal [23]. It can easily lead to
inconsistent semantic interpretation across domains even when
features are being projected to a similar acoustic space. Furthermore, we also observe that domain adaptation methods generally improve more when transferring from the IEMOCAP to the MSP-IMPROV. While further study is needed, we hypothesize this may be due to the amount of source data available, i.e., larger variability in the source database often leads to learning a more robust representation, especially when utilizing an adversarial learning mechanism.
3.2. Experiment 2 Results
In Figure 2, we plot the 2D-projected acoustic representations of the Source-only model and our MRD encoder output, using t-SNE, for the task of transferring from the IEMOCAP to the MSP-IMPROV. We plot two sessions from the MSP-IMPROV. Session 5 corresponds to the subset of the MSP-IMPROV that has a lower regression performance when using the Source-only model (larger domain mismatch), and Session 6 corresponds to the subset that has a relatively higher regression performance prior to adaptation (smaller domain mismatch).
Generally, we observe that before MRD training, the encoded feature representations from these two sessions are very dissimilar between the two databases, and after the MRD training, the differences between the two domains' representations have decreased. By examining these t-SNE plots according to low valence and high valence samples, it is evident that our MRD helps improve the low valence samples more than the high valence samples. This indicates that the minimization of semantic distortion in the encoded representation from our MRD is more effective on the low valence samples. This could potentially be due simply to a larger amount of low valence samples being available in the source domain, or it could be related to the nature of the acoustic manifestation of the valence attribute, i.e., the semantically consistent space for the acoustically low valence representation may be easier to learn. However, additional investigation is needed to understand the contributing factors of emotional semantic distortion in the acoustic manifestations across different domains of emotional expressions.
4. Conclusion and Future work
In this work, we present a maximum regression discrepancy (MRD) network that learns to perform valence regression in an unsupervised cross corpus setting. Using the target samples' semantic discrepancy as feedback, our MRD adversarially learns to transform the two databases' acoustic representations into a semantically-consistent and distributionally-aligned space. This particular enforcement helps improve the valence regression performances. To our knowledge, MRD exceeds the current state-of-the-art unsupervised adaptation results on regressing the valence attribute. In our immediate future work, we would like to extend this framework to include lexical content, where the semantic consistency between the corpora can be further constrained by the language used. It would be interesting to further
identify the underlying factors contributing to such a semantic
distortion between databases especially on challenging higher
level emotion attributes, such as valence, to enhance the robust-
ness of the current emotion sensing technology.
5. References
[1] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf, “Covariate shift by kernel mean matching,” Dataset shift in machine learning, vol. 3, no. 4, p. 5, 2009.
[2] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep
learning for emotion recognition on small datasets using transfer
learning,” in Proceedings of the 2015 ACM on international con-
ference on multimodal interaction. ACM, 2015, pp. 443–449.
[3] J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M.
Provost, “Progressive neural networks for transfer learning in
emotion recognition,” arXiv preprint arXiv:1706.03256, 2017.
[4] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep
domain confusion: Maximizing for domain invariance, arXiv
preprint arXiv:1412.3474, 2014.
[5] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning trans-
ferable features with deep adaptation networks,” arXiv preprint
arXiv:1502.02791, 2015.
[6] P. Song, W. Zheng, S. Ou, X. Zhang, Y. Jin, J. Liu, and Y. Yu,
“Cross-corpus speech emotion recognition based on transfer non-
negative matrix factorization, Speech Communication, vol. 83,
pp. 34–41, 2016.
[7] M. Abdelwahab and C. Busso, “Domain adversarial for acoustic
emotion recognition,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 26, no. 12, pp. 2423–2435, 2018.
[8] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial
discriminative domain adaptation, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2017,
pp. 7167–7176.
[9] I. Laradji and R. Babanezhad, “M-adda: Unsupervised do-
main adaptation with deep metric learning,” arXiv preprint
arXiv:1807.02552, 2018.
[10] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum
classifier discrepancy for unsupervised domain adaptation,” in
Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, 2018, pp. 3723–3732.
[11] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
[12] C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception,” IEEE Transactions on Affective Computing, no. 1, pp. 67–80, 2017.
[13] B. Sun and K. Saenko, “Deep coral: Correlation alignment for
deep domain adaptation,” in European Conference on Computer
Vision. Springer, 2016, pp. 443–450.
[14] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by
backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
[15] M. Abdelwahab and C. Busso, “Incremental adaptation using
active learning for acoustic emotion recognition, in Acoustics,
Speech and Signal Processing (ICASSP), 2017 IEEE Interna-
tional Conference on. IEEE, 2017, pp. 5160–5164.
[16] ——, “Ensemble feature selection for domain adaptation in
speech emotion recognition,” in Acoustics, Speech and Signal
Processing (ICASSP), 2017 IEEE International Conference on.
IEEE, 2017, pp. 5000–5004.
[17] A. Burmania, S. Parthasarathy, and C. Busso, “Increasing the re-
liability of crowdsourcing evaluations using online quality assess-
ment,” IEEE Transactions on Affective Computing, vol. 7, no. 4,
pp. 374–388, 2016.
[18] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent devel-
opments in opensmile, the munich open-source multimedia fea-
ture extractor, in Proceedings of the 21st ACM international con-
ference on Multimedia. ACM, 2013, pp. 835–838.
[19] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, “The INTERSPEECH 2010 paralinguistic challenge,” in Proc. INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 2794–2797.
[20] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and
J. W. Vaughan, “A theory of learning from different domains,
Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and
S. Hochreiter, “Gans trained by a two time-scale update rule con-
verge to a local nash equilibrium, in Advances in Neural Infor-
mation Processing Systems, 2017, pp. 6626–6637.
[22] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative
adversarial networks, in International Conference on Machine
Learning, 2017, pp. 214–223.
[23] B. L. Omdahl, Cognitive appraisal, emotion, and empathy. Psy-
chology Press, 2014.