Can Deep Generative Audio be Emotional? Towards an Approach for Personalised Emotional Audio Generation

Alice Baird
Augsburg University, Germany
Shahin Amiriparian
Augsburg University, Germany
Björn Schuller
Augsburg University, Germany
Abstract—The ability for sound to evoke states of emotion is
well known across fields of research, with clinical and holistic
practitioners utilising audio to create listener experiences which
target specific needs. Neural network-based generative models
have in recent years shown promise for generating high-fidelity
audio based on a raw audio input. With this in mind, this study utilises
the WaveNet generative model to explore the ability of such
networks to retain the emotionality of raw audio speech inputs.
We train various models on two classes (happy and sad) of an
emotional speech corpus containing 68 native Italian speakers.
When classifying the combined original and generated audio,
hand-crafted feature sets achieve at best 75.5 % unweighted
average recall, a 2 percentage point improvement over features
from the original audio only. Additionally, from a two-tailed test on the
predictions, we find that the audio features from the original
speech concatenated with the generated audio features provide
a significantly different test result compared to the baseline. Both
findings indicate promise for emotion-based audio generation.
Deep generative networks (including Generative Adversarial
Networks (GANs) [1]) have found an abundance of use
cases within the field of machine learning in recent years.
In the computer audition community in particular, as well as
in vision, applications include domain adaptation [2] and
data manipulation [3], amongst others [4], [5]. Generative
methods, e. g., for speech enhancement [6], [7], have shown
substantial improvements over previous methods [8].
Higher fidelity audio comes with higher computational
costs, making generative networks not yet fully applicable for
realistic real-time audio-based applications. Through
the utilisation of pre-trained networks, however, real-time processing
does show promise for tasks including audio denoising [9].
Moreover, given the often lower dimensionality of the data sources,
more effective reinforced real-time frameworks have been
applied to vision tasks, e. g., image correction [10].
Given that synthetic audio has the ability to immerse a
listener in an emotional environment, and transmit an emotional
state [11], there is much room for research in the
realm of generative networks towards personalised variations
of emotional soundscapes. Implemented in real-time and with
specific personalisation ability, such an audio environment could
be deployed in daily life scenarios, e. g., in the workplace
to improve general quality of life [12].

All authors are affiliated to the ZD.B Chair of Embedded Intelligence for
Health Care and Wellbeing. Björn Schuller is also affiliated to GLAM – Group
on Language, Audio and Music, Imperial College London, UK.
978-1-7281-1817-8/19/$31.00 ©2019 European Union
Emotion is a subtle aspect of audio transmission, which may
not be captured via deep generative approaches. Generative
networks are, however, able to reach near-human replication
in the field of speech synthesis [13], and the perception of
various approaches, including the state of the art, has been
evaluated [14]. In addition, conversion of emotional speech
states utilising the WaveNet Vocoder framework has recently
shown promise [15], and approaches for deriving representations
of emotional speech features found deep convolutional
generative adversarial networks (DC-GANs) to be of most
benefit for feature generation, as compared to convolutional
neural network (CNN) architectures [5], [16]. However,
to the best of the authors’ knowledge, the advantages of
data augmentation utilising emotional data have not yet been
explored in the audio domain, although it is a topic that has
been shown to be successful for emotion-based visual data [17].
As an initial step in exploring this topic, we utilise the
large and highly emotionally diverse DEMOS corpus of Italian
emotional speech [18]. After applying pitch-based augmentation,
two classes (happy and sad) from the corpus are
used as training data for several speaker-independently partitioned
WaveNet models [19]. Following audio generation from the
WaveNet models, we extract both state-of-the-art and conventional
feature sets, including the deep representations of the
DEEP SPECTRUM toolkit [20], as well as hand-crafted features
from the extended Geneva Minimalistic Acoustic Parameter
Set (eGeMAPS) [21]. We choose these feature sets due to
their known strength for similar emotion recognition [22] and
classification [23] tasks. From both the generated and original
data, a series of classification experiments were performed to
ascertain if the generated audio is able to improve results,
assuming that this implies the inclusion of subtle emotion-related
features in the generated audio.
This paper is structured as follows. In Section II,
the corpus for our experiments is presented, including processing,
partitioning and augmentation. We then describe our
experimental settings for both the generative model and
the subsequent classification paradigm in Section III. This is followed
by a discussion of the results in Section IV, and conclusions in Section VI.
Fig. 1: An overview of the implementation utilised in this study for emotional audio generation with WaveNet. Features
were extracted using both the openSMILE and DEEP SPECTRUM toolkits, resulting in eight feature sets (Baseline DEMOS,
(G)enerated DEMOS, Augmented DEMOS, and Generated Augmented DEMOS for each toolkit). A Support Vector Machine was utilised
for classification experiments, with optimisation of the complexity only for the 2-class classification.
TABLE I: Speaker-independent partitions, Train,
(Dev)elopment, and Test, created from the DEMOS emotional
speech database, including the distribution of the two classes
(Happy and Sad).

             Train    Dev.     Test     Σ
Speakers     24       22       22       68
Gender M:F   15:9     15:7     15:7     45:23
Happy        447      434      514      1 395
Sad          493      486      551      1 530
Σ            940      920      1 065    2 925
For this study, we utilise the Database of Elicited Mood
in Speech (DEMOS) [18]. DEMOS is a corpus of induced
Italian emotional speech, including the ‘big 6’ emotions, plus
neutral, and additionally guilt. This dataset comprises
9 365 emotional instances and 332 neutral samples produced
by 68 native speakers (23 females, 45 males).
For this first-step study, we chose to use only the emotional
samples of happiness and sadness, as these fall in opposite positions
when observing the valence and arousal emotional
circumplex [24], and will possibly allow for a more significant
difference between the generated classes. The subset
of DEMOS that we utilised for the study thus has in total 2 925
instances, with a duration of 2 h:47 m:41 s.
A. Data Pre-Processing
As a first step, the DEMOS data was normalised across
all speakers. The data then remained in the provided format
of monophonic WAV at 44.1 kHz.
As a means of avoiding any speaker dependency during
training, the data was partitioned beforehand with consideration
to gender. There is a gender bias in the dataset (45:23,
male:female), and future work could consider gender-independent
models; however, with consideration to gender, we
balance each partition (train, development, and test) as equally as
possible, and additionally balance the instances for the two classes of interest,
happy and sad (cf. Table I for the partitioning applied)1.
1The Speaker IDs for each partition are as follows: (training) 01–17, 21,
22, 29, 31, 36–38; (development) 18–20, 23–28, 30, 32–34, 39–41,
43–48; and (test) 42, 49, 50–69.
B. Data Augmentation
In general, it is known that deep networks require a large
amount of data to achieve usable results. With this in mind,
although the DEMOS database is reasonable in size for an
emotional speech corpus, we applied pitch shifting to augment
the input data when training the WaveNet system.
Pitch shifting has been shown to be a strong choice over others,
such as time and noise augmentation, for similar tasks, including
environmental sound classification [25]. Augmentation was
only applied to the Train and Development partitions, keeping
the Test set entirely unchanged. Utilising the LibROSA
toolkit [26], we choose to raise or lower the pitch of the audio
samples, and keep the duration of the samples unchanged.
Each utterance was pitch shifted by 10 factors: 5 lower
{0.75, 0.80, 0.85, 0.90, 0.95} and 5 higher {1.05, 1.10, 1.15,
1.20, 1.25}. These increments were audibly observed to have
made minimal change to the original data. This results in an
augmented DEMOS data set (not including the unchanged Test
set) of 19 h:01 m:22 s.
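As an illustration of the augmentation step, the following minimal sketch converts the ten listed factors into semitone offsets and applies them with LibROSA. It assumes that the factors are frequency ratios, which the text does not state explicitly; the helper names `factor_to_semitones` and `augment_file` are hypothetical and not part of the original pipeline.

```python
import math

# Pitch-shift factors listed in the text: five lower and five higher.
LOWER = [0.75, 0.80, 0.85, 0.90, 0.95]
HIGHER = [1.05, 1.10, 1.15, 1.20, 1.25]


def factor_to_semitones(factor: float) -> float:
    """Convert a frequency ratio to fractional semitones: 12 * log2(ratio).

    Assumption: the factors in the text are frequency ratios.
    """
    return 12.0 * math.log2(factor)


def augment_file(path: str, factors=tuple(LOWER + HIGHER)):
    """Return {factor: shifted_signal} for one audio file (requires librosa)."""
    import librosa  # imported lazily so the conversion above runs without it

    y, sr = librosa.load(path, sr=None)  # sr=None keeps the native sampling rate
    # pitch_shift changes the pitch while keeping the duration unchanged.
    return {
        f: librosa.effects.pitch_shift(y, sr=sr, n_steps=factor_to_semitones(f))
        for f in factors
    }


if __name__ == "__main__":
    for f in LOWER + HIGHER:
        print(f"{f:.2f} -> {factor_to_semitones(f):+.2f} semitones")
```

Under this reading, each original utterance yields ten shifted copies, with the lowest factor corresponding to a shift of roughly five semitones down.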
As a first step to explore the potential of emotional audio
generation utilising generative networks, we utilise a TensorFlow
implementation of the WaveNet generative framework
for modelling raw audio [19]2. We choose WaveNet as this
is a standard framework in the field of audio generation
(cf. Figure 1 for an overview of the experimental setting).
WaveNet is an audio implementation of PixelCNN [27],
and is a generative network for modelling features of raw
audio, represented as 8-bit audio files, with 256 possible
values. During the training process, the model predicts audio
signal values (with a temporal resolution of at least 16 kHz)
at each step, comparing them to the true value and using cross-entropy
as a loss function. Hence, the WaveNet model implements
a 256-class classification [28]. As a means of decreasing
the computational expense, WaveNet applies
stacked dilated causal convolutions, which enlarge the receptive
field while minimising the loss in resolution [29].
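The 8-bit representation and the 256-class objective can be sketched as follows. The original WaveNet work [19] obtains the 256 values by mu-law companding; the code below is an illustrative pure-Python version of that quantisation together with a per-sample cross-entropy, not the TensorFlow implementation used in this study, and all function names are hypothetical.

```python
import math

MU = 255  # 8-bit companding gives 256 classes (0..255)


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(k: int, mu: int = MU) -> float:
    """Invert mu_law_encode up to quantisation error."""
    compressed = 2.0 * k / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


def cross_entropy(logits, target: int) -> float:
    """Cross-entropy between softmax(logits) and a one-hot target class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


if __name__ == "__main__":
    k = mu_law_encode(0.3)
    print(k, mu_law_decode(k))
    # An untrained model with uniform logits scores log(256) per sample.
    print(cross_entropy([0.0] * 256, k))
```

A model that assigns equal probability to all 256 classes incurs a loss of log(256) per sample, which gives a reference point for reading training curves.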
A. Model Training
As the input for the WaveNet model, we supply 6 training
sets (3 for each class). We then train three models separately
on the augmented Training, Development, and Training plus
Development partitions. Each WaveNet model was iterated for
100 000 steps and a silence threshold s of 0 was set. The threshold s acts
as a filter, ignoring samples of silence. Given the subtle nature
of emotion-based speech features, s was set to zero to avoid
loss of information. To reach 100 000 steps, ca. 22 hours were
needed on an Nvidia GTX TITAN X with 12 GB of VRAM
for each training set model.
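To make the role of the silence threshold concrete, the sketch below shows one way such a filter could operate: frames whose root-mean-square energy does not exceed s are trimmed from the start and end of a signal. This is a hypothetical illustration rather than the code of the implementation used here; in particular, the frame length and the convention that s = 0 disables trimming are assumptions of the sketch.

```python
import math


def frame_rms(samples, frame_length):
    """RMS energy of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        energies.append(math.sqrt(sum(v * v for v in frame) / len(frame)))
    return energies


def trim_silence(samples, threshold, frame_length=4):
    """Trim leading and trailing frames whose RMS energy is <= threshold.

    Assumption of this sketch: a threshold of 0 (or below) disables
    trimming, so every sample is kept.
    """
    if threshold <= 0:
        return list(samples)
    energies = frame_rms(samples, frame_length)
    loud = [i for i, e in enumerate(energies) if e > threshold]
    if not loud:
        return []
    start = loud[0] * frame_length
    end = min(len(samples), (loud[-1] + 1) * frame_length)
    return list(samples[start:end])


if __name__ == "__main__":
    signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
    print(len(trim_silence(signal, 0)), len(trim_silence(signal, 0.1)))
```

With s = 0 the full signal, including quiet passages that may carry emotional cues, is passed on to training.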
B. Audio Generation
After training each WaveNet model for 100 000 steps, we
generate new speech audio samples. Based on an approximation
of the mean duration from each partition of the original
DEMOS data (Train = 2.6 s, Dev. = 2.6 s, Train+Dev. =
2.6 s), we generated samples of 2.6 s for the Train, Development,
and combined Train and Development models,
with 2 123 instances in total. Due to limited computational
processing time, we only reach ca. 75 % of the original
DEMOS quantity3. For generation, the hyperparameters remain
the same, with a temperature threshold t of 1.0 applied;
lowering t causes the model to focus on higher-probability
predictions. Following this, we also apply data augmentation
in the same manner as described in Section II-B to each of
these generated partitions. Spectrogram plots of the generated
audio in comparison to the original audio data can be seen
in Figure 2. From a qualitative analysis of the generated data,
attributes of the original speech, i. e., accent and intonation, are
audible, despite the excess noise introduced
during generation.
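The effect of the temperature t can be illustrated with a small sketch of temperature-scaled sampling from a categorical distribution. The function names are hypothetical and the three-class logits are invented for illustration only; WaveNet samples from 256 classes at each time step.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_class(logits, temperature=1.0, rng=random):
    """Draw one class index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # illustrative values only
    print(softmax_with_temperature(logits, 1.0))
    print(softmax_with_temperature(logits, 0.5))
    print(sample_class(logits, 1.0, random.Random(0)))
```

At t = 1.0 the model's predicted distribution is sampled unchanged, whereas at t = 0.5 more probability mass moves to the most likely class.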
C. Feature Extraction and Fusion
For both datasets, the generated and the original, we extract
conventional hand-crafted features and state-of-the-art deep
representations. This results in 8 feature sets (4 hand-crafted
and 4 deep), from 4 variations of the data: (1) Original
DEMOS, (2) Augmented DEMOS, (3) Generated DEMOS,
and (4) Augmented Generated DEMOS.
Given the success of the extended Geneva Minimalistic
Acoustic Parameter Set (eGeMAPS) [21], we utilise this as
a conventional hand-crafted approach. From each instance, the
eGeMAPS acoustic features are extracted with the openSMILE
toolkit [30]. Using the default parameter settings from
openSMILE for the low-level descriptors (LLDs) of each
feature set, the higher-level suprasegmental features were
extracted over the entire audio instance.
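The step from frame-level LLDs to one suprasegmental vector per instance can be sketched as below. This is a toy illustration of applying functionals (summary statistics) over an entire instance; it does not reproduce the actual eGeMAPS functionals or the openSMILE configuration, and the descriptor names and values are invented.

```python
import statistics


def functionals(contour):
    """Summarise one frame-level LLD contour with instance-level statistics."""
    mean = statistics.fmean(contour)
    std = statistics.pstdev(contour)
    return {
        "mean": mean,
        "stddev_norm": std / mean if mean else 0.0,  # coefficient of variation
        "min": min(contour),
        "max": max(contour),
    }


def instance_vector(llds):
    """Concatenate the functionals of every LLD into one fixed-length vector."""
    vector = []
    for name in sorted(llds):
        stats = functionals(llds[name])
        vector.extend(stats[key] for key in ("mean", "stddev_norm", "min", "max"))
    return vector


if __name__ == "__main__":
    short = {"f0": [100.0, 110.0, 120.0], "loudness": [0.2, 0.4]}
    longer = {"f0": [90.0, 95.0, 100.0, 105.0, 110.0], "loudness": [0.1, 0.3, 0.5]}
    # Instances of different duration map to vectors of equal length.
    print(len(instance_vector(short)), len(instance_vector(longer)))
```

Because the statistics are computed over the whole instance, utterances of different length yield vectors of the same dimensionality, which is what allows a standard SVM to be trained on them.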
Additionally, we extract a 4 096 dimensional feature
set of deep data-representations with the DEEP SPECTRUM
toolkit [31]4. DEEP SPECTRUM has shown success for
other emotion-based speech tasks [23]. For this study, we
extract mel-spectrograms with a viridis colour map, using the
default DEEP SPECTRUM settings and the VGG16 pre-trained
ImageNet model [32], with no window size or overlap applied.

3For the interested reader, a selection of generated data can be found here:

(a) NP f 47 tri01b (b) tri 2.6 308 (c) NP f 47 gio03c (d) gio 2.6 10
Fig. 2: Spectrogram representation of speech files. ‘Sad’
original audio (a) and ‘Sad’ generated audio (b), as well as
‘Happy’ original audio (c) and ‘Happy’ generated audio (d).
File names are indicated in the panel labels. Excess high-frequency
noise can be seen for the generated audio; however, vocal features
such as formants are also seen to be replicated in the lower
frequency range.
D. Classification Approach
A support vector machine (SVM) implementation with a
linear kernel from the open-source machine learning toolkit
Scikit-Learn [33] is used for our experiments. During the
development phase, we trained a series of SVM models,
optimising the complexity parameter
($C \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$) and evaluating their performance on the Development
set. For the original DEMOS data, we re-trained the model with
the concatenated Train and Development sets, and evaluated the
performance on the Test set. For the Generated DEMOS, we
utilise the data which has been generated from the combined
Training and Development WaveNet, and evaluate on the
original DEMOS Test set. Further, upon creation of the 8 aforementioned
feature sets, we prepared 5 experiments which
were repeated for each feature set type (DEEP SPECTRUM and
eGeMAPS), in various combinations of the data, with all tested
on the original unseen DEMOS Test partition:
1) Baseline (original data for Training, Development).
2) Generated (generated speech for Training, Development).
3) Baseline + generated (combined baseline and generated
in Training, Development).
4) Generated + augmentation of original (combined gen-
erated with pitch shifting augmentation of original for
Training, Development).
5) Augmented generated + augmented original (combined
pitch shifting augmentation of generated speech with
augmented original for Training, Development).
When observing the results found in Table II, it can be
seen that for both DEEP SPECTRUM and eGeMAPS results,
there is an improvement on the classification baseline when
adding the generated audio to the Training set. We will
discuss experiments in relation to the number indicated in
the results table, as previously described in Section III-D. To
evaluate whether the difference between predictions is significant,
we conduct a two-tailed T-test, rejecting the null hypothesis
at a significance level of p < 0.05. For this, we
checked each Test set prediction result for normality using a
Shapiro-Wilk test [34].
From the DEEP SPECTRUM results we see slight improvement
when utilising the generated data with the original data.
Of most interest is experiment 3, in which the result improves
over the original baseline by 0.9 percentage points, at the same
classification complexity of $C = 10^{-2}$. When performing a T-test
with test predictions of experiment 3 against experiment
1, we obtain p = 0.05, which would suggest a borderline
significant difference in this improvement. We also see improved
results for experiment 5, although this is not found to be
significantly different to the baseline. Better results were found
for Test with larger complexity values; however,
this occurred due to overfitting on the Development set and
therefore the result is not reported.
Experiment 2 from the DEEP SPECTRUM results, which
utilises only the generated data in the Training set, performed
below chance level, implying that the original data is needed
for this scenario. However, when we observe the eGeMAPS
result for experiment 2, there is an increase of more than 5 percentage points in
comparison to DEEP SPECTRUM features (55.8 % vs. 49.0 % UAR). This shows promise,
although significance testing between experiment 2 of the
DEEP SPECTRUM and eGeMAPS results finds no significant
difference.
Continuing with the results from eGeMAPS features, the
best result is seen in experiment 4, with a 75.5 % UAR, 1.9
percentage points higher than the baseline experiment 1. This result
would suggest that the known emotionality of the eGeMAPS
feature set is better able to capture the emotion from the
generated data, as compared to the DEEP SPECTRUM result for
this experiment. However, no significant difference is found
between these experiments when evaluating with a T-test.
The results of the eGeMAPS experiment 5 are significantly
different from the baseline experiment 1, with p = 0.006. This
does show promise for additional pitch-based augmentation
on the generated data; however, for experiment 4, which is our
highest result, no significant difference over the baseline was
found (p = 0.069).
When considering the limitations of this study, we see that
the results do show promise, but there is minimal significance
to the improvement. It may be of benefit to consider deeper
networks for classification of the generated data rather than the
conventional SVM, specifically networks which incorporate
the time dependencies inherent to audio, e. g., RNNs
or convolutional RNNs [35]. In addition, incorporating
multiple data sets and more emotional classes may be fruitful
for evaluation, given the tendencies we have seen arise from
this first-step 2-class setup. Additionally, the pitch shifting
may also be altering emotional attributes; we would therefore
consider exploring alternative augmentation methods, including
additive noise and time-shifting.
Our results are also limited by the single
WaveNet architecture that we have implemented, and it would
be of interest to evaluate alternative generative networks,
including DC-GANs [36], allowing for deeper hyperparameter
optimisation. Additionally, other deep generative network
implementations which have shown success, e. g., SpecGAN [37],
may be useful for this task.

TABLE II: Results for 2-class classification (happy vs sad)
across all experimental setups on the DEMOS corpus as
described in Section III-D, utilising an SVM, optimising C,
and reporting unweighted average recall (UAR) in % for both
DEEP SPECTRUM and eGeMAPS feature sets, with their (Dim)ensions, including
(O)riginal, (G)enerated and (A)ugmented data. Chance level
for this task is 50 % UAR. * indicates significant difference
over the baseline (1).

DEEP SPECTRUM    Dim      C        Dev.    Test
(1) Baseline     4 096    10^-2    74.0    73.5
(2) G            4 096    10^-2    59.2    49.0
(3) G + O        4 096    10^-2    65.3    74.4
(4) G + AO       4 096    10^-3    80.4    73.4
(5) AG + AO      4 096    10^-3    85.7    74.1

eGeMAPS          Dim      C        Dev.    Test
(1) Baseline     88       10^-1    73.7    73.6
(2) G            88       10^-2    58.6    55.8
(3) G + O        88       1        76.9    74.0
(4) G + AO       88       1        79.9    75.5
(5) AG + AO      88       1        84.5    74.1*
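Unweighted average recall, the measure reported in Table II, is the mean of the per-class recalls, so that each class contributes equally regardless of its size. The following minimal sketch uses the Test partition sizes from Table I (514 happy, 551 sad) to show why the chance level is 50 % UAR; the function name is hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Test partition sizes from Table I.
    y_true = ["happy"] * 514 + ["sad"] * 551
    always_sad = ["sad"] * len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, always_sad)) / len(y_true)
    print(f"accuracy = {accuracy:.3f}")
    print(f"UAR      = {unweighted_average_recall(y_true, always_sad):.3f}")
```

A classifier that always predicts the majority class obtains an accuracy slightly above 50 % on this partition, but a UAR of exactly 50 %, which is why UAR is the more conservative measure for unbalanced classes.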
In this study, we have utilised an emotional speech corpus
to take a first step in evaluating the ability for emotional
audio to be regenerated via the generative model for raw
audio, WaveNet. In this way, we are working towards the use
of generative models as a means of generating personalised
emotional audio environments, e. g., adapting a stressful audio
environment (soundscape) into a more enjoyable space, based
on the needs of an individual.
Findings from this study have shown promise for emotional
generative audio, showing an improvement on a binary (happy
vs sad) classification paradigm. Both deep representations of the
audio and the hand-crafted features of eGeMAPS result in
improvements over the dataset baseline. Results suggest that
some emotionality is retained in the generated data; in particular,
we find a slightly above-chance result for eGeMAPS features
from only generated training sets.
Given the promise shown by these results, in future work
we would consider expanding our research on generative
audio across a variety of emotional audio domains, e. g., music
and the soundscape, as a means of exploring immersive use-cases.
In the same way, it would be of interest to explore
audio generation with longer duration, evaluating the human
perception of such emotion-based generation.
This work is funded by the Bavarian State Ministry of
Education, Science and the Arts in the framework of the Centre
Digitisation.Bavaria (ZD.B).
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Gen-
erative adversarial nets,” in Advances in Neural Information Processing
Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,
and K. Q. Weinberger, Eds. 2014, pp. 2672–2680, Curran Associates,
[2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru
Erhan, and Dilip Krishnan, “Unsupervised pixel-level domain adaptation
with generative adversarial networks,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 3722–
[3] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros,
“Generative visual manipulation on the natural image manifold,” in
European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[4] Jun Deng, Nicholas Cummins, Maximilian Schmitt, Kun Qian, Fabien
Ringeval, and Björn Schuller, “Speech-based diagnosis of autism
spectrum condition by generative adversarial network representations,”
in Proceedings of the International Conference on Digital Health. ACM,
2017, pp. 53–57.
[5] Shahin Amiriparian, Michael Freitag, Nicholas Cummins, Maurice Gerczuk,
Sergey Pugachevskiy, and Björn Schuller, “A fusion of deep
convolutional generative adversarial networks and sequence to sequence
autoencoders for acoustic scene classification,” in The European Signal
Processing Conference (EUSIPCO). IEEE, 2018, pp. 977–981.
[6] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “Segan:
Speech enhancement generative adversarial network,” arXiv preprint
arXiv:1703.09452, 2017.
[7] Daniel Michelsanti and Zheng-Hua Tan, “Conditional generative ad-
versarial networks for speech enhancement and noise-robust speaker
verification,” arXiv preprint arXiv:1709.01703, 2017.
[8] Zhuo Chen, Shinji Watanabe, Hakan Erdogan, and John R Hershey,
“Speech enhancement and recognition using multi-task learning of long
short-term memory recurrent neural networks,” in Sixteenth Annual
Conference of the International Speech Communication Association,
[9] Dario Rethage, Jordi Pons, and Xavier Serra, “A wavenet for speech
denoising,” in The IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
[10] Jie Li, Katherine A Skinner, Ryan M Eustice, and Matthew Johnson-
Roberson, “Watergan: unsupervised generative network to enable real-
time color correction of monocular underwater images,” IEEE Robotics
and Automation Letters, vol. 3, no. 1, pp. 387–394, 2017.
[11] Emilia Parada-Cabaleiro, Alice Baird, Nicholas Cummins, and Björn W
Schuller, “Stimulation of psychological listener experiences by semi-
automatically composed electroacoustic environments,” in 2017 IEEE
International Conference on Multimedia and Expo (ICME). IEEE, 2017,
pp. 1051–1056.
[12] Irene Van Kamp, Ronny Klaeboe, Hanneke Kruize, Alan Lex Brown,
and Peter Lercher, “Soundscapes, human restoration and quality of
life,” in INTER-NOISE and NOISE-CON Congress and Conference
Proceedings. Institute of Noise Control Engineering, 2016, vol. 253,
pp. 1205–1215.
[13] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan,
Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward
Lockhart, Luis C Cobo, Florian Stimberg, et al., “Parallel wavenet: Fast
high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.
[14] Alice Baird, Stina Hasse Jørgensen, Emilia Parada-Cabaleiro, Nicholas
Cummins, Simone Hantke, and Björn Schuller, “The perception of vocal
traits in synthesized voices: Age, gender, and human likeness,” Journal
of the Audio Engineering Society, vol. 66, no. 4, pp. 277–285, 2018.
[15] Heejin Choi, Sangjun Park, Jinuk Park, and Minsoo Hahn, “Emotional
speech synthesis for multi-speaker emotional dataset using wavenet
vocoder,” in 2019 IEEE International Conference on Consumer Electronics
(ICCE). IEEE, 2019, pp. 1–2.
[16] Jonathan Chang and Stefan Scherer, “Learning representations of emo-
tional speech with deep convolutional generative adversarial networks,”
in 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2017, pp. 2746–2750.
[17] Xinyue Zhu, Yifan Liu, Zengchang Qin, and Jiahong Li, “Data augmentation
in emotion classification using generative adversarial networks,”
arXiv preprint arXiv:1711.00648, 2017.
[18] Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Maximilian
Schmitt, and Björn W Schuller, “Demos: an italian emotional speech
corpus,” Language Resources and Evaluation, pp. 1–43, 2019.
[19] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and
Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,”
SSW, vol. 125, 2016.
[20] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins,
Michael Freitag, Sergey Pugachevskiy, Alice Baird, and Björn Schuller,
“Snore sound classification using image-based deep spectrum features,”
in Proc. INTERSPEECH, 2017, pp. 3512–3516.
[21] Florian Eyben, Klaus Scherer, Björn Schuller, Johan Sundberg, Elisabeth
André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka,
Shrikanth Narayanan, and Khiet Truong, “The Geneva Minimalistic
Acoustic Parameter Set (GeMAPS) for Voice Research and Affective
Computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2,
pp. 190–202, 2016.
[22] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi,
Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou, “Adieu features?
end-to-end speech emotion recognition using a deep convolutional
recurrent network,” in 2016 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, 2016, pp. 5200–5204.
[23] Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner,
Stefan Steidl, and Björn W Schuller, “An image-based deep
spectrum feature representation for the recognition of emotional speech,”
in Proceedings of the 25th ACM international conference on Multimedia.
ACM, 2017, pp. 478–484.
[24] J. A. Russell, “A circumplex model of affect,” Journal of Personality
and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
[25] Justin Salamon and Juan Pablo Bello, “Deep convolutional neural
networks and data augmentation for environmental sound classification,”
IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
[26] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt
McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music
signal analysis in python,” in Proceedings of the 14th python in science
conference, 2015, pp. 18–25.
[27] Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt,
Alex Graves, and Koray Kavukcuoglu, “Conditional image generation
with pixelcnn decoders,” CoRR, vol. abs/1606.05328, 2016.
[28] Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis, “Con-
ditioning deep generative raw audio models for structured automatic
music,” arXiv preprint arXiv:1806.09905, 2018.
[29] Fisher Yu and Vladlen Koltun, “Multi-scale context aggregation by
dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[30] Florian Eyben, Felix Weninger, Florian Gross, et al., “Recent Developments
in openSMILE, the Munich Open-Source Multimedia Feature
Extractor,” in Proc. ACM, Barcelona, Spain, 2013, pp. 835–838.
[31] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins,
Michael Freitag, Sergey Pugachevskiy, and Björn Schuller, “Snore
Sound Classification Using Image-based Deep Spectrum Features,” in
Proc. of INTERSPEECH, Stockholm, Sweden, 2017, ISCA, 5 pages.
[32] Karen Simonyan and Andrew Zisserman, “Very deep convolutional
networks for large-scale image recognition,” arXiv preprint
arXiv:1409.1556, 2014.
[33] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent
Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,
Ron Weiss, Vincent Dubourg, et al., “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12,
pp. 2825–2830, 2011.
[34] Jennifer Peat and Belinda Barton, Medical statistics: A guide to data
analysis and critical appraisal, John Wiley & Sons, 2008.
[35] Shahin Amiriparian, Alice Baird, Sahib Julka, Alyssa Alcorn, Sandra
Ottl, Suncica Petrović, Eloise Ainger, Nicholas Cummins, and
Björn Schuller, “Recognition of Echolalic Autistic Child Vocalisations
Utilising Convolutional Recurrent Neural Networks,” in Proceedings
INTERSPEECH 2018, 19th Annual Conference of the International
Speech Communication Association, Hyderabad, India, September 2018,
ISCA, pp. 2334–2338.
[36] Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised
representation learning with deep convolutional generative adversarial
networks,” arXiv preprint arXiv:1511.06434, 2015.
[37] Chris Donahue, Julian McAuley, and Miller Puckette, “Synthesizing
audio with generative adversarial networks,” arXiv preprint
arXiv:1802.04208, 2018.
... There are many advantages to generating new audio data computationally, mainly the scarcity of actual data, particularly in the speech emotion domain [1]. The time-dependent nature of audio makes sourcing and annotating such data an extremely timeconsuming process [2,3]. ...
... Data augmentation is another quantitative approach for evaluating the plausibility of generated audio [15,1]. In this case, generated samples are added to the training set of a classification paradigm, and where an increase in test accuracy is observed, the samples are deemed to be of value. ...
... The WAVEGAN we apply was trained using the default parameters described in [4] for 100 000 training steps. For our experiments, we generate samples until the quantity is equal to the classes within the source training data (total of 526 1-second samples) 1 . ...
... We can see that WaveNet achieves a good performance, especially on the development set. However, this WaveNet-based method was used for a binary classification task, instead of classifying seven emotional classes [216]. When comparing the methods for seven-class classification, the proposed approach performs significantly better than the data (raw audio waves and log mel spectrograms) augmentation methods using random noise (p < 0.001 in a one-tailed z-test). ...
... Although a large-scale dataset is helpful to train deep learning models, a recent trend in machine learning is to train models based on small-scale data, since there are still a number of small databases in real life. A potential solution is to augment the training data through synthesising new data using GANs [216,222]. Additionally, to train a DNN model on small quantities of data only, it is promising to apply the recently developed machine learning algorithms, including few-shot learning [223], one-shot learning [224], and zero-shot learning [225]. ...
Full-text available
Automatically recognising audio signals plays a crucial role in the development of intelligent computer audition systems. In particular, audio signal classification, which aims to predict a label for an audio wave, has enabled many real-life applications. Considerable effort has been devoted to developing effective audio signal classification systems for the real world. However, several challenges in deep learning techniques for audio signal classification remain to be addressed. For instance, training a deep neural network (DNN) from scratch to extract high-level deep representations is time-consuming. Furthermore, DNNs have not been explained well enough to build trust between humans and machines and to facilitate the development of realistic intelligent systems. Moreover, most DNNs are vulnerable to adversarial attacks, resulting in many misclassifications. To deal with these challenges, this thesis proposes and presents a set of deep-learning-based approaches for audio signal classification. In particular, to tackle the challenge of extracting high-level deep representations, transfer learning frameworks that benefit from models pre-trained on large-scale image datasets are introduced to produce effective deep spectrum representations. Furthermore, attention mechanisms at both the frame level and the time-frequency level are proposed to explain the DNNs by estimating the contributions of each frame and each time-frequency bin, respectively, to the predictions. Likewise, the convolutional neural network (CNN) with an attention mechanism at the time-frequency level is extended to atrous CNNs with attention, aiming to explain the CNNs by visualising high-resolution attention tensors. Additionally, to interpret the CNNs evaluated on multi-device datasets, the atrous CNNs with attention are trained in conditional training frameworks. Moreover, to improve the robustness of the DNNs against adversarial attacks, models are trained in adversarial training frameworks. In addition, the transferability of adversarial attacks is enhanced by a lifelong learning framework. Finally, experiments conducted on various datasets demonstrate that the presented approaches are effective in addressing these challenges.
... The openSMILE toolkit [23] is used for the extraction of the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [22]. This feature set, which has proven valuable for SER tasks [6] and in past MuSe challenges (e. g., [50]), comprises 88 acoustic features that can capture affective physiological changes in voice production. In MuSe-Humor, we use the default configuration to extract the 88 eGeMAPS functionals for each two-second audio frame. ...
The Multimodal Sentiment Analysis Challenge 2022 (MuSe 2022) is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities, and (iii) the Ulm Trier Social Stress Test (Ulm-TSST) dataset comprising audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reaction Sub-Challenge (MuSe-Reaction), seven fine-grained 'in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8444 for MuSe-Humor; a mean (over 7 classes) Pearson's correlation coefficient of .2801 for MuSe-Reaction; and a CCC of .4931 and .4761 for valence and arousal in MuSe-Stress, respectively.
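The MuSe-Stress results above are reported as CCC (concordance correlation coefficient). As a reminder of what that metric measures, here is a minimal sketch of the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). It assumes population (divide-by-n) statistics; the function name `ccc` and the toy sequences are illustrative only, and this is not the challenge's official evaluation code.

```python
def ccc(gold, pred):
    """Concordance correlation coefficient of two equal-length sequences.

    Assumption: population (divide-by-n) variance and covariance are used.
    """
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred)) / n
    denom = var_g + var_p + (mean_g - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


# Toy illustration (hypothetical values, not challenge data):
gold = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated, but offset by 1
print(ccc(gold, gold))
print(ccc(gold, shifted))
```

Unlike Pearson's correlation, CCC penalises a constant offset or scale mismatch between prediction and gold standard, which is why the shifted sequence scores below 1 even though it is perfectly correlated.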
... In the emotional speech synthesis domain (another subjective audio domain), the generative adversarial network based studies performed in [13]-[15] quantitatively evaluated synthesised speech using a classifier to demonstrate that the generated speech contained meaningful emotional information. We draw from these techniques to assess and compare the ability of three generative models from different deep learning paradigms to generate music that contains meaningful information relating to mood/theme or a listener's emotional response. ...
Despite advances in deep algorithmic music generation, evaluation of generated samples often relies on human evaluation, which is subjective and costly. We focus on designing a homogeneous, objective framework for evaluating samples of algorithmically generated music. Any engineered measures to evaluate generated music typically attempt to define the samples' musicality, but do not capture qualities of music such as theme or mood. We do not seek to assess the musical merit of generated music, but instead explore whether generated samples contain meaningful information pertaining to emotion or mood/theme. We achieve this by measuring the change in predictive performance of a music mood/theme classifier after augmenting its training data with generated samples. We analyse music samples generated by three models -- SampleRNN, Jukebox, and DDSP -- and employ a homogeneous framework across all methods to allow for objective comparison. This is the first attempt at augmenting a music genre classification dataset with conditionally generated music. We investigate the classification performance improvement using deep music generation and the ability of the generators to make emotional music by using an additional emotion annotation of the dataset. Finally, we use a classifier trained on real data to evaluate the label validity of class-conditionally generated samples.
... Recent work proposed to utilise generative adversarial network (GAN) frameworks [2,3,4] to convert the spectrum features of a source audio signal to the features of the target [5,6,7,8,9], thus eliminating the need for parallel data. In addition, evidence shows that the GAN-generated audio signals, as a source of augmenting data, are valuable for improving the predictive performance of speech emotion recognition (SER) models [6,10], which not only indicates that the GAN-generated audio signals indeed carry effective emotional information, but also provides a promising data augmentation solution for the SER task [11,12,13,14]. ...
... To cover a range of well-known acoustic features, we extract hand-crafted speech-based features, as well as a state-of-the-art approach, extracting spectrogram-based deep data representations from the speech signals. OPENSMILE: As a conventional and well-established approach, the 6 373-dimensional COMPARE feature set [22], and the 88-dimensional EGEMAPS feature set [23], are used given our experience in similar paralinguistic tasks [24,25]. From each instance, the COMPARE and EGEMAPS acoustic features are extracted with the OPENSMILE toolkit [22]. ...
Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.
Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilize them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image Convolutional Neural Networks (CNNs). The framework creates and augments Mel spectrogram plots on the fly from raw audio signals which are then used to finetune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real-time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone. DeepSpectrumLite operates decentralized, eliminating the need for data upload for further processing. We demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing by obtaining state-of-the-art results on a set of paralinguistic and general audio tasks, including speech and music emotion recognition, social signal processing, COVID-19 cough and COVID-19 speech analysis, and snore sound classification. We provide an extensive command-line interface for users and developers which is comprehensively documented and publicly available at .
We present DEMoS (Database of Elicited Mood in Speech), a new, large database with Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, for a comparison with the state-of-the-art, we model the ‘big 6 emotions’ and guilt. Besides making available this database for research, our contribution is three-fold: First, we employ a variety of mood induction procedures, whose combinations are especially tailored for specific emotions. Second, we use combinations of selection procedures such as an alexithymia test and self- and external assessment, obtaining 1,5 k (proto-) typical samples; these were used in a perception test (86 native Italian subjects, categorical identification and dimensional rating). Third, machine learning techniques—based on standardised brute-forced openSMILE ComParE features and support vector machine classifiers—were applied to assess how emotional typicality and sample size might impact machine learning efficiency. Our results are three-fold as well: First, we show that appropriate induction techniques ensure the collection of valid samples, whereas the type of self-assessment employed turned out not to be a meaningful measurement. Second, emotional typicality—which shows up in an acoustic analysis of prosodic main features—in contrast to sample size is not an essential feature for successfully training machine learning models. Third, the perceptual findings demonstrate that the confusion patterns mostly relate to cultural rules and to ambiguous emotions.
Conference Paper
Unsupervised representation learning shows high promise for generating robust features for acoustic scene analysis. In this regard, we propose and investigate a novel combination of features learnt using both a deep convolutional generative adversarial network (DCGAN) and a recurrent sequence to sequence autoencoder (S2SAE). Each of the representation learning algorithms is trained individually on spectral features extracted from audio instances. The learnt representations are: (i) the activations of the discriminator in case of the DCGAN, and (ii) the activations of a fully connected layer between the decoder and encoder units in case of the S2SAE. We then train two multilayer perceptron neural networks on the DCGAN and S2SAE feature vectors to predict the class labels. The individual predicted labels are combined in a weighted decision-level fusion to achieve the final prediction. The system is evaluated on the development partition of the acoustic scene classification data set of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). In comparison to the baseline, the accuracy is increased from 74.8 % to 86.4 % using only the DCGAN, to 88.5 % on the development set using only the S2SAE, and to 91.1 % after fusion of the individual predictions.
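The abstract above does not spell out how its weighted decision-level fusion is computed. One common reading is a weighted sum of per-class scores from the two classifiers, followed by an argmax, and the sketch below illustrates that reading only. The function name `fuse_decisions`, the weight, the class names, and the scores are all hypothetical, not taken from the paper.

```python
def fuse_decisions(scores_a, scores_b, weight_a=0.5):
    """Weighted decision-level fusion of two classifiers' per-class scores.

    Assumption: each classifier outputs a dict mapping class label -> score,
    and fusion is a convex combination of the two score dicts.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both classifiers must score the same classes")
    fused = {
        label: weight_a * scores_a[label] + (1.0 - weight_a) * scores_b[label]
        for label in scores_a
    }
    # The final prediction is the class with the highest fused score.
    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical scores for one audio instance from two systems:
dcgan_scores = {"beach": 0.6, "park": 0.4}
s2sae_scores = {"beach": 0.3, "park": 0.7}
label, fused = fuse_decisions(dcgan_scores, s2sae_scores, weight_a=0.4)
print(label, fused)
```

In practice the fusion weight would be tuned on held-out data, so that the more reliable system contributes more to the final decision.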
It is a difficult task to classify images with multiple class labels using only a small number of labeled examples, especially when the label (class) distribution is imbalanced. Emotion classification is an example of such an imbalanced label distribution, because some classes of emotions like disgusted are relatively rare compared to other labels like happy or sad. In this paper, we propose a data augmentation method using generative adversarial networks (GAN). It can complement and complete the data manifold and find better margins between neighboring classes. Specifically, we design a framework using a CNN model as the classifier and a cycle-consistent adversarial network (CycleGAN) as the generator. In order to avoid the vanishing gradient problem, we employ the least-squares loss as the adversarial loss. We also propose several evaluation methods on three benchmark datasets to validate GAN's performance. Empirical results show that we can obtain a 5%–10% increase in classification accuracy after employing the GAN-based data augmentation techniques.
The paralinguistics of the voice are the perceived states and traits that make that voice unique to the human body from which it resonates. In many cases the synthesized voice is produced by concatenated segments of recorded human speech, a complex process that can result in an arguably lifeless voice, which lacks the ability for free-expression among other human qualities. In recent years technology-based companies have been developing their own synthesized voice identities, yet seemingly paying little attention to the stereotypical traits being heard. Do such synthetic voice traits differ from the human traits they are modelled on? To explore this, the presented perception study performed by 18 listeners evaluated the paralinguistic traits of gender, age, and human likeness in the IBM voice library. Results herein have shown a similar trend to a previous study by the authors, with no voice achieving complete human likeness, no voice being perceived within a single age frequency band, and none tied solidly to its given binary gender, a novel finding as commercially available synthesized voices are typically developed to operate within binary identification structures.
While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to the problem of unsupervised audio generation. Unlike for images, a barrier to success is that the best discriminative representations for audio tend to be non-invertible, and thus cannot be used to synthesize listenable outputs. In this paper, we introduce WaveGAN, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting. Our experiments on speech demonstrate that WaveGAN can produce intelligible words from a small vocabulary of human speech, as well as synthesize audio from other domains such as bird vocalizations, drums, and piano. Qualitatively, we find that human judges prefer the generated examples from WaveGAN over those from a method which naively applies GANs on image-like audio feature representations.