Can Deep Generative Audio be Emotional? Towards an Approach for Personalised Emotional Audio Generation

Alice Baird
Augsburg University, Germany
Shahin Amiriparian
Augsburg University, Germany
Björn Schuller
Augsburg University, Germany
Abstract—The ability of sound to evoke states of emotion is well known across fields of research, with clinical and holistic practitioners utilising audio to create listener experiences which target specific needs. Neural network-based generative models have in recent years shown promise for generating high-fidelity audio from a raw audio input. With this in mind, this study utilises the WaveNet generative model to explore the ability of such networks to retain the emotionality of raw speech inputs. We train various models on 2 classes (happy and sad) of an emotional speech corpus containing 68 native Italian speakers. When classifying the combined original and generated audio, hand-crafted feature sets achieve at best 75.5 % unweighted average recall, a 2 percentage point improvement over the original-only audio features. Additionally, from a two-tailed test on the predictions, we find that the audio features from the original speech concatenated with the generated audio features provide a significantly different test result compared to the baseline. Both findings indicate promise for emotion-based audio generation.
Deep generative networks (including Generative Adversarial Networks (GANs) [1]) have found an abundance of use cases within the field of machine learning in recent years. In the computer audition community, as well as in vision, applications include domain adaptation [2] and data manipulation [3], amongst others [4], [5], with generative methods for, e. g., speech enhancement [6], [7] showing substantial improvements over previous methods [8].
Higher-fidelity audio comes with higher computational costs, making generative networks not yet fully applicable for realistic real-time audio-based applications. However, through the utilisation of pre-trained networks, real-time processing does show promise for tasks including audio denoising [9]. Given the often lower dimensionality of the data sources, more effective reinforced real-time frameworks have so far been applied to vision tasks, e. g., image correction [10].
Given that synthetic audio has the ability to immerse a listener in an emotional environment, and to transmit an emotional state [11], there is much room for research in the realm of generative networks towards personalised variations of emotional soundscapes. Implemented in real time and with specific personalisation ability, such an audio environment could be deployed in daily-life scenarios, e. g., in the workplace to improve general quality of life [12].
All authors are affiliated with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing. Björn Schuller is also affiliated with GLAM – Group on Language, Audio and Music, Imperial College London, UK. 978-1-7281-1817-8/19/$31.00 ©2019 European Union
Emotion is a subtle aspect of audio transmission which may not be captured via deep generative approaches. Generative networks are, however, able to reach near-human replication in the field of speech synthesis [13], and the perception of various approaches, including the state of the art, has been evaluated [14]. Furthermore, conversion of emotional speech states utilising the WaveNet vocoder framework has recently shown promise [15], and approaches for deriving representations of emotional speech features found deep convolutional generative adversarial networks (DC-GANs) to be of most benefit for feature generation, as compared to convolutional neural network (CNN) architectures [5], [16]. However, to the best of the authors’ knowledge, the advantages of data augmentation utilising emotional data have not yet been explored in the audio domain, although the topic has shown success for emotion-based visual data [17].
As an initial step in exploring this topic, we utilise the large and highly emotionally diverse DEMOS corpus of Italian emotional speech [18]. Applying pitch-based augmentation, 2 classes (happy and sad) from the corpus are then used as training data for several speaker-independently partitioned WaveNet models [19]. After audio generation from the WaveNet models, we extract both state-of-the-art and conventional feature sets, including the deep representations of the DEEP SPECTRUM toolkit [20], as well as hand-crafted features from the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [21]. We choose these feature sets due to their known strength for similar emotion recognition [22] and classification [23] tasks. From both the generated and original data, a series of classification experiments were performed to ascertain if the generated audio is able to improve results, under the assumption that an improvement implies the inclusion of subtle emotion-related features in the generated audio.
This paper is structured as follows. In Section II, the corpus for our experiments is presented, including processing, partitioning, and augmentation. We then describe our experimental settings in Section III, for both the generative model and the subsequent classification paradigm. This is followed by a discussion of results in Section IV, and conclusions in Section VI.
Fig. 1: An overview of the implementation utilised in this study for emotional audio generation with WaveNet. Features were extracted using both the openSMILE and DEEP SPECTRUM toolkits, resulting in eight feature sets (Baseline DEMOS, (G)enerated DEMOS, Augmented DEMOS, and Generated Augmented DEMOS). A Support Vector Machine (SVM) was utilised for classification experiments, with optimisation of the complexity only for the 2-class classification.
TABLE I: Speaker-independent partitions (Train, (Dev)elopment, Test) created from the DEMOS emotional speech database, including the distribution of the 2 classes (Happy and Sad).

              Train   Dev.   Test      Σ
Speakers         24     22     22     68
Gender M:F     15:9   15:7   15:7  45:23
Happy           447    434    514  1 395
Sad             493    486    551  1 530
Σ               940    920  1 065  2 925
For this study, we utilise the Database of Elicited Mood in Speech (DEMOS) [18]. DEMOS is a corpus of induced Italian emotional speech, including the ‘big 6’ emotions, plus neutral, and additionally guilt. This dataset is comprised of 9 365 emotional instances and 332 neutral samples produced by 68 native speakers (23 females, 45 males).
For this first-step study, we chose to use only the emotional samples of happiness and sadness, as these fall in opposite positions of the valence–arousal emotional circumplex [24], which will possibly allow for a more significant difference between the generated classes. In this way, the subset of DEMOS that we utilised for the study has in total 2 925 instances, with a duration of 2 h:47 m:41 s.
A. Data Pre-Processing
As a first step, the DEMOS data was normalised across all speakers. The data then remained in the provided format of monophonic WAV at 44.1 kHz.
As a means of avoiding any speaker dependency during training, partitioning of the data was made beforehand with consideration to gender. There is a gender bias in the dataset (45:23, male:female), and future work could consider gender-independent models; however, with consideration to gender, we balance each partition (train, development, and test) equally, in addition to balancing the instances for the 2 classes of interest, happy and sad (cf. Table I for the partitioning applied)1.
1The Speaker IDs for each partition are as follows: (training) 01-17, 21, 22, 29, 31, 36-38, (development) 18-20, 23-28, 30, 32-34, 39-41, 43-48, and (test) 42, 49, 50-69.
B. Data Augmentation
In general, it is known that deep networks require a large amount of data to achieve usable results. With this in mind, although the DEMOS database is reasonable in size for an emotional speech corpus, we applied pitch shifting to augment the input data when training the WaveNet system.
Pitch shifting has been shown to be a strong choice over others, such as time and noise augmentation, for similar tasks including environmental sound classification [25]. Augmentation was only applied to the Train and Development partitions, keeping the Test set entirely unchanged. Utilising the LibROSA toolkit [26], we raise or lower the pitch of the audio samples while keeping their duration unchanged. Each utterance was pitch shifted by 10 factors; 5 lower, {0.75, 0.80, 0.85, 0.90, 0.95}, and 5 higher, {1.05, 1.10, 1.15, 1.20, 1.25}, which are audibly observed to have made minimal change to the original data. This results in an augmented DEMOS data set (not including the unchanged Test set) of 19 h:01 m:22 s.
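A minimal sketch of this augmentation step is given below. Note that librosa's pitch_shift works in semitone steps rather than frequency-ratio factors, so the factors quoted above are first converted; the helper names are our own illustration, not the study's exact code:

```python
import numpy as np

def factor_to_semitones(factor):
    """Convert a pitch-ratio factor (e.g. 1.05) into semitone steps."""
    return 12.0 * np.log2(factor)

def augment_pitch(y, sr,
                  factors=(0.75, 0.80, 0.85, 0.90, 0.95,
                           1.05, 1.10, 1.15, 1.20, 1.25)):
    """Return one pitch-shifted copy of y per factor; duration is unchanged."""
    import librosa  # imported here so the conversion helper stays dependency-free
    return [librosa.effects.pitch_shift(y, sr=sr,
                                        n_steps=factor_to_semitones(f))
            for f in factors]
```

Because pitch shifting (unlike naive resampling) preserves sample count, the augmented set keeps the original utterance durations, as described above.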
As a first step to explore the potential of emotional audio generation utilising generative networks, we utilise a TensorFlow implementation of the WaveNet generative framework for modelling raw audio [19]2. We choose WaveNet as it is a standard framework in the field of audio generation (cf. Figure 1 for an overview of the experimental setting).
WaveNet is an audio counterpart of PixelCNN [27], and is a generative network for modelling features of raw audio, represented as 8-bit audio with 256 possible values. During the training process, the model predicts audio signal values (with a temporal resolution of at least 16 kHz) at each step, comparing to the true value using cross-entropy as a loss function. Hence, the WaveNet model implements a 256-class classification [28]. As a means of decreasing the computational expense, WaveNet applies stacked dilated causal convolutions, which grow the receptive field exponentially with depth at low computational cost, minimising the loss in resolution [29].
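The 8-bit, 256-value representation mentioned above is obtained in the original WaveNet via μ-law companding of the waveform before quantisation; a minimal NumPy sketch of this encoding (function names are ours) is:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Quantise audio in [-1, 1] into 256 discrete classes via mu-law companding."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] to integer bins 0..255
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert the quantisation back to a waveform in [-1, 1]."""
    x = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(x) * ((1 + mu) ** np.abs(x) - 1) / mu
```

The cross-entropy loss is then computed over these 256 classes at every time step, which is what makes WaveNet training a classification rather than a regression problem.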
A. Model Training
As the input for the WaveNet model, we supply 6 training sets (3 for each class). We then train three models per class, separately on the augmented Training, Development, and Training plus Development partitions. The WaveNet model was iterated for 100 000 steps, and a silence threshold s of 0 was set. s acts as a filter, ignoring samples of silence; given the subtle nature of emotion-based speech features, s was set to zero to avoid loss of information. To reach 100 000 steps, ca. 22 hours were needed on an Nvidia GTX TITAN X with 12 GB of VRAM for each training set model.
B. Audio Generation
After training each WaveNet model for 100 000 steps, we generate new speech audio samples. Based on an approximation of the mean duration from each partition of the original DEMOS data (Train = 2.6 s, Dev. = 2.6 s, Train+Dev. = 2.6 s), we generated samples of 2.6 s for the Train, Development, and combined Train and Development models, with a total of 2 123 instances. Due to limited computational processing time, we only reach ca. 75 % of the original DEMOS quantity3. For generation, the hyperparameters remain the same, with a temperature threshold t of 1.0 applied – lowering t causes the model to focus on higher-probability predictions. Following this, we also apply data augmentation in the same manner as described in Section II-B to each of these generated partitions. Spectrogram plots of the generated audio in comparison to the original audio data can be seen in Figure 2. From a qualitative analysis of the generated data, attributes of the original speech, i. e., accent and intonation, are audible, despite the noise that occurred in excess during generation.
C. Feature Extraction and Fusion
For both datasets, the generated and the original, we extract conventional hand-crafted features and state-of-the-art deep representations. This results in 8 feature sets (4 hand-crafted and 4 deep) from 4 variations of the data: (1) Original DEMOS, (2) Augmented DEMOS, (3) Generated DEMOS, and (4) Augmented Generated DEMOS.
Given the success of the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [21], we utilise this as a conventional hand-crafted approach. From each instance, the eGeMAPS acoustic features are extracted with the openSMILE toolkit [30]. Using the default openSMILE parameter settings for the low-level descriptors (LLDs), the higher-level suprasegmental features were extracted over the entire audio instance.
Additionally, we extract a 4 096-dimensional feature set of deep data representations with the DEEP SPECTRUM toolkit [31]4. DEEP SPECTRUM has shown success for other emotion-based speech tasks [23]. For this study, we extract mel-spectrograms with a viridis colour map, using the
3For the interested reader, a selection of generated data can be found here:
(a) NP f 47 tri01b (b) tri 2.6 308 (c) NP f 47 gio03c (d) gio 2.6 10
Fig. 2: Spectrogram representations of speech files: ‘Sad’ original audio (a) and ‘Sad’ generated audio (b), as well as ‘Happy’ original audio (c) and ‘Happy’ generated audio (d). File names are indicated in the subcaptions. High-frequency noise can be seen in excess for the generated audio; however, vocal features such as formants are also seen to be replicated in the lower frequency range.
default DEEP SPECTRUM settings and the VGG16 model pre-trained on ImageNet [32], with no window size or overlap applied.
D. Classification Approach
A support vector machine (SVM) implementation with a linear kernel from the open-source machine learning toolkit Scikit-Learn [33] is used for our experiments. During the development phase, we trained a series of SVM models, optimising the complexity parameter (C ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1}) by evaluating their performance on the Development set. For the original DEMOS data, we re-trained the model on the concatenated Train and Development sets, and evaluate the performance on the Test set. For the Generated DEMOS, we utilise the data which has been generated from the combined Training and Development WaveNet model, and evaluate on the original DEMOS Test set. Further, upon creation of the 8 aforementioned feature sets, we prepared 5 experiments which were repeated for each feature set type (DEEP SPECTRUM and eGeMAPS), in various combinations of the data, with all tested on the original unseen DEMOS Test partition:
1) Baseline (original data for Training, Development).
2) Generated (generated speech for Training, Development).
3) Baseline + generated (combined baseline and generated in Training, Development).
4) Generated + augmentation of original (combined generated with pitch-shifting augmentation of original for Training, Development).
5) Augmented generated + augmented original (combined pitch-shifting augmentation of generated speech with augmented original for Training, Development).
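The train/develop/refit procedure described above can be sketched with Scikit-Learn as follows. The helper name and placeholder feature matrices are ours; UAR corresponds to macro-averaged recall in scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def train_eval_svm(X_train, y_train, X_dev, y_dev, X_test, y_test,
                   c_values=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    """Select C by development-set UAR, refit on train+dev, report test UAR."""
    def dev_uar(c):
        clf = SVC(kernel="linear", C=c).fit(X_train, y_train)
        return recall_score(y_dev, clf.predict(X_dev), average="macro")
    best_c = max(c_values, key=dev_uar)
    # re-train on the concatenated Train and Development sets
    clf = SVC(kernel="linear", C=best_c).fit(
        np.vstack([X_train, X_dev]), np.concatenate([y_train, y_dev]))
    test_uar = recall_score(y_test, clf.predict(X_test), average="macro")
    return best_c, test_uar
```

UAR (unweighted average recall) is preferred over accuracy here because it weights both classes equally regardless of their instance counts.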
When observing the results found in Table II, it can be seen that for both DEEP SPECTRUM and eGeMAPS, there is an improvement on the classification baseline when adding the generated audio to the Training set. We discuss the experiments in relation to the numbers indicated in the results table, as previously described in Section III-D. To evaluate whether differences between predictions are significant, we conduct a two-tailed t-test, rejecting the null hypothesis at a significance level of p < 0.05. For this, we first checked each Test set prediction result for normality using a Shapiro–Wilk test [34].
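This significance-testing procedure can be sketched with SciPy's standard tests; the prediction arrays are placeholders and the helper name is ours:

```python
from scipy import stats

def compare_predictions(preds_a, preds_b, alpha=0.05):
    """Shapiro-Wilk normality check on each prediction set, followed by a
    two-tailed independent t-test between the two sets."""
    normal = all(stats.shapiro(p).pvalue > alpha for p in (preds_a, preds_b))
    t_stat, p_val = stats.ttest_ind(preds_a, preds_b)  # two-tailed by default
    return {"normal": normal, "t": t_stat, "p": p_val,
            "significant": p_val < alpha}
```

Rejecting the null hypothesis (p < 0.05) here means the two prediction sets are unlikely to be drawn from the same distribution, i.e., the model behaviours differ beyond chance.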
From the DEEP SPECTRUM results, we see a slight improvement when utilising the generated data together with the original data. Of most interest is experiment 3, in which the result improves over the original baseline by 0.9 percentage points, at the same classification complexity of C = 10⁻². When performing a t-test with the test predictions of experiment 3 against experiment 1, we obtain p = 0.05, which would suggest a borderline significant difference in this improvement. We also see improved results for experiment 5, although this is not found to be significantly different from the baseline. Better Test results were found with larger complexity optimisation values; however, this occurred due to overfitting on the Development set, and the result is therefore not reported.
Experiment 2 of the DEEP SPECTRUM results, which utilises the generated data only in the Training set, scores below chance level, implying that the original data is needed in this scenario. However, when we observe the eGeMAPS result for experiment 2, there is a 5 percentage point increase in comparison to the DEEP SPECTRUM features. This shows promise, although significance testing between experiment 2 of the DEEP SPECTRUM and eGeMAPS results finds no significant difference.
Continuing with the results from the eGeMAPS features, the best result is seen in experiment 4, with 75.5 % UAR, 1.9 percentage points higher than the baseline experiment 1. This result would suggest that the known emotionality of the eGeMAPS feature set is better able to capture the emotion in the generated data, as compared to the DEEP SPECTRUM result for this experiment. However, no significant difference is found between these experiments when evaluating with a t-test. The results of the eGeMAPS experiment 5 are significantly different from the baseline experiment 1, with p = 0.006. This does show promise for additional pitch-based augmentation on the generated data; however, for experiment 4, which is our highest result, no significant difference over the baseline was found (p = 0.069).
When considering the limitations of this study, we see that the results do show promise, but the significance of the improvement is minimal. It may be of benefit to consider deeper networks for classification of the generated data rather than the conventional SVM; specifically, networks which incorporate the time dependencies inherent to audio, e. g., RNNs or convolutional RNNs [35]. In addition, incorporating multiple data sets and more emotional classes may be fruitful for evaluation, given the tendencies we have seen arise from this first-step 2-class setup. The pitch shifting may also be altering emotional attributes; we would therefore consider exploring alternative augmentation methods, including additive noise and time shifting.
In this way, our results are also limited by the single WaveNet architecture that we have implemented, and it would be of best interest to evaluate alternative generative networks
TABLE II: Results for the 2-class classification (happy vs sad) across all experimental setups on the DEMOS corpus as described in Section III-D, utilising an SVM, optimising C, and reporting unweighted average recall (UAR) for both the DEEP SPECTRUM and eGeMAPS feature set (Dim)ensions, including (O)riginal, (G)enerated, and (A)ugmented data. Chance level for this task is 50 % UAR. * indicates a significant difference over the baseline (1).

DEEP SPECTRUM      Dim      C   Dev.   Test
(1) Baseline     4 096   10⁻²   74.0   73.5
(2) G            4 096   10⁻²   59.2   49.0
(3) G + O        4 096   10⁻²   65.3   74.4
(4) G + A O      4 096   10⁻³   80.4   73.4
(5) A G + A O    4 096   10⁻³   85.7   74.1

eGeMAPS            Dim      C   Dev.   Test
(1) Baseline        88   10⁻¹   73.7   73.6
(2) G               88   10⁻²   58.6   55.8
(3) G + O           88      1   76.9   74.0
(4) G + A O         88      1   79.9   75.5
(5) A G + A O       88      1   84.5   74.1*
including DC-GANs [36], allowing for deeper hyperparameter optimisation. Additionally, other deep generative network implementations which have shown success, e. g., SpecGAN [37], may be useful for this task.
In this study, we have utilised an emotional speech corpus to take a first step in evaluating the ability of emotional audio to be regenerated via WaveNet, a generative model for raw audio. In this way, we are working towards the use of generative models as a means of generating personalised emotional audio environments, e. g., adapting a stressful audio environment (soundscape) into a more enjoyable space, based on the needs of an individual.
The findings have shown promise for emotional generative audio, showing an improvement on a binary (happy vs sad) classification paradigm. Deep representations of the audio, and the hand-crafted features of eGeMAPS, result in improvements over the dataset baseline. The results suggest that some emotionality is retained in the generated data; in particular, we find a slightly above-chance result for eGeMAPS features from only-generated training sets.
Given the promise shown by these results, in future work we would consider expanding our research on generative audio across a variety of emotional audio domains, e. g., music and the soundscape, as a means of exploring immersive use cases. In the same way, it would be of interest to explore audio generation of longer duration, evaluating the human perception of such emotion-based generation.
This work is funded by the Bavarian State Ministry of
Education, Science and the Arts in the framework of the Centre
Digitisation.Bavaria (ZD.B).
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. 2014, pp. 2672–2680, Curran Associates.
[2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3722–.
[3] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros, “Generative visual manipulation on the natural image manifold,” in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[4] Jun Deng, Nicholas Cummins, Maximilian Schmitt, Kun Qian, Fabien Ringeval, and Björn Schuller, “Speech-based diagnosis of autism spectrum condition by generative adversarial network representations,” in Proceedings of the International Conference on Digital Health. ACM, 2017, pp. 53–57.
[5] Shahin Amiriparian, Michael Freitag, Nicholas Cummins, Maurice Gerczuk, Sergey Pugachevskiy, and Björn Schuller, “A fusion of deep convolutional generative adversarial networks and sequence to sequence autoencoders for acoustic scene classification,” in The European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 977–981.
[6] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[7] Daniel Michelsanti and Zheng-Hua Tan, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” arXiv preprint arXiv:1709.01703, 2017.
[8] Zhuo Chen, Shinji Watanabe, Hakan Erdogan, and John R Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association,
[9] Dario Rethage, Jordi Pons, and Xavier Serra, “A wavenet for speech denoising,” in The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
[10] Jie Li, Katherine A Skinner, Ryan M Eustice, and Matthew Johnson-Roberson, “WaterGAN: unsupervised generative network to enable real-time color correction of monocular underwater images,” IEEE Robotics and Automation Letters, vol. 3, no. 1, pp. 387–394, 2017.
[11] Emilia Parada-Cabaleiro, Alice Baird, Nicholas Cummins, and Björn W Schuller, “Stimulation of psychological listener experiences by semi-automatically composed electroacoustic environments,” in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 1051–1056.
[12] Irene Van Kamp, Ronny Klaeboe, Hanneke Kruize, Alan Lex Brown,
and Peter Lercher, “Soundscapes, human restoration and quality of
life,” in INTER-NOISE and NOISE-CON Congress and Conference
Proceedings. Institute of Noise Control Engineering, 2016, vol. 253,
pp. 1205–1215.
[13] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan,
Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward
Lockhart, Luis C Cobo, Florian Stimberg, et al., “Parallel wavenet: Fast
high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.
[14] Alice Baird, Stina Hasse Jørgensen, Emilia Parada-Cabaleiro, Nicholas Cummins, Simone Hantke, and Björn Schuller, “The perception of vocal traits in synthesized voices: Age, gender, and human likeness,” Journal of the Audio Engineering Society, vol. 66, no. 4, pp. 277–285, 2018.
[15] Heejin Choi, Sangjun Park, Jinuk Park, and Minsoo Hahn, “Emotional speech synthesis for multi-speaker emotional dataset using wavenet vocoder,” in 2019 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2019, pp. 1–2.
[16] Jonathan Chang and Stefan Scherer, “Learning representations of emo-
tional speech with deep convolutional generative adversarial networks,”
in 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2017, pp. 2746–2750.
[17] Xinyue Zhu, Yifan Liu, Zengchang Qin, and Jiahong Li, “Data augmentation in emotion classification using generative adversarial networks,” arXiv preprint arXiv:1711.00648, 2017.
[18] Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Maximilian Schmitt, and Björn W Schuller, “DEMoS: an Italian emotional speech corpus,” Language Resources and Evaluation, pp. 1–43, 2019.
[19] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” SSW, vol. 125, 2016.
[20] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, Alice Baird, and Björn Schuller, “Snore sound classification using image-based deep spectrum features,” in Proc. INTERSPEECH, 2017, pp. 3512–3516.
[21] Florian Eyben, Klaus Scherer, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth Narayanan, and Khiet Truong, “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.
[22] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou, “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204.
[23] Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner, Stefan Steidl, and Björn W Schuller, “An image-based deep spectrum feature representation for the recognition of emotional speech,” in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 478–484.
[24] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
[25] Justin Salamon and Juan Pablo Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
[26] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt
McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music
signal analysis in python,” in Proceedings of the 14th python in science
conference, 2015, pp. 18–25.
[27] Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” CoRR, vol. abs/1606.05328, 2016.
[28] Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis, “Con-
ditioning deep generative raw audio models for structured automatic
music,” arXiv preprint arXiv:1806.09905, 2018.
[29] Fisher Yu and Vladlen Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[30] Florian Eyben, Felix Weninger, Florian Gross, et al., “Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor,” in Proc. ACM Multimedia, Barcelona, Spain, 2013, pp. 835–838.
[31] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, and Björn Schuller, “Snore Sound Classification Using Image-based Deep Spectrum Features,” in Proc. of INTERSPEECH, Stockholm, Sweden, 2017, ISCA, 5 pages.
[32] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[33] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[34] Jennifer Peat and Belinda Barton, Medical statistics: A guide to data
analysis and critical appraisal, John Wiley & Sons, 2008.
[35] Shahin Amiriparian, Alice Baird, Sahib Julka, Alyssa Alcorn, Sandra Ottl, Sunčica Petrović, Eloise Ainger, Nicholas Cummins, and Björn Schuller, “Recognition of Echolalic Autistic Child Vocalisations Utilising Convolutional Recurrent Neural Networks,” in Proceedings INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, September 2018, ISCA, pp. 2334–2338.
[36] Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[37] Chris Donahue, Julian McAuley, and Miller Puckette, “Synthesizing audio with generative adversarial networks,” arXiv preprint arXiv:1802.04208, 2018.
... Most prominently those known as deep generative models, which essentially provide a representation of a probability distributions over multiple variables [152]. Such networks include Variational Autoencoders (VAE) Generative Adversarial Network (GAN), and Deep Auto-regressive Networks (DARN), and in this section a description of only GANs and DARNs is given, as these have been applied in works by the author, namely [170] and [26] with further detail given in Chapter 4 Section 4.4. ...
... Given that the generative model is based on the source data training set, the assumption is made that a better performing model will produce data samples within the distribution of the source data. However, this method is a 'black-box', and only limited interpretations can be made about the overall quality of the data [170]. ...
... In this section, first the efficacy of this in relation to emotional speech is explored, and proceeding to this a method for evaluating the generated samples is then presented. These experiments are based largely on two published works by the author, firstly [170], where WaveNet was applied for the first time in the context of generating emotional speech. The limitations of this work [170] were then addressed in a later publication [26], which is the main focus for the current experiments. ...
This thesis is focused on the application of computer audition (i. e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e. g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, and so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence has. To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. 
Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e. g., anxiety, and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved with developing and applying such systems are discussed throughout.
... Using the openSMILE toolkit [25], we extract the 88 extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features [24], which have proven suitable for sentiment analysis and speech emotion recognition (SER) tasks [8,59]. For each sub-challenge, we apply the standard configuration and extract features with a window size of two seconds and a hop size of 500 ms. ...
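Under the windowing scheme described above, one 88-dimensional functional vector is produced per analysis window. The window/hop arithmetic can be sketched in pure Python as follows (an illustration of the framing convention only, not the openSMILE implementation; clips shorter than one window are assumed to yield a single window):

```python
def window_starts(n_samples, sr, win_s=2.0, hop_s=0.5):
    """Start indices (in samples) of successive analysis windows.
    One functional vector (e.g. 88 eGeMAPS features) per window."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    # Clips shorter than one window still yield a single start at 0.
    return list(range(0, max(n_samples - win, 0) + 1, hop))
```

For example, a 10 s clip at 16 kHz yields 17 windows with a 2 s window and 500 ms hop.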
Full-text available
MuSe 2023 is a set of shared tasks addressing three contemporary multimodal affect and sentiment analysis problems: In the Mimicked Emotions Sub-Challenge (MuSe-Mimic), participants predict three continuous emotion targets. This sub-challenge utilises the Hume-Vidmimic dataset, comprising user-generated videos. For the Cross-Cultural Humour Detection Sub-Challenge (MuSe-Humour), an extension of the Passau Spontaneous Football Coach Humour (Passau-SFCH) dataset is provided. Participants predict the presence of spontaneous humour in a cross-cultural setting. The Personalisation Sub-Challenge (MuSe-Personalisation) is based on the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, featuring recordings of subjects in a stressful situation. Here, arousal and valence signals are to be predicted, while parts of the test labels are made available in order to facilitate personalisation. MuSe 2023 seeks to bring together a broad audience from different research communities, such as audio-visual emotion recognition, natural language processing, signal processing, and health informatics. In this baseline paper, we introduce the datasets, sub-challenges, and provided feature sets. As a competitive baseline system, a Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) is employed. On the respective sub-challenges' test datasets, it achieves a mean (across three continuous intensity targets) Pearson's Correlation Coefficient of .4727 for MuSe-Mimic, an Area Under the Curve (AUC) value of .8310 for MuSe-Humour, and Concordance Correlation Coefficient (CCC) values of .7482 for arousal and .7827 for valence in the MuSe-Personalisation sub-challenge.
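The CCC metric reported above can be computed directly from its definition, combining Pearson correlation with penalties for mean and variance mismatch. A minimal pure-Python sketch:

```python
def ccc(x, y):
    """Concordance Correlation Coefficient between two sequences.
    Equals 1 only for perfect agreement (same values, not just
    perfect correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson's correlation, CCC is penalised when predictions are shifted or scaled relative to the gold standard, which is why it is the standard metric for continuous arousal/valence prediction.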
... Personalisation is expected to be another major aspect of future ESS systems. Both the expression [152,153,154] and the perception [155] of emotion show individualistic effects, which are currently underexploited in the ESS field. Future approaches can benefit greatly from adopting a similar mindset, adapting the production of emotional speech to a style that fits both the speaker and the listener. ...
Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions -- aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive overview of this exciting field.
... The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [21] is smaller in size and was designed for affect-oriented computational paralinguistics tasks. Both sets continue to be valuable for several computational emotion-based speech studies [22,23]. ...
Charisma is considered to be one's ability to attract and potentially influence others. Clearly, there can be considerable interest from an artificial intelligence (AI) perspective in providing it with such a skill. Beyond this, a plethora of use cases opens up for the computational measurement of human charisma, such as tutoring humans in the acquisition of charisma, mediating human-to-human conversation, or identifying charismatic individuals in big social data. While charisma is a subject of research in its own right, a number of models exist that base it on various “pillars,” that is, dimensions, often following the idea that charisma is present if someone could and would help others. Examples of such pillars therefore include influence (could help) and affability (would help) in scientific studies, or power (could help), presence, and warmth (both would help) as a popular concept. Modeling high levels in these dimensions, i. e., high influence and high affability, or high power, presence, and warmth, for the charismatic AI of the future, e. g., for humanoid robots or virtual agents, seems accomplishable. Moreover, automatic measurement appears quite feasible given recent advances in the related fields of Affective Computing and Social Signal Processing. Here, we therefore present a brick-by-brick blueprint for building machines that can both appear charismatic and analyse the charisma of others. We first approach the topic broadly and discuss how the foundation of charisma is defined from a psychological perspective. Throughout the manuscript, the building blocks (bricks) then become more specific and provide concrete groundwork for capturing charisma through AI. Following the introduction of the concept of charisma, we switch to charisma in spoken language as an exemplary modality that is essential for human-human and human-computer conversations. The computational perspective then deals with the recognition and generation of charismatic behavior by AI. This includes an overview of the state of play in the field and the aforementioned blueprint. We then list exemplary use cases of computational charismatic skills. The building blocks of application domains and ethics conclude the article.
We present DEMoS (Database of Elicited Mood in Speech), a new, large database of Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, for comparison with the state of the art, we model the ‘big 6’ emotions and guilt. Besides making this database available for research, our contribution is three-fold: First, we employ a variety of mood induction procedures, whose combinations are especially tailored to specific emotions. Second, we use combinations of selection procedures, such as an alexithymia test and self- and external assessment, obtaining 1.5 k (proto-)typical samples; these were used in a perception test (86 native Italian subjects, categorical identification and dimensional rating). Third, machine learning techniques, based on standardised brute-forced openSMILE ComParE features and support vector machine classifiers, were applied to assess how emotional typicality and sample size might impact machine learning efficiency. Our results are three-fold as well: First, we show that appropriate induction techniques ensure the collection of valid samples, whereas the type of self-assessment employed turned out not to be a meaningful measurement. Second, emotional typicality, which shows up in an acoustic analysis of main prosodic features, in contrast to sample size, is not an essential feature for successfully training machine learning models. Third, the perceptual findings demonstrate that the confusion patterns mostly relate to cultural rules and to ambiguous emotions.
Conference Paper
Unsupervised representation learning shows high promise for generating robust features for acoustic scene analysis. In this regard, we propose and investigate a novel combination of features learnt using both a deep convolutional generative adversarial network (DCGAN) and a recurrent sequence-to-sequence autoencoder (S2SAE). Each of the representation learning algorithms is trained individually on spectral features extracted from audio instances. The learnt representations are: (i) the activations of the discriminator in the case of the DCGAN, and (ii) the activations of a fully connected layer between the decoder and encoder units in the case of the S2SAE. We then train two multilayer perceptron neural networks on the DCGAN and S2SAE feature vectors to predict the class labels. The individual predicted labels are combined in a weighted decision-level fusion to achieve the final prediction. The system is evaluated on the development partition of the acoustic scene classification data set of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). In comparison to the baseline, the accuracy is increased from 74.8 % to 86.4 % using only the DCGAN, to 88.5 % on the development set using only the S2SAE, and to 91.1 % after fusion of the individual predictions.
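The weighted decision-level fusion described above can be illustrated with a short sketch: two per-class probability vectors (one per model) are averaged with a weight, and the final label is the argmax of the fused vector. The weight `w` here is an assumption, typically tuned on a development partition:

```python
def fuse(p_dcgan, p_s2sae, w=0.5):
    """Weighted decision-level fusion of two per-class probability
    vectors; returns the index of the winning class."""
    fused = [w * a + (1 - w) * b for a, b in zip(p_dcgan, p_s2sae)]
    return max(range(len(fused)), key=fused.__getitem__)
```

With `w = 0.5` this is a simple average; skewing `w` towards the stronger model lets its decisions dominate while the weaker model still breaks near-ties.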
It is a difficult task to classify images with multiple class labels using only a small number of labeled examples, especially when the label (class) distribution is imbalanced. Emotion classification is one such example of imbalanced label distribution, because some emotion classes, such as disgusted, are relatively rare compared to others, such as happy or sad. In this paper, we propose a data augmentation method using generative adversarial networks (GAN). It can complement and complete the data manifold and find better margins between neighboring classes. Specifically, we design a framework using a CNN model as the classifier and a cycle-consistent adversarial network (CycleGAN) as the generator. In order to avoid the vanishing-gradient problem, we employ the least-squares loss as the adversarial loss. We also propose several evaluation methods on three benchmark datasets to validate the GAN's performance. Empirical results show that we can obtain a 5%–10% increase in classification accuracy after employing the GAN-based data augmentation techniques.
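The least-squares adversarial loss mentioned above (the LSGAN formulation) replaces the usual cross-entropy objective with squared errors against real/fake targets, which keeps gradients informative even for confidently classified samples. A minimal sketch on raw discriminator scores (pure Python; real GAN training would compute these over network outputs):

```python
def lsgan_d_loss(d_real, d_fake):
    """Discriminator objective: push scores on real samples to 1
    and scores on generated (fake) samples to 0."""
    n, m = len(d_real), len(d_fake)
    return (0.5 * sum((s - 1) ** 2 for s in d_real) / n
            + 0.5 * sum(s ** 2 for s in d_fake) / m)

def lsgan_g_loss(d_fake):
    """Generator objective: push the discriminator's scores on
    generated samples towards 1."""
    return 0.5 * sum((s - 1) ** 2 for s in d_fake) / len(d_fake)
```

Both losses vanish exactly when the respective player achieves its target scores, and grow quadratically with the distance from those targets rather than saturating.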
The paralinguistics of the voice are the perceived states and traits that make a voice unique to the human body from which it resonates. In many cases, a synthesized voice is produced from concatenated segments of recorded human speech, a complex process that can result in an arguably lifeless voice, lacking the capacity for free expression among other human qualities. In recent years, technology companies have been developing their own synthesized voice identities, yet seemingly pay little attention to the stereotypical traits being heard. Do such synthetic voice traits differ from the human traits they are modelled on? To explore this, the presented perception study, performed by 18 listeners, evaluated the paralinguistic traits of gender, age, and human likeness in the IBM voice library. Results herein show a similar trend to a previous study by the authors: no voice achieved complete human likeness, no voice was perceived within a single age frequency band, and none was tied solidly to its given binary gender, a novel finding, as commercially available synthesized voices are typically developed to operate within binary identification structures.
While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to the problem of unsupervised audio generation. Unlike for images, a barrier to success is that the best discriminative representations for audio tend to be non-invertible, and thus cannot be used to synthesize listenable outputs. In this paper, we introduce WaveGAN, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting. Our experiments on speech demonstrate that WaveGAN can produce intelligible words from a small vocabulary of human speech, as well as synthesize audio from other domains such as bird vocalizations, drums, and piano. Qualitatively, we find that human judges prefer the generated examples from WaveGAN over those from a method that naively applies GANs to image-like audio feature representations.