Synthesising Knocking Sound Effects Using Conditional WaveGAN
Adrián Barahona-Ríos
Department of Computer Science, University of York
ajbr501@york.ac.uk

Sandra Pauletto
Department of Media Technology and Interaction Design, KTH Royal Institute of Technology
pauletto@kth.se
ABSTRACT
In this paper we explore the synthesis of sound effects us-
ing conditional generative adversarial networks (cGANs).
We commissioned Foley artist Ulf Olausson to record a
dataset of knocking sound effects with different emotions
and trained a cGAN on it. We analysed the resulting syn-
thesised sound effects by comparing their temporal acous-
tic features to the original dataset and by performing an
online listening test. Results show that the acoustic fea-
tures of the synthesised sounds are similar to those of the
recorded dataset. Additionally, the listening test results
show that the synthesised sounds can be identified by peo-
ple with experience in sound design, but the model is not
far from fooling non-experts. Moreover, on average most
emotions can be recognised correctly in both recorded and
synthesised sounds. Given that the temporal acoustic fea-
tures of the two datasets are highly similar, we hypothesise
that they strongly contribute to the perception of the in-
tended emotions in the recorded and synthesised knocking
sounds.
1. INTRODUCTION
The sound design of video games and interactive media mostly relies on pre-recorded samples. In order to avoid repe-
tition when a user repeats an action, several sound effects
are used for the same action. This creates an overhead in
terms of asset management and implementation time. As
opposed to pre-recorded samples, procedural audio, in the
context of video games, refers to the use of audio synthesis
during gameplay [1]. Among its benefits, sound synthesis
can generate infinite variations of sound effects, adapting
the sound to the game parameters.
Knocking sound effects are a key element in storytelling:
they are often used as a transition element. For example,
in films a knocking action can express the emotions of the
person knocking at the door, as well as create expectations
in the audience about the possible reactions of a character
hearing the knock. Knocking sound effects are an interest-
ing subject of study for sound synthesis because they have
a frequency-domain component (the synthesis of the individual knocks) as well as a highly articulated time-domain component (how the knocks are arranged in time).
Knocking sound effects are composed of one or more impact sounds against a surface. While impact sounds are
often synthesised using modal synthesis [2], novel archi-
tectures in machine learning, and more specifically deep
learning, open the possibility of synthesising sounds using
alternative methods. The technique of synthesising audio
using deep learning is often called neural audio synthe-
sis [3]. In the context of deep learning, generative mod-
els are those that learn the statistical latent space from a
dataset and create new data by sampling from it [4]. In
other words, generative models can generate new unseen
data by learning from a training dataset, and the new gen-
erated data will have similar characteristics to those of the
training dataset.
GANs [5] have two components: a discriminator (also
called critic when using Wasserstein [6] or WGAN-GP [7])
and a generator. The discriminator (or critic) is a neural
network with the goal of predicting whether a sample is
real or fake. The generator is also a neural network; it starts from random values (usually noise sampled from a Gaussian distribution) and updates its parameters until it is able to create an output good enough to fool the discriminator. The gen-
eration from GANs is not conditioned. cGANs [8] are an extension to GANs that allows additional data (such as a class label) to be introduced to further condition the generation of the
model. For instance, a GAN generator trained on a corpus
of footstep sound effects on different surfaces will output
a footstep each time on a different random surface. How-
ever, if a surface label is added to the model, the generator
can be controlled to output a sound on a particular surface.
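To make the conditioning mechanism concrete, the following minimal Keras sketch (illustrative only, not the architecture used in this paper) embeds a class label and concatenates it with the noise vector before the generator layers; the class count, latent size and layer sizes are placeholder assumptions.

import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4    # e.g. number of surface types (hypothetical)
LATENT_DIM = 100   # size of the random noise vector z

def build_toy_conditional_generator():
    # Inputs: a noise vector and an integer class label.
    z = layers.Input(shape=(LATENT_DIM,), name="noise")
    label = layers.Input(shape=(1,), dtype="int32", name="class_label")

    # Embed the label and concatenate it with the noise, so the label
    # conditions everything the generator produces downstream.
    label_vec = layers.Flatten()(layers.Embedding(NUM_CLASSES, 50)(label))
    x = layers.Concatenate()([z, label_vec])

    # A deliberately tiny stack standing in for the real generator layers.
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(1024, activation="tanh")(x)  # placeholder output
    return tf.keras.Model([z, label], out)

gen = build_toy_conditional_generator()
noise = tf.random.normal([1, LATENT_DIM])
surface = tf.constant([[2]], dtype=tf.int32)  # request class index 2
sample = gen([noise, surface])                # output conditioned on the label

Feeding the same noise with different labels then yields label-consistent outputs, which is the behaviour described in the footstep example above.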
GANs have been used in tasks such as computer vision, to generate ultra-realistic human face images [9], or procedural content generation, to generate coloured 3D meshes
[10]. Regarding the use of GANs for audio synthesis, dif-
ferent architectures have been proposed, such as Wave-
GAN [11] and GANSynth [12]. WaveGAN learns directly
from raw audio data (sound files) and produces raw au-
dio data. WaveGAN is based on DCGAN (deep convo-
lutional GAN) [13], but using one-dimensional convolu-
tions instead of two-dimensional convolutions in order to
process the raw audio data (as opposed to images). GAN-
Synth uses an architecture similar to progressive GAN [14]
and different audio representations (such as spectrograms
and mel-spectrograms) instead of raw audio. GANSynth
reported better scores than WaveGAN on human percep-
tual evaluation for musical notes with pitch conditioning.
While GANs do not synthesise audio in real-time, the gen-
eration time is remarkably fast (especially for batches of
sounds thanks to parallelisation).
We decided to use the WaveGAN architecture given its
simplicity and the good perceptual results it produces when
synthesising drum samples in a virtual environment [15],
which is a scenario close to the synthesis of knocking sound effects in a video game. Pilot studies exist that aim to condition the generation of WaveGAN for speech synthesis [16], although the authors found the generated audio files too noisy. Our objective is to use the
conditional WaveGAN architecture to explore the synthe-
sis of knocking sound effects with different emotional in-
tentions and analyse the results in terms of synthesis plau-
sibility, acoustic features and emotion portrayed.
2. KNOCKING SOUND EFFECTS DATASET
In order to train the cGAN to synthesise the sound ef-
fects, we asked the professional Foley artist Ulf Olaus-
son to record a dataset of knocking sound effects at the
FoleyWorks studios in Stockholm¹. Inspired by previous
work on knocking sounds [17], we chose five basic emo-
tions [18] to be portrayed in the dataset: anger, fear, hap-
piness, neutral and sadness. We gave the Foley artist the
following scenarios to perform the knocking actions:
Anger: telling a flatmate for the 4th time to turn
down the very loud music.
Fear: alerting a neighbour of a possible risk.
Happiness: telling a flatmate that they won a prize.
Neutral: parcel delivery.
Sadness: telling a friend that someone passed away.
We also asked the Foley artist to perform diverse inter-
pretations of the provided scenarios in order to produce a
wider variety of sounds. The dataset was recorded with a Rode NT1 microphone, with the knocks performed on a closed door (Figure 1).
We recorded a total of 600 knocking actions (120 ac-
tions per category). An action is a sequence of individual
knocks. We discarded 20 actions per category to filter out
unwanted noise. The final 500 audio files were trimmed
so each file started on the first knock onset and finished on
the last knock decay. The dataset can be accessed from the
online repository [19].
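The paper does not describe how the trimming was performed; as a rough sketch of the idea, an energy-threshold trim (here with librosa, and an assumed threshold) removes the material before the first onset and after the last decay.

import librosa
import soundfile as sf

def trim_knocking_action(in_path, out_path, top_db=40):
    # Keep roughly the region from the first knock onset to the last knock
    # decay by trimming leading/trailing near-silence (top_db is an assumption).
    y, sr = librosa.load(in_path, sr=None)  # keep the original sample rate
    trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, trimmed, sr)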
2.1 Acoustic Analysis
Research in music performance and speech [20] has demonstrated that sounds performed with different intended emotions present emotion-specific acoustic patterns. Therefore,
in order to understand the variations of the knocking pat-
terns for different emotions, and to compare the original
¹ http://foleyworks.se/
Figure 1: Microphone placement during the recording.
dataset to the synthesised sounds, we extracted different
acoustic features from the dataset. The different features
used for the analysis are the following:
Figure 2: Example of the root-mean-square energy feature. The X axis represents the knocking action duration in seconds; the Y axis represents the RMSE of the individual knocks. The individual knock positions in the action are represented by the black dots. The slope of the fitted line (in red) is negative, therefore the action has a decrescendo energy pattern.
Action duration: Length of the knocking action. The
knocking action length is the time passed from the
first knock onset to the last knock decay.
Number of knocks per action: The number of knocks was retrieved by counting the number of onsets detected in each audio file.
Knocking rate: The knocking rate was retrieved by
dividing the number of knocks in an action by the
total time of the action. This feature is applied only
to actions with 2 or more knocks.
Knocking regularity: The knocking regularity mea-
sures how regular the knocks are in an action. Ac-
tions where the knocks are performed at a steady
pace will show a higher regularity. To extract this
feature, we calculated the inter-onset interval (IOI)
of each action and computed the coefficient of vari-
ation (the standard deviation divided by the mean).
Irregular actions will have a higher coefficient of vari-
ation. This feature is applied only to actions with
more than 2 knocks.
Root-mean-square energy (RMSE) slope: This re-
trieves the energy pattern (crescendo or decrescendo)
of an action. We calculated the root-mean-square
energy of each individual knock and applied a linear
regression to each action. The slope of the fitted line
determines the energy crescendo (positive values) or
decrescendo (negative values) of the action. An ex-
ample of this feature applied to an individual action
can be seen in Figure 2. This feature is applied only
to actions with 2 or more knocks.
These features were utilised by the authors in a parallel study on the perception of emotion in knocking actions by naive performers [21], in order to allow future comparison of the results.
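As a sketch of how these temporal features can be extracted from a trimmed audio file (the onset detection settings below are assumptions; the paper does not specify the exact extraction parameters):

import numpy as np
import librosa

def knocking_action_features(path):
    y, sr = librosa.load(path, sr=22050)

    # One onset time (in seconds) per individual knock.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time", backtrack=True)
    n_knocks = len(onsets)

    # Action duration: the files are trimmed from first onset to last decay.
    duration = len(y) / sr

    # Knocking rate: knocks per second (actions with 2 or more knocks).
    rate = n_knocks / duration if n_knocks >= 2 else None

    # Knocking regularity: coefficient of variation of the inter-onset intervals
    # (actions with more than 2 knocks). Higher values indicate less regularity.
    regularity = None
    if n_knocks > 2:
        ioi = np.diff(onsets)
        regularity = np.std(ioi) / np.mean(ioi)

    # RMSE slope: RMS energy of each knock segment, fitted with a line over the
    # onset times. Positive slope = crescendo, negative slope = decrescendo.
    rmse_slope = None
    if n_knocks >= 2:
        bounds = librosa.time_to_samples(onsets, sr=sr)
        rms = []
        for i in range(n_knocks):
            end = bounds[i + 1] if i + 1 < n_knocks else len(y)
            seg = y[bounds[i]:end]
            rms.append(np.sqrt(np.mean(seg ** 2)))
        rmse_slope = np.polyfit(onsets, rms, 1)[0]

    return {"duration": duration, "n_knocks": n_knocks, "rate": rate,
            "regularity": regularity, "rmse_slope": rmse_slope}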
3. METHOD
3.1 Sound Synthesis
As mentioned, we used a conditional WaveGAN architecture to synthesise the audio. We conditioned the architecture to accept an input label corresponding to each knocking action's emotion. As the longest knocking action in the recorded dataset is 2.85 seconds, we used the largest version of the architecture, which allows the processing of 65536 samples. We used a sampling rate of 22050 Hz to fit all knocking actions in the architecture (the longest action has 62843 samples at the 22050 Hz sampling rate).
We ran a series of pilot experiments with different hyper-
parameters. While we used most of the hyper-parameters
proposed in the WaveGAN paper, we found phase shuffle
worsened the output of our model, therefore we discarded
it. We judged the performance of the model subjectively by synthesising one knocking sound effect per emotion every 200 batches of the training loop. After the pilot
experiments, we set the critic loops to 5, the learning rate
(for both the critic and the generator) to 0.0002 and the
batch size to 128. We used WGAN-GP, batch normalisa-
tion, Adam optimiser and no phase shuffle. We trained the
model on an NVIDIA Tesla V100 GPU for 72 hours.
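For reference, the sketch below shows the WGAN-GP objective together with the training configuration reported above (TensorFlow). The gradient penalty weight and the Adam β values are assumptions taken from common WGAN-GP practice [7], as they are not reported here, and the critic is assumed to take the waveform and the emotion label as inputs.

import tensorflow as tf

# Training configuration reported in the text.
N_CRITIC = 5           # critic updates per generator update
LEARNING_RATE = 2e-4
BATCH_SIZE = 128
GP_WEIGHT = 10.0       # gradient penalty weight (assumed, default in [7])

critic_optimizer = tf.keras.optimizers.Adam(LEARNING_RATE, beta_1=0.5, beta_2=0.9)
generator_optimizer = tf.keras.optimizers.Adam(LEARNING_RATE, beta_1=0.5, beta_2=0.9)

def gradient_penalty(critic, real_audio, fake_audio, labels):
    # Penalise deviations of the critic's gradient norm from 1 on random
    # interpolations between real and generated waveforms (WGAN-GP).
    eps = tf.random.uniform([tf.shape(real_audio)[0], 1, 1], 0.0, 1.0)
    interpolated = eps * real_audio + (1.0 - eps) * fake_audio
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic([interpolated, labels], training=True)
    grads = tape.gradient(scores, interpolated)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2]) + 1e-12)
    return tf.reduce_mean((norms - 1.0) ** 2)

def critic_loss(real_scores, fake_scores, gp):
    # Wasserstein critic loss plus the gradient penalty term.
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores) + GP_WEIGHT * gp

def generator_loss(fake_scores):
    # The generator maximises the critic's score on its output.
    return -tf.reduce_mean(fake_scores)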
We synthesised a total of 500 sounds (100 sounds per
emotion) to compare the synthesised sounds to the original
dataset. We did not hand-pick the best-sounding synthesised samples: they are the raw output from the model. The synthesis of the 500 sound effects took a total of 30 seconds on an NVIDIA GTX 1060 GPU.
Our Keras implementation of the conditional WaveGAN
architecture is available in the online repository².
² Code: https://github.com/adrianbarahona/conditional_wavegan_knocking_sounds
3.2 Perceptual Evaluation
To understand how the sounds are perceived, we conducted an online listening test. The listening test had two objectives:
to determine whether or not the synthesised sound effects
are distinguishable from the recorded sounds in the dataset,
and to learn whether or not the intended emotion of the
knocking action (recorded or synthesised) could be identi-
fied.
To learn how the synthesised sounds are perceived compared to the recorded dataset, we designed our listening test following the guidelines of the RS (real and synthetic) listening test [22]. The objective of the test is to
determine how distinguishable the synthesised sounds are
from the recorded ones. The test was done online. Partici-
pants were asked to use headphones. Overall the listening
conditions were considered to be similar to those of a per-
son playing a video game using their own equipment. As
a control mechanism, we introduced an ‘acid’ (a clearly
synthesised sound) sample as suggested by the RS listen-
ing test guidelines.
The metrics used from the RS listening test were the discriminator factor d and the F-measure. The discriminator factor d is defined as:

d = (P_CS - P_FP + 1) / 2    (1)

where P_CS is the percentage of the synthetic samples correctly labeled as synthetic and P_FP is the percentage of the recorded samples labeled as synthetic (false positives). The F-measure is defined as:

F-measure = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)    (2)

where precision and recall are defined as:

Precision = P_CS / (P_CS + P_FP)    (3)

Recall = P_CS / (P_CS + P_FN)    (4)

with P_FN being the percentage of synthetic sounds labeled as recorded and β = 1.
In terms of the interpretation of d and the F-measure, values below 0.5 are no different from random guessing, and
values close to 1.0 show the recorded and synthetic sounds
are clearly distinguishable from each other. Values below
0.75 show there is no clear distinction between recorded
and synthetic sounds.
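For clarity, the metrics defined above can be computed with a small helper such as the one below; the input values in the example are illustrative, not the study's data.

def rs_test_metrics(p_cs, p_fp, p_fn, beta=1.0):
    # p_cs: synthetic samples correctly labeled as synthetic
    # p_fp: recorded samples labeled as synthetic (false positives)
    # p_fn: synthetic samples labeled as recorded (false negatives)
    # All values as proportions in [0, 1].
    d = (p_cs - p_fp + 1.0) / 2.0
    precision = p_cs / (p_cs + p_fp)
    recall = p_cs / (p_cs + p_fn)
    f_measure = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return d, f_measure

# Hypothetical participant: 70% of synthetic sounds labeled as synthetic,
# 30% of recorded sounds labeled as synthetic, 30% of synthetic labeled as recorded.
print(rs_test_metrics(0.70, 0.30, 0.30))  # -> (0.7, 0.7)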
To measure how the emotions are perceived in both the
recorded dataset and in the synthesised sounds, we per-
formed a Chi-square test on the emotion labeling by the
participants.
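A sketch of such an omnibus test with SciPy is shown below; the contingency table holds illustrative placeholder counts, and the pairwise column comparisons with Bonferroni-adjusted z-tests reported in Section 4 are a separate post-hoc procedure not reproduced here.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 5x5 contingency table: rows = intended emotion,
# columns = perceived emotion, cells = response counts (placeholder values).
observed = np.array([
    [68, 12, 10, 10,  0],   # anger
    [60, 28,  5,  5,  2],   # fear
    [ 4,  2, 64, 27,  3],   # happiness
    [ 5,  6, 15, 63, 11],   # neutral
    [ 2, 10,  2, 29, 57],   # sadness
])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.3f}, p = {p:.4g}")  # dof = 16 for a 5x5 table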
We used the online platform Qualtrics³ to carry out the
listening test. Each participant was presented with a total
of 51 sounds. 25 sounds were recorded (5 per emotion),
25 sounds were synthesised (5 per emotion) and 1 sound
was the acid test. The selection of sounds, and the order
³ https://www.qualtrics.com
in which they were presented, were randomised for each
participant. The sounds were presented in their original
format: 48000 Hz/16-bit WAV for the recorded sounds and 22050 Hz/32-bit WAV for the synthesised sounds. Again, the model uses a sampling rate of 22050 Hz to fit all the
knocking actions in the architecture. While the difference
in sampling rates between the recorded and synthesised
sounds could affect the perceptual evaluation, we decided
to maintain the original sampling rates given that it bet-
ter represents the performance of the model against pre-
recorded sounds.
The participants were asked to label each sound as ‘recorded’
or ‘synthesised’ and to choose the emotion each sound rep-
resents the most (Figure 3). Participants were also asked
their level of expertise in sound design, ranked from 1 (no
expertise) to 5 (professional).
Figure 3: Online test interface.
4. RESULTS
Sound examples of both the recorded dataset and the synthesised sounds can be found on the project webpage⁴.
4.1 Comparison of Recorded and Synthesised Sounds' Acoustic Features
The acoustic analysis of both the recorded dataset and the
synthesised sounds is shown in Figure 4. When compar-
ing the analysed features for both the recorded dataset (left
column, Figures 4a, 4c, 4e, 4g and 4i) and the synthesised
sounds (right column, Figures 4b, 4d, 4f, 4h and 4j), it
is clear that the synthesised sounds retain the features from
the recorded dataset. While their distributions are not iden-
tical, synthesised sounds follow the recorded dataset trend
in all the analysed features.
4.2 Listening Test Results
A total of 22 participants (17 males, 5 females) with ages
from 18 to 58 took the online listening test. 14 of the par-
ticipants reported no experience in sound design, 2 partic-
ipants a level of 2 out of 5, 3 participants a level of 3 out
of 5, 2 participants a level of 4 out of 5 and 1 participant a
level of 5 out of 5.
One participant failed to label the ‘acid’ sample correctly
and therefore was removed from the evaluation completely.
⁴ Sound examples: https://www.adrianbarahonarios.com/conditional_wavegan_knocking_sounds/
Another participant did not label the emotion of one sound
file and therefore was removed from the evaluation of the
emotion labeling.
The average values of the RS listening test can be seen in
Table 1. The raw percentage of participants who correctly
labeled the recorded and synthesised sounds can be seen in
Table 2. A visualisation of the average RS test results, with the individual scores separated by the participants' level of expertise in sound design, is shown in Figure 5. While the RS listening test results are on average below
the 0.75 threshold, synthesised samples can definitely be
identified by persons with medium to high experience in
sound design. However, some people with low experience
in sound design are unable to detect them.
The raw percentage of the participant emotion labeling
for the recorded dataset is shown in Table 3. We performed
a Chi-Square Test with χ2(16, N = 500) = 564.341, p =
.000 < .05 (columns compared with a z-test and p val-
ues adjusted with Bonferroni method). Results show that
anger, happiness, neutral and sadness are statistically different from the other emotions. Anger and fear are not statistically dif-
ferent from each other, but they are statistically different
from the other emotions. This indicates that fear is con-
fused with anger. For the synthesised sounds, the raw per-
centage of the participant emotion labeling is shown in Ta-
ble 4. We performed a Chi-Square Test with χ2(16, N =
500) = 376.371, p =.000 < .05 (columns compared with
a z-test and p values adjusted with Bonferroni method).
As with the recorded emotion labeling results, the Chi-Square Test shows that anger, happiness, neutral and sadness are statistically different from the other emotions, and that anger and fear are not statistically different from each other but are statistically different from the other emotions.
5. DISCUSSION AND FUTURE WORK
This paper explored the synthesis of knocking sound ef-
fects with emotional intention using a conditional Wave-
GAN architecture. We analysed the results in terms of syn-
thesis plausibility (how distinguishable the synthesised samples are from the original dataset), acoustic features and
emotion.
Results show that the model is close to being able to synthesise samples that would be indistinguishable from their recorded counterparts for people without experience in sound design. However, synthesised sounds are easily identifiable by people with experience in sound design. To ex-
plore how to improve the model perceptually, we plan, in
future work, to use the GANSynth architecture to compare
it with the conditional WaveGAN architecture on the syn-
thesis of knocking sound effects. GANSynth already re-
ported better perceptual results on musical notes with pitch
conditioning, and might produce better results in the syn-
thesis of sound effects too.
In terms of the intended (and labelled) emotions, results
show that, on average, most emotions can be recognised
correctly in both the recorded and the synthesised sound
groups. In both groups fear is confused with anger. This
means that, despite the fact that recorded and synthesised
sounds can be distinguished, the emotional characteristics
Figure 4: Recorded dataset and synthesised sounds analysis results. The left column shows the results for the recorded dataset and the right column the results for the synthesised sounds: (a, b) action duration per emotion; (c, d) number of knocks in each action; (e, f) knocking rate per action per emotion; (g, h) knocking regularity per action per emotion (higher values indicate less regularity); (i, j) slope of the actions' RMS energy.
Avg d | σ | σ² | Avg F-measure | σ | σ²
0.63 | 0.31 | 0.09 | 0.67 | 0.24 | 0.06
Table 1: RS listening test d and F-measure values across all participants.
Emotion | Correctly labeled as recorded | Correctly labeled as synthesised
Anger | 73.3% | 60.9%
Fear | 72.4% | 60.0%
Happiness | 80.9% | 63.4%
Neutral | 68.6% | 65.7%
Sadness | 60.9% | 65.7%
Table 2: Real and synthesised labeling by the participants (raw results).
[Perceived] Anger [Perceived] Fear [Perceived] Happiness [Perceived] Neutral [Perceived] Sadness
[Intended] Anger 68.0% 12.0% 10.0% 10.0% 0.0%
[Intended] Fear 60.0% 28.0% 5.0% 5.0% 2.0%
[Intended] Happiness 4.0% 2.0% 64.0% 27.0% 3.0%
[Intended] Neutral 5.0% 6.0% 15.0% 63.0% 11.0%
[Intended] Sadness 2.0% 10.0% 2.0% 29.0% 57.0%
Table 3: Recorded dataset intended and perceived emotion labeling by the participants.
[Perceived] Anger [Perceived] Fear [Perceived] Happiness [Perceived] Neutral [Perceived] Sadness
[Intended] Anger 47.0% 19.0% 8.0% 18.0% 8.0%
[Intended] Fear 40.0% 44.0% 9.0% 7.0% 0.0%
[Intended] Happiness 1.0% 12.0% 44.0% 37.0% 6.0%
[Intended] Neutral 6.0% 6.0% 17.0% 62.0% 9.0%
[Intended] Sadness 0.0% 16.0% 3.0% 34.0% 47.0%
Table 4: Synthesised sounds intended and perceived emotion labeling by the participants.
Figure 5: d and F-measure values across all participants
and emotions (box plot) and individual scores for each par-
ticipant (swarm plot). The level of expertise in sound de-
sign of each participant is represented by the dot color.
of the recorded dataset are present in the synthesised sounds.
Since the acoustic features of the synthesised dataset are
very similar to those of the recorded dataset, we hypothe-
sise that they strongly contribute to the perception of emo-
tions, and that they are sufficiently well modelled in the
synthesised dataset to allow listeners to perceive the in-
tended emotions as well as they are in the recorded dataset.
Additionally, the acoustic feature distributions are not exactly the same in the two datasets. This is actually desir-
able considering that the aim is to generate new data and
not to re-synthesise data that was already in the training
dataset. However, diversity of newly generated data is chal-
lenging when using generative methods, especially for au-
dio [23]. In the case of the synthesised knocking sound effects, upon subjective qualitative evaluation, some can be traced back to the original recorded sounds on which the model was trained. In future work we plan to tackle this problem.
Regarding the control over the synthesised samples, the model is capable of synthesising knocking sound effects with emotional conditioning. We plan to work on further conditioning the generation of the sounds with other features (such as the features used for the analysis).
In terms of applications, this approach could be used in
video games and post-production. While sound synthesis using GANs is not done in real time (GANs synthesise
one or several sounds at a time, not sample by sample),
the sound generation can be fast enough to be used in a
real-time scenario without perceived latency [15].
Acknowledgments
This work was supported by the NordicSMC Network (Project
Number: 86892), the Department of Media Technology
and Interaction Design of KTH and the EPSRC Centre
for Doctoral Training in Intelligent Games & Game Intel-
ligence (IGGI) [EP/L015846/1].
6. REFERENCES
[1] A. Farnell, "An Introduction to Procedural Audio and Its Application in Computer Games," in Audio Mostly Conference, 2007, pp. 1–31.
[2] P. R. Cook, Real Sound Synthesis for Interactive Applications. Natick, Massachusetts: A K Peters, 2002.
[3] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, "Neural Audio Synthesis of Musical Notes With WaveNet Autoencoders," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1068–1077.
[4] F. Chollet, Deep Learning with Python, 1st ed. USA: Manning Publications Co., 2017, p. 270.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[7] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved Training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[8] M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets," arXiv preprint arXiv:1411.1784, 2014.
[9] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and Improving the Image Quality of StyleGAN," arXiv preprint arXiv:1912.04958, 2019.
[10] R. Spick, S. Demediuk, and J. Alfred Walker, "Naive Mesh-to-Mesh Coloured Model Generation using 3D GANs," in Proceedings of the Australasian Computer Science Week Multiconference, 2020, pp. 1–6.
[11] C. Donahue, J. McAuley, and M. Puckette, "Adversarial Audio Synthesis," in ICLR, 2019.
[12] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial Neural Audio Synthesis," arXiv preprint arXiv:1902.08710, 2019.
[13] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks," arXiv preprint arXiv:1511.06434, 2015.
[14] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation," arXiv preprint arXiv:1710.10196, 2017.
[15] M. Chang, Y. R. Kim, and G. J. Kim, "A Perceptual Evaluation of Generative Adversarial Network Real-Time Synthesized Drum Sounds in a Virtual Environment," in 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR). IEEE, 2018, pp. 144–148.
[16] C. Y. Lee, A. Toffy, G. J. Jung, and W.-J. Han, "Conditional WaveGAN," arXiv preprint arXiv:1809.10636, 2018.
[17] R. Vitale and R. Bresin, "Emotional Cues in Knocking Sounds," in 10th International Conference on Music Perception and Cognition, Sapporo, Japan, August 25-29, 2008, p. 276.
[18] P. Ekman, "Basic Emotions," Handbook of Cognition and Emotion, vol. 98, no. 45-60, p. 16, 1999.
[19] A. Barahona-Ríos and S. Pauletto, "Knocking Sound Effects With Emotional Intentions," Feb. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3668503
[20] P. N. Juslin and P. Laukka, "Communication of emotions in vocal expression and music performance: Different channels, same code?" Psychological Bulletin, vol. 129, no. 5, p. 770, 2003.
[21] M. Houel, A. Arun, A. Berg, A. Iop, A. Barahona-Ríos, and S. Pauletto, "Perception of Emotions in Knocking Sounds: An Evaluation Study," in 17th Sound and Music Computing Conference, Online, 2020.
[22] L. Gabrielli, S. Squartini, and V. Välimäki, "A Subjective Validation Method for Musical Instrument Emulation," in AES 131st Convention, New York, USA, 2011.
[23] J. M. Antognini, M. Hoffman, and R. J. Weiss, "Audio Texture Synthesis With Random Neural Networks: Improving Diversity and Quality," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3587–3591.