Figure - uploaded by Jon Fagerström
Source publication
A filtering algorithm for generating subtle random variations in sampled sounds is proposed. Using only one recording for impact sound effects or drum machine sounds results in unrealistic repetitiveness during consecutive playback. This paper studies spectral variations in repeated knocking sounds and in three drum sounds: a hihat, a snare, and a...
Similar publications
Enhancing the sound quality of historical music recordings is a long-standing problem. This paper presents a novel denoising method based on a fully-convolutional deep neural network. A two-stage U-Net model architecture is designed to model and suppress the degradations with high fidelity. The method processes the time-frequency representation of...
Audio bandwidth extension aims to expand the spectrum of bandlimited audio signals. Although this topic has been broadly studied during recent years, the particular problem of extending the bandwidth of historical music recordings remains an open challenge. This paper proposes a method for the bandwidth extension of historical music using generativ...
This study investigates the vinyl revival, with particular focus given to the listener's perception of audio quality. A new album was produced using known source material. Subjects then participated in a series of double-blind listening tests, comparing vinyl to established digital formats. Subsequent usability tests required subjects not only to r...
Citations
... Furthermore, parametric timbre control could be used in sound design applications or to create stimuli for perceptual studies where independent control over individual features is beneficial and typically relies on additive synthesis [30]. Creation of meaningful variations in sounds is another active area of research for drum one-shots [15] and sound effects for video games [51]. Timbre variations could be learned in a data-driven manner from sample libraries, for instance, by creating differentiable implementations of procedural synthesis methods [40]. ...
... Referring to the effect of re-triggering a recorded sample repeatedly, sometimes called a "machine-gun effect" [15] ...
Timbre is a primary mode of expression in diverse musical contexts. However, prevalent audio-driven synthesis methods predominantly rely on pitch and loudness envelopes, effectively flattening timbral expression from the input. Our approach draws on the concept of timbre analogies and investigates how timbral expression from an input signal can be mapped onto controls for a synthesizer. Leveraging differentiable digital signal processing, our method facilitates direct optimization of synthesizer parameters through a novel feature difference loss. This loss function, designed to learn relative timbral differences between musical events, prioritizes the subtleties of graded timbre modulations within phrases, allowing for meaningful translations in a timbre space. Using snare drum performances as a case study, where timbral expression is central, we demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.
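The "feature difference loss" described in this abstract can be illustrated with a minimal sketch: the stated idea is to penalise mismatches in relative timbral change between musical events rather than in absolute feature values. The function below is an illustrative formalisation under that reading, not the paper's implementation; the feature vectors, reference events, and the mean-squared form are all assumptions.

```python
def feature_difference_loss(feat_in, ref_in, feat_out, ref_out):
    """Sketch of a feature-difference loss: match the *change* in timbre
    features relative to a reference event, rather than absolute values,
    so graded modulations within a phrase are preserved. All arguments
    are equal-length lists of scalar timbre features (hypothetical)."""
    assert len(feat_in) == len(ref_in) == len(feat_out) == len(ref_out)
    loss = 0.0
    for fi, ri, fo, ro in zip(feat_in, ref_in, feat_out, ref_out):
        d_in = fi - ri    # timbral change in the input performance
        d_out = fo - ro   # timbral change in the synthesizer output
        loss += (d_in - d_out) ** 2
    return loss / len(feat_in)
```

In the differentiable-DSP setting the abstract describes, a quantity like this would be minimised by gradient descent over the synthesizer parameters, which is what makes a relative (difference-based) formulation usable for timbre remapping.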
... Another application of velvet noise is the decorrelation of audio signals [9][10][11][12], a process that reduces the correlation between signals [13,14] and allows, for example, distributing processed copies of a mono signal to multiple loudspeakers to produce a diffuse sound field [15,16]. Velvet noise has also been used in sound [17][18][19][20] and speech synthesis [21,22]. ...
Velvet noise, a sparse pseudo-random signal, finds valuable applications in audio engineering, such as artificial reverberation, decorrelation filtering, and sound synthesis. These applications rely on convolution operations whose computational requirements depend on the length, sparsity, and bit resolution of the velvet-noise sequence used as filter coefficients. Given the inherent sparsity of velvet noise and its occasional restriction to a few distinct values, significant computational savings can be achieved by designing convolution algorithms that exploit these unique properties. This paper shows that an algorithm called the transposed double-vector filter is the most efficient way of convolving velvet noise with an audio signal. This method optimizes access patterns to take advantage of the processor's fast caches. The sequential sparse algorithm is shown to be always faster than the dense one, and the speedup is linearly dependent on sparsity. The paper also explores the potential for further speedup on multicore platforms through parallelism and evaluates the impact of data encoding, including 16-bit and 32-bit integers and 32-bit floating-point representations. The results show that using the fastest implementation of a long velvet-noise filter, it is possible to process more than 40 channels of audio in real time using the quad-core processor of a modern system-on-chip.
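The computational saving this abstract refers to comes from the structure of velvet noise itself: the filter coefficients are mostly zeros with sparse +1/-1 pulses, so convolution reduces to additions and subtractions at the pulse locations. The plain-Python sketch below shows only this basic principle; the paper's transposed double-vector filter is a cache-optimised refinement of the same idea and is not reproduced here.

```python
import random

def make_velvet_noise(length, density, fs=44100, seed=0):
    """Generate a classic velvet-noise sequence: one +/-1 pulse per grid
    interval of Td = fs/density samples, at a random position within it."""
    rng = random.Random(seed)
    td = fs / density                       # average pulse spacing in samples
    seq = [0.0] * length
    m = 0
    while True:
        pos = int(m * td + rng.random() * (td - 1))
        if pos >= length:
            break
        seq[pos] = 1.0 if rng.random() < 0.5 else -1.0
        m += 1
    return seq

def sparse_convolve(signal, velvet):
    """Convolve using only the nonzero +/-1 taps: each output sample is a
    short sum/difference of input samples, with no multiplications."""
    taps = [(i, v) for i, v in enumerate(velvet) if v != 0.0]
    out = [0.0] * (len(signal) + len(velvet) - 1)
    for n, x in enumerate(signal):
        for i, v in taps:
            out[n + i] += x if v > 0 else -x
    return out
```

At a typical density of 2205 pulses per second and a 44.1 kHz sample rate, only one tap in twenty is nonzero, so the sparse loop does about 5% of the work of a dense convolution, consistent with the linear dependence of the speedup on sparsity reported above.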
... Apart from SpecSinGAN, there are other approaches to audio synthesis that train on reduced datasets, such as [38], which generates longer sequences from just 20 seconds of training data. Concerning non-deep-learning, pure digital signal processing (DSP) approaches, [39,40] generate variations of pre-recorded percussive sounds. We use SpecSinGAN because 1) it allows us to train on any single one-shot sound effect, chosen according to the specifications of the project; 2) we do not need to rely on finding a large dataset of sounds that may be scarce; 3) SpecSinGAN generates short variations we can continuously trigger to produce an adaptive soundscape. ...
In this paper we present a system for the sonification of the electricity drawn by different household appliances. The system uses SpecSinGAN as the basis for the sound design, which is an unconditional generative architecture that takes a single one-shot sound effect (e.g., a fire crackle) and produces novel variations of it. SpecSinGAN is based on single-image generative adversarial networks that learn from the internal distribution of a single training example (in this case the spectrogram of the sound file) to generate novel variations of it, removing the need for a large dataset. In our system, we use a Python script on a Raspberry Pi to receive the data on the electricity drawn by an appliance via a smart plug. The data is then sent to a Pure Data patch via Open Sound Control. The electricity drawn is mapped to the sound of fire, which is generated in real time using Pure Data by mixing different variations of four fire sounds (a fire crackle, a low-end fire rumble, a mid-level rumble, and a hiss), which were synthesised offline by SpecSinGAN. The result is a dynamic fire sound that is never the same and that grows in intensity with the electricity consumption. The density of the crackles and the level of the rumbles increase with the electricity consumption. We pilot tested the system in two households and with different appliances. Results confirm that, from a technical standpoint, the sonification system responds as intended, and that it provides an intuitive auditory display of the energy consumed by different appliances. In particular, this sonification is useful in drawing attention to "invisible" energy consumption. Finally, we discuss these results and future work.
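The mapping described above (crackle density and rumble levels rising with power draw) can be sketched as a simple normalising function. The function name, parameter ranges, and curve shapes below are illustrative assumptions, not the authors' actual Pure Data mapping.

```python
def map_power_to_fire(power_w, max_w=3000.0):
    """Hypothetical mapping from instantaneous power draw (watts) to fire
    sound parameters: crackle density and rumble levels rise with power."""
    x = max(0.0, min(1.0, power_w / max_w))   # normalise to [0, 1]
    return {
        "crackle_density": 0.5 + 9.5 * x,     # crackles per second
        "low_rumble_level": x,                # linear gains in [0, 1]
        "mid_rumble_level": x ** 1.5,
        "hiss_level": x ** 2,
    }
```

Shaping the mid rumble and hiss with steeper curves is one way to make the texture grow in intensity rather than merely in loudness; the specific exponents here are arbitrary.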
... Using the same audio sample repetitively, e.g. in a game setting, often deteriorates the quality of the perceptual experience. Therefore it is desirable to have many variations of a particular sound type for developing scenes and actions [4,5]. This happens to be an area of strength for generative models, as they are capable of generating high-quality and diverse sounds that are conditioned on different classes [6,7]. ...
Over recent years generative models utilizing deep neural networks have demonstrated outstanding capacity in synthesizing high-quality and plausible human speech and music. The majority of research in neural audio synthesis (NAS) targets speech or music, whereas general sound effects such as environmental sounds or Foley sounds have received less attention. In this work, we study the generative performance of NAS models for sound effects with a conditional Wasserstein GAN (WGAN) model. We train our models conditioned on different classes of sound effects and report on their performances in terms of quality and diversity. Many existing GAN models use magnitude spectrograms which require audio reconstruction using phase estimation after training. The often imperfect reconstruction of the audio signal has led us to propose an additional audio reconstruction loss term for the generator. We show that this additional loss term improves the quality of the audio generation considerably with a small sacrifice in diversity. The results indicate that a conditional WGAN model trained on log-magnitude spectrograms paired with an appropriately weighted reconstruction loss is capable of synthesizing highly plausible sound effects.
... Recently, a real-time implementation of the endless sustain technique was presented [16]. Short velvet-noise sequences have been used for efficient audio decorrelation [17,18] as well as for humanizing sampling synthesis of percussive sounds [19]. ...
... Interestingly, all ternary sequences whose signs occur with equal probability converge to have a flat power spectral density (PSD) for an infinitely long sequence [20], no matter the generation process. Note that, for short sequences, deviations from the white spectrum occur [19], as is typical for all random noises. ...
This paper proposes dark velvet noise (DVN) as an extension of the original velvet noise with a lowpass spectrum. The lowpass spectrum is achieved by allowing each pulse in the sparse sequence to have a randomized pulse width. The cutoff frequency is controlled by the density of the sequence. The modulated pulse-width can be implemented efficiently utilizing a discrete set of recursive running-sum filters, one for each unique pulse width. DVN may be used in reverberation algorithms. Typical room reverberation has a frequency-dependent decay, where the high frequencies decay faster than the low ones. A similar effect is achieved by lowering the density and increasing the pulse-width of DVN in time, thereby making the DVN suitable for artificial reverberation.
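A minimal sketch of the DVN construction described above: each pulse of an ordinary velvet-noise sequence is widened into a rectangle of random width, which lowpasses the spectrum. For clarity the rectangles are written out directly; the paper instead realises each width efficiently with recursive running-sum filters.

```python
import random

def make_dark_velvet(length, density, max_width, fs=44100, seed=0):
    """Sketch of dark velvet noise: like velvet noise, but each +/- pulse
    is widened into a rectangle of random width, lowpassing the spectrum.
    Wider pulses (and lower density) push the cutoff frequency down."""
    rng = random.Random(seed)
    td = fs / density                       # average pulse spacing in samples
    seq = [0.0] * length
    m = 0
    while True:
        pos = int(m * td + rng.random() * (td - 1))
        if pos >= length:
            break
        sign = 1.0 if rng.random() < 0.5 else -1.0
        width = rng.randint(1, max_width)   # randomized pulse width
        for k in range(pos, min(pos + width, length)):
            seq[k] = sign
        m += 1
    return seq
```

Decreasing `density` while increasing `max_width` over the course of a sequence reproduces, in spirit, the frequency-dependent decay the abstract mentions for reverberation use.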
... Thus, convolution of an audio signal with velvet noise is extremely cheap in terms of computational cost [3,4,5]. This property, combined with the smooth perceptual quality of velvet noise [2], makes it a suitable tool in audio processing and synthesis [5,6]. This paper focuses on the use of velvet noise in multichannel artificial reverberation and its properties regarding decorrelation. ...
The cross-correlation of multichannel reverberation generated using interleaved velvet noise is studied. The interleaved velvet-noise reverberator was proposed recently for synthesizing the late reverb of an acoustic space. In addition to providing a computationally efficient structure and a perceptually smooth response, the interleaving method allows combining its independent branch outputs in different permutations, which are all equally smooth and flutter-free. For instance, a four-branch output can be combined in 4! or 24 ways. Additionally, each branch output set is mixed orthogonally, which increases the number of permutations from M! to M^2!, since sign inversions are taken along. Using specific matrices for this operation, which change the sign of velvet-noise sequences, decreases the correlation of some of the combinations. This paper shows that many selections of permutations offer a set of well-decorrelated output channels, which produce a diffuse and colorless sound field; this is validated with spatial variation. The results of this work can be applied in the design of computationally efficient multichannel reverberators.
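The decorrelation claim above can be checked with an ordinary correlation measure between output channels. The helper below computes the zero-lag correlation coefficient of two channels, with values near zero indicating well-decorrelated, diffuse-sounding outputs; it is a generic measure, not the spatial-variation metric used in the paper, and it assumes non-constant inputs.

```python
import math

def correlation_coefficient(a, b):
    """Zero-lag correlation coefficient between two equal-length,
    non-constant channels; values near 0 indicate good decorrelation,
    values near +/-1 indicate strongly correlated channels."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

Note that inverting the sign of one whole channel only flips the sign of this coefficient; the sign matrices discussed above instead flip individual velvet-noise sequences within the mix, which is what changes the correlation between channel combinations.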
... In the context of game audio, this is often called generative or procedural audio [2]. Procedural audio usually refers to the use of real-time digital signal processing (DSP) systems such as sound synthesisers, while generative (audio) can be defined as "algorithms to produce an output that is not explicitly defined" [3, p.1]. Apart from sound synthesisers, other DSP methods involve the manipulation of audio files in order to obtain a desired effect, transforming the source asset [4,5]. Generative and procedural audio allow the dynamic creation of sound assets on demand. ...
... Guidelines for choosing a suitable sound synthesis method for a target sound have also been proposed [2,14,15]. There have also been DSP systems tailored to generating variations of target pre-recorded impact sounds, such as in [4] and, after we conducted our listening study, in [5]. ...
... We would also observe that, although SpecSinGAN is shown to be a viable alternative for synthesising arbitrary one-shot sound effects, DSP-based systems are also capable of producing continuous streams of audio, as well as running in real time with direct input from either human-interpretable controls or in-game parameters, granting them great adaptability. We also acknowledge that, while we focused on arbitrary sound effects, further listening studies need to be carried out to understand how SpecSinGAN compares to DSP methods such as [5] for generating variations of target percussive sounds and to [28], adapting it to work with shorter sounds at 44.1 kHz. We suggest, however, that SpecSinGAN can be useful in contexts where 1) sound designers need to produce novel variations of a specific pre-recorded sound, or 2) data is scarce, in which case SpecSinGAN acts as a data augmentation tool. ...
Single-image generative adversarial networks learn from the internal distribution of a single training example to generate variations of it, removing the need for a large dataset. In this paper we introduce SpecSinGAN, an unconditional generative architecture that takes a single one-shot sound effect (e.g., a footstep; a character jump) and produces novel variations of it, as if they were different takes from the same recording session. We explore the use of multi-channel spectrograms to train the model on the various layers that comprise a single sound effect. A listening study comparing our model to real recordings and to digital signal processing procedural audio models in terms of sound plausibility and variation revealed that SpecSinGAN is more plausible and varied than the procedural audio models considered, when using multi-channel spectrograms. Sound examples can be found at the project website: https://www.adrianbarahonarios.com/specsingan/