Chapter · PDF Available

Musical Sound Modeling with Sinusoids plus Noise

... A 700-ms-long flute tone (along with other wind instrument tones) was generated by means of the spectral analysis method (Strong & Clark, 1967) using a weighted sum of 30 sinusoids. Later, another additive synthesis method, Spectral Modeling Synthesis (SMS), based on the overlap-add method, was developed; it modelled the spectrum as the sum of deterministic and stochastic components (Serra & Smith, 1990; Serra et al., 1997). The time-domain envelope was not explicitly modelled. ...
... The wind noise is another important component present in the flute sound. In a flute, while the harmonic part of the signal is generated as a result of the sustained oscillations produced inside the bore, the wind noise is produced by the turbulent streaming of the air when it passes through a narrow opening (Serra et al., 1997). In addition to the harmonic components, noise-like energy can also be seen in the spectrogram shown in Figure 1. ...
... We model the pitch contour, harmonic amplitudes, time-domain envelope and wind noise for each of the notes for this task. Synthesis is based on the modified sinusoids plus noise model (Serra et al., 1997). ...
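The citation contexts above describe flute synthesis as a harmonic (deterministic) part plus a wind-noise (stochastic) part. Below is a minimal sketch of that sinusoids-plus-noise idea, not the cited flute model: a sum of harmonic sinusoids mixed with band-filtered white noise. The fundamental, harmonic rolloff, noise band, and mix level are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

sr = 44100
dur = 0.7                                  # 700 ms, echoing the flute-tone example above
t = np.arange(int(sr * dur)) / sr

f0 = 440.0                                 # assumed fundamental
harm_amps = 1.0 / np.arange(1, 11)         # assumed 1/k rolloff over 10 harmonics

# Deterministic part: sum of harmonically related sinusoids
deterministic = sum(a * np.sin(2 * np.pi * k * f0 * t)
                    for k, a in enumerate(harm_amps, start=1))

# Stochastic part: band-limited "wind noise" from filtered white noise
b, a = butter(2, [500 / (sr / 2), 4000 / (sr / 2)], btype="band")
stochastic = lfilter(b, a, np.random.randn(len(t)))

note = deterministic + 0.05 * stochastic   # mix level is an assumption
note *= np.hanning(len(note))              # simple time-domain envelope
```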
Article
Full-text available
Gamakas, the essential embellishments, are integral parts of Karnatic music. Synthesising any form of Karnatic music necessitates proper modelling and synthesis of the different gamakas associated with each note. We propose a spectral model to efficiently synthesize gamakas for Karnatic bamboo flute music from the notes, duration and gamaka information. We model three different components of the flute sound, namely, pitch contour, harmonic weights and time-domain amplitude envelope. Cubic splines are used to parametrically represent these components. Subjective analysis of the results shows that the proposed method is better than the existing spectral methods in terms of tonal and aesthetic qualities of gamaka rendition. Hypothesis test results show that the observed improvements over other methods are statistically significant at the 95% confidence level.
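The abstract above parameterizes the pitch contour with cubic splines. A hedged sketch of that idea follows: a few spline knots define an instantaneous F0 trajectory, which is rendered by integrating the frequency into phase. The knot times and frequencies are invented for illustration; the paper's actual gamaka parameterization is not reproduced here.

```python
import numpy as np
from scipy.interpolate import CubicSpline

sr = 44100
knot_times = np.array([0.0, 0.1, 0.25, 0.4, 0.6])           # seconds (assumed)
knot_freqs = np.array([294.0, 330.0, 294.0, 330.0, 294.0])  # Hz, oscillation around D4 (assumed)

pitch_spline = CubicSpline(knot_times, knot_freqs)

t = np.arange(int(sr * knot_times[-1])) / sr
f_inst = pitch_spline(t)                     # instantaneous F0 (Hz)
phase = 2 * np.pi * np.cumsum(f_inst) / sr   # integrate frequency to get phase
tone = np.sin(phase)                         # fundamental only, for brevity
```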
... In this paper a method is presented for proposing sound modifications based on their measurements (or simulations) by showing how the tonal components of the sound can be changed to achieve a specific goal, defined in this paper as a target tonality value. The proposed method is based on the Harmonic-Plus-Noise method [5], which decomposes the sound into a tonal and a noise part. This approach is extended by using the High-resolution Spectral Analysis (HSA) method [6,7] to identify and remove tonal components of the original signal by means of a deconvolution. ...
... The method described in this paper is based on the Harmonic-Plus-Noise model presented in [5]. The method decomposes the signal into a harmonic and a stochastic part as follows: ...
... This smoothed spectrum is then resynthesized with a random phase, which is equivalent to filtering an ideal white noise using the smoothed spectrum as the impulse response h(t). In the method presented in [5], a quadratic interpolation is performed to obtain the exact components and phase. This step is extended here by using the HSA method instead, which has considerably better frequency resolution than quadratic interpolation [7] and can also estimate the phase of the tonal component. ...
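The excerpts above mention two concrete steps: quadratic (parabolic) interpolation of spectral peaks, and resynthesis of a smoothed magnitude spectrum with random phase to obtain the stochastic part. The sketch below illustrates both under simplified assumptions; the frame is a random stand-in, the smoothing is a crude moving average, and none of this reproduces the HSA method discussed in the cited work.

```python
import numpy as np

sr = 44100
x = np.random.randn(2048)                   # stand-in frame; use a real residual in practice
win = np.hanning(len(x))
X = np.fft.rfft(x * win)
mag = np.abs(X)

# (1) quadratic interpolation around the largest peak (in dB)
k = np.argmax(mag[1:-1]) + 1
a, b, c = 20 * np.log10(mag[k - 1:k + 2] + 1e-12)
delta = 0.5 * (a - c) / (a - 2 * b + c)     # fractional-bin offset in [-0.5, 0.5]
f_peak = (k + delta) * sr / len(x)          # refined peak frequency in Hz

# (2) stochastic resynthesis: smoothed magnitude + random phase
smooth = np.convolve(mag, np.ones(8) / 8, mode="same")   # crude spectral smoothing
phase = np.random.uniform(-np.pi, np.pi, len(smooth))
noise_frame = np.fft.irfft(smooth * np.exp(1j * phase), n=len(x))
```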
Conference Paper
E-vehicles can generate strong tonal components that may disturb people inside the vehicle. However, such components, deliberately generated, may be necessary to meet audibility standards that ensure the safety of pedestrians outside the vehicle. A tradeoff must be made between pedestrian audibility and internal sound quality, but any iteration that requires additional measurements is costly. One solution to this problem is to modify the recorded signals to find the variant with the best sound quality that complies with regulations. This is only possible if there is a good separation of the tonal components of the signal. In this work, a method is proposed that uses the High-resolution Spectral Analysis (HSA) to extract the tonal components of the signal, which can then be recombined to optimize any sound quality metric, such as the tonality using the Sottek Hearing Model (standardized in ECMA 418-2).
... While real-world sounds could also be inverted to find latent representations in a trained GAN, they are much more difficult to control than parametric acoustic sound synthesizers [59], [60], [61], [62] or physics-based models [63]. For our inharmonic textures, we use a physically informed synthesis technique found in William Gaver's seminal work on auditory perception [64], [65]. ...
Preprint
Full-text available
Generative models for synthesizing audio textures explicitly encode controllability by conditioning the model with labelled data. While datasets for audio textures can be easily recorded in-the-wild, semantically labeling them is expensive, time-consuming, and prone to errors due to human annotator subjectivity. Thus, to control generation, there is a need to automatically infer user-defined perceptual factors of variation in the latent space of a generative model while modelling unlabeled textures. In this paper, we propose an example-based framework to determine vectors to guide texture generation based on user-defined semantic attributes. By synthesizing a few synthetic examples to indicate the presence or absence of a semantic attribute, we can infer the guidance vectors in the latent space of a generative model to control that attribute during generation. Our results show that our method is capable of finding perceptually relevant and deterministic guidance vectors for controllable generation for both discrete as well as continuous textures. Furthermore, we demonstrate the application of this method to other tasks such as selective semantic attribute transfer.
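The abstract above infers guidance vectors from a few examples with and without a semantic attribute. A common, simple way to do this, sketched below, is to take the difference between the mean latent codes of the two example sets; this is only an illustration of the idea, not the cited framework, and the latents here are random placeholders rather than codes obtained from a trained generator.

```python
import numpy as np

latent_dim = 128
with_attr = np.random.randn(8, latent_dim)      # latents of examples having the attribute
without_attr = np.random.randn(8, latent_dim)   # latents of examples lacking it

guidance = with_attr.mean(axis=0) - without_attr.mean(axis=0)
guidance /= np.linalg.norm(guidance)            # unit-length guidance direction

z = np.random.randn(latent_dim)                 # some latent to be edited
z_edited = z + 2.0 * guidance                   # step size is an assumption
```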
... In the original DDSP paper, the parameters of a differentiable audio synthesis model were estimated by a neural network in an end-to-end manner [4]. More specifically, a differentiable version of an additive synthesis model called the harmonics-plus-noise model, a variant of the sinusoids-plus-noise model [23], was used. While this model can technically be viewed as a synthesizer, its controls are much more complicated than conventional synthesizers, as the amplitude of each harmonic and the full frequency response of the filter at each frame must be specified. ...
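The excerpt above notes that a harmonics-plus-noise synthesizer requires one amplitude per harmonic plus a full noise-filter frequency response per frame. The sketch below makes that control structure concrete for a single frame; it is not the DDSP implementation, and the control values are random stand-ins for what a network would predict.

```python
import numpy as np

sr = 16000
frame_len = 1024
f0 = 220.0
n_harm = 40

t = np.arange(frame_len) / sr
harm_amps = np.random.rand(n_harm) / n_harm           # per-harmonic amplitudes (one per frame)
harmonic = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
               for k, a in enumerate(harm_amps))

noise_mag = np.random.rand(frame_len // 2 + 1)        # full noise-filter magnitude response
phase = np.random.uniform(-np.pi, np.pi, len(noise_mag))
noise = np.fft.irfft(noise_mag * np.exp(1j * phase), n=frame_len)

frame = harmonic + 0.1 * noise                        # one synthesized frame
```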
Article
Full-text available
While synthesizers have become commonplace in music production, many users find it difficult to control the parameters of a synthesizer to create a sound as they intended. In order to assist the user, the sound matching task aims to estimate synthesis parameters that produce a sound that is as close as possible to the query sound. Recently, neural networks have been employed for this task. These neural networks are trained on paired data of synthesis parameters and the corresponding output sound, optimizing a loss of synthesis parameters. However, the user's query usually consists of real-world sounds, different from the synthesizer output sounds used as training data. In a previous work, the authors presented a sound matching method where the synthesizer is implemented using differentiable DSP. The estimator network could then be trained by directly optimizing the spectral similarity between the original sound and the output sound. Furthermore, the network could be trained on real-world sounds whose ground-truth synthesis parameters are unavailable. This method was shown to improve the match quality in both objective and subjective measures. In this work, we experiment with different synthesizer configurations and extend this approach to a more practical synthesizer with effect modules and envelope generators. We propose a novel training strategy where the network is fully trained using both parameter loss and spectral loss. We show that models trained using this strategy are able to utilize the chorus effect effectively, while models that switch completely to spectral loss underutilize it.
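The spectral loss referred to above is typically computed by comparing magnitude spectra at several resolutions. Below is a hedged sketch of such a multi-resolution spectral loss; the FFT sizes, hop, and L1 formulation are common choices, not necessarily those of the cited work.

```python
import numpy as np

def spectral_loss(x, y, fft_sizes=(2048, 1024, 512)):
    """Sum of mean L1 distances between magnitude spectra at several FFT sizes."""
    loss = 0.0
    for n in fft_sizes:
        hop = n // 4
        win = np.hanning(n)
        for i in range(0, min(len(x), len(y)) - n, hop):
            X = np.abs(np.fft.rfft(x[i:i + n] * win))
            Y = np.abs(np.fft.rfft(y[i:i + n] * win))
            loss += np.mean(np.abs(X - Y))
    return loss

# toy usage with random signals standing in for target and synthesized audio
print(spectral_loss(np.random.randn(8192), np.random.randn(8192)))
```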
... Aware of the drawbacks mentioned above, we have designed a novel approach to harmonic estimation and removal in wind turbines inspired by the techniques which appear in speech and audio signal synthesis and coding [46][47][48][49]. It turns out that audio signals bear certain similarity to vibration signals, in the sense that tonal music sounds are made out of time-variant deterministic plus stochastic contribution. ...
Article
Full-text available
We present a novel approach to harmonic disturbance removal in single-channel wind turbine acceleration data by means of time-variant signal modeling. Harmonics are conceived as a set of quasi-stationary sinusoids whose instantaneous amplitude and phase vary slowly and continuously in a short-time analysis frame. These non-stationarities in the harmonics are modeled by low-degree time polynomials whose coefficients capture the instantaneous dynamics of the corresponding waveforms. The model is linear-in-parameters and is straightforwardly estimated by the linear least-squares algorithm. Estimates from contiguous analysis frames are further combined in the overlap-add fashion in order to yield overall harmonic disturbance waveform and its removal from the data. The algorithm performance analysis, in terms of input parameter sensitivity and comparison against three state-of-the-art methods, has been carried out with synthetic signals. Further model validation has been accomplished through real-world signals and stabilization diagrams, which are a standard tool for determining modal parameters in many time-domain modal identification algorithms. The results show that the proposed method exhibits a robust performance particularly when only the average rotational speed is known, as is often the case for stand-alone sensors which typically carry out data pre-processing for structural health monitoring. Moreover, for real-world analysis scenarios, we show that the proposed method delivers consistent vibration mode parameter estimates, which can straightforwardly be used for structural health monitoring.
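The abstract above describes harmonics whose in-phase and quadrature amplitudes vary as low-degree time polynomials within a frame, making the model linear in its coefficients and solvable by linear least squares. The sketch below reduces this to one harmonic with degree-1 polynomials on a synthetic signal; frequencies, frame length, and noise level are assumptions for illustration only.

```python
import numpy as np

fs = 1000.0
t = np.arange(1000) / fs
f_h = 25.0                                   # assumed (average) harmonic frequency
x = (1.0 + 0.3 * t) * np.cos(2 * np.pi * f_h * t + 0.4) + 0.05 * np.random.randn(len(t))

# Design matrix: [cos, t*cos, sin, t*sin] -> linear in the polynomial coefficients
c, s = np.cos(2 * np.pi * f_h * t), np.sin(2 * np.pi * f_h * t)
A = np.column_stack([c, t * c, s, t * s])
coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)

harmonic_estimate = A @ coeffs               # estimated harmonic disturbance waveform
residual = x - harmonic_estimate             # signal with the disturbance removed (this frame)
```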
... A more audacious and challenging goal would consist of realizing sound restoration by resynthesis. Putting aside issues associated with the artistic fidelity of the results, from a technical point of view such strategy would be multidisciplinary, involving among others audio content analysis; computational auditory scene analysis within a structured-audio framework (Bregman, 1990;Dixon, 2004;Ellis, 1996;Rosenthal & Okuno, 1998;Scheirer, 1999;Vercoe, Gardner, & Scheirer, 1998); sound source modeling and synthesis of musical instrument, speech, and singing voice sounds (Cook, 2002;Miranda, 2002;Serra, 1997;Smith, 1991;Tolonen, 2000;Välimäki, Pakarinen, Erkut, & Karjalainen, 2006); psychoacoustics (Järveläinen, 2003;Zwicker & Fastl, 1999); and auditory modeling (Beerends, ...
Chapter
Music experienced through vibrotactile interfaces is a method of perceiving musical elements through the sense of touch, often involving vibrations. This technology functions by converting audio signals into physical sensations that can be sensed through the skin, typically via a wearable device like a wristband. Beginning with an initial audio file devoid of tactile feedback, the procedure entails altering it through sinusoidal modeling and, if necessary, implementing a Space-Fixed Audio transformation by utilizing the Head-Related Transfer Function (HRTF). In this study, we successfully transformed sound files into tactile stereo vibrations using specialized hardware. This process was rigorously tested and validated through experimentation involving ten individuals. Our findings confirm that psychophysical sensations can indeed be perceptible. Notably, the most consistent responses were observed when applying the Vibrato and Tremolo effect, following an HRTF transformation. The Space-Fixed Audio transformation primarily introduced variations in azimuth, covering 360° in a clockwise direction. Consequently, this processing led to significant spectral changes, effectively rescaling and compressing the audio’s frequencies into lower equivalents. These modified spectral characteristics were subsequently transmitted through vibrotactile actuators, thereby transforming the essence of sound into a tactile experience. This innovative system creates a sensory replacement approach based on the psychophysical sensations perceived on the skin. It has proven to be exceptionally beneficial, particularly for individuals with hearing impairments who may not perceive music in the same way as individuals with typical hearing abilities.
Article
Controllable generation in StyleGANs is usually achieved by training the model using labeled data. For audio textures, however, there is currently a lack of large semantically labeled datasets. Therefore, to control generation, we develop a method for semantic control over an unconditionally trained StyleGAN in the absence of such labeled datasets. In this paper, we propose an example-based framework to determine guidance vectors for audio texture generation based on user-defined semantic attributes. Our approach leverages the semantically disentangled latent space of an unconditionally trained StyleGAN. By using a few synthetic examples to indicate the presence or absence of a semantic attribute, we infer the guidance vectors in the latent space of the StyleGAN to control that attribute during generation. Our results show that our framework can find user-defined and perceptually relevant guidance vectors for controllable generation for audio textures. Furthermore, we demonstrate an application of our framework to other tasks, such as selective semantic attribute transfer.
Conference Paper
Full-text available
Interest in neural audio synthesis has been growing lately both in academia and industry. Deep Learning (DL) synthesisers enable musicians to generate fresh, often completely unconventional sounds. However, most of these applications present a drawback. It is difficult for musicians to generate sounds which reflect the timbral properties they have in mind, because of the nature of the latent spaces of such systems. These spaces generally have large dimensionality and cannot easily be mapped to semantically meaningful timbral properties. Navigation of such timbral spaces is therefore impractical. In this paper, we introduce a DL-powered instrument that generates guitar sounds from vocal commands. The system analyses vocal instructions to extract timbral descriptors which condition the sound generation.
Article
Full-text available
We have developed an alternate method of representing harmonic amplitude envelopes of musical instrument sounds using principal component analysis. Statistical analysis reveals considerable correlation between the harmonic amplitude values at different time positions in the envelopes. This correlation is exploited in order to reduce the dimensionality of envelope specification. It was found that two or three parameters provide a reasonable approximation to the different harmonic envelope curves present in musical instrument sounds. The representation is suited for the development of high-level control mechanisms for manipulating the timbre of resynthesized harmonic sounds.
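A minimal sketch of the PCA idea in the abstract above: stack harmonic amplitude envelopes as rows, compute principal components, and approximate each envelope with its first two or three component weights. The envelopes here are synthetic stand-ins, not measured instrument data.

```python
import numpy as np

n_env, n_frames = 50, 100
t = np.linspace(0, 1, n_frames)
# toy envelopes: attack/decay shapes with random decay rates
envelopes = np.array([np.exp(-r * t) * (1 - np.exp(-20 * t))
                      for r in np.random.uniform(1, 6, n_env)])

mean = envelopes.mean(axis=0)
U, S, Vt = np.linalg.svd(envelopes - mean, full_matrices=False)

k = 3                                        # two or three parameters, as reported above
weights = (envelopes - mean) @ Vt[:k].T      # low-dimensional envelope representation
reconstructed = mean + weights @ Vt[:k]      # envelopes approximated from k weights each
```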
Article
Full-text available
Fundamental frequency (F0) estimation for quasiharmonic signals is an important task in music signal processing. Many previously developed techniques have suffered from unsatisfactory performance due to ambiguous spectra, noise perturbations, wide frequency range, vibrato, and other common artifacts encountered in musical signals. In this paper a new two-way mismatch (TWM) procedure for estimating F0 is described which may lead to improved results in this area. This computer-based method uses the quasiharmonic assumption to guide a search for F0 based on the short-time spectra of an input signal. The estimated F0 is chosen to minimize discrepancies between measured partial frequencies and harmonic frequencies generated by trial values of F0. For each trial F0, mismatches between the harmonics generated and the measured partial frequencies are averaged over a fixed subset of the available partials. A weighting scheme is used to reduce the susceptibility of the procedure to the presence of noise or absence of certain partials in the spectral data. Graphs of F0 estimate versus time for several representative recorded solo musical instrument and voice passages are presented. Some special strategies for extending the TWM procedure for F0 estimation of two simultaneous voices in duet recordings are also discussed.
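A heavily simplified sketch of the two-way mismatch idea described above: for each trial F0, measure how far each predicted harmonic is from its nearest measured partial, and how far each measured partial is from its nearest predicted harmonic, then pick the trial F0 with the smallest combined error. The amplitude/frequency weighting scheme of the published TWM procedure is omitted, and the partial frequencies are toy values.

```python
import numpy as np

def twm_estimate(partials, f0_candidates, n_harm=10):
    partials = np.asarray(partials, dtype=float)
    best_f0, best_err = None, np.inf
    for f0 in f0_candidates:
        harmonics = f0 * np.arange(1, n_harm + 1)
        harmonics = harmonics[harmonics <= partials.max() * 1.05]  # stay within measured band
        err_pm = np.mean([np.min(np.abs(partials - h)) / h for h in harmonics])  # predicted -> measured
        err_mp = np.mean([np.min(np.abs(harmonics - p)) / p for p in partials])  # measured -> predicted
        err = err_pm + err_mp
        if err < best_err:
            best_f0, best_err = f0, err
    return best_f0

partials = np.array([221.0, 439.5, 661.0, 880.8])   # measured peak frequencies (toy)
candidates = np.arange(100.0, 500.0, 1.0)
print(twm_estimate(partials, candidates))            # expected near 220 Hz
```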
Article
This paper discusses an additional signal processing procedure for use in a recently proposed frequency-domain additive synthesizer [1, 2]. The overlap-add approach used in the frequency-domain synthesis can result in amplitude modulation in the overlap region between successive time frames if the partial frequencies in those frames are significantly different. A new method is proposed in which sinusoids of linearly time-varying frequency (chirps) are used to accommodate the frame-to-frame frequency changes.

1. INTRODUCTION

In analysis-based synthesis, the synthesis is driven by a set of analysis parameters that describe the time evolution of the input signal. For additive music synthesis, these parameters typically include the amplitude, frequency, and phase of each sinusoidal component or partial of the input. A time-domain synthesizer uses these parameter tracks as control inputs to a bank of oscillators whose respective outputs are accumulated to generate the final output; this met...
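A small sketch of the chirp idea summarized above: across the overlap between two frames whose partial frequencies differ, synthesize a sinusoid whose frequency moves linearly from the first frame's value to the second's, instead of cross-fading two fixed-frequency sinusoids. The frequencies and overlap length are illustrative assumptions.

```python
import numpy as np

sr = 44100
overlap = 512
t = np.arange(overlap) / sr

f1, f2 = 440.0, 452.0                      # partial frequency in frame k and frame k+1
f_inst = np.linspace(f1, f2, overlap)      # linearly time-varying frequency
phase = 2 * np.pi * np.cumsum(f_inst) / sr
chirp = np.sin(phase)                      # smooth transition, no beating-induced AM

# naive overlap-add of two fixed-frequency sinusoids, for comparison
fade = np.linspace(0, 1, overlap)
naive = (1 - fade) * np.sin(2 * np.pi * f1 * t) + fade * np.sin(2 * np.pi * f2 * t)
```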
Article
A procedure is described for the automatic extraction of the various pitch percepts which may be simultaneously evoked by complex tonal stimuli. The procedure is based on the theory of virtual pitch, and in particular on the principle that the whole pitch percept is dependent both on analytic listening (yielding spectral pitch), and on holistic perception (yielding virtual pitch). The more or less ambiguous pitch percept governed by these two pitch modes is described by two pitch patterns: the spectral-pitch pattern, and the virtual-pitch pattern. Each of these patterns consists of a number of pitch (height) values, and associated weights, which account for the relative prominence of every individual pitch. The spectral-pitch pattern is constructed by spectral analysis, extraction of tonal components, evaluation of masking effects (masking and pitch shifts), and weighting according to the principle of spectral dominance. The virtual-pitch pattern is obtained from the spectral-pitch pattern by an advanced algorithm of subharmonic coincidence assessment. The procedure is based on a previous algorithm for calculating virtual pitch and can be regarded as a generalized and advanced version thereof.
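The abstract above derives virtual pitch from subharmonic coincidences among spectral pitches. The sketch below is a much-reduced illustration of that principle: each spectral component "votes" for candidate fundamentals of which it could be a harmonic, with weight decaying for higher subharmonic numbers. Masking, pitch shifts, and spectral dominance from the full procedure are not modelled, and the component frequencies and weights are invented.

```python
import numpy as np

spectral_pitches = np.array([600.0, 800.0, 1000.0])   # components of a missing-fundamental tone
weights = np.array([1.0, 0.8, 0.6])                   # assumed prominence weights

candidates = np.arange(50.0, 500.0, 1.0)
salience = np.zeros_like(candidates)
for f, w in zip(spectral_pitches, weights):
    for n in range(1, 12):                            # consider subharmonics f/n
        sub = f / n
        salience += (w / n) * np.exp(-0.5 * ((candidates - sub) / 3.0) ** 2)

virtual_pitch = candidates[np.argmax(salience)]       # should land near 200 Hz
```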
Article
A computer algorithm has been developed for music applications that automatically estimates pitch from any small number of frequencies, not necessarily of equal amplitude. Low-integer ratios are sought for each pair of frequencies. If an actual frequency-pair ratio closely approximates a low-integer ratio, the matching integers of this latter "ideal harmonic ratio" are accepted as possible harmonic numbers of the two actual frequencies in the tone. Hence, by comparing components to each other we can infer their likely harmonic numbers; if this is done successfully, then estimating the fundamental becomes a straightforward task. Therefore, the efficacy of the method lies in the cumulative effect of several component pair evaluations, all substantiating consistent harmonic assignments to the participating components. The method has proven to be useful in extracting pitch from spectral peaks taken from natural musical sounds, and it closely approximates some results of earlier experimental measurements of pitch perception. The method operates well with no fundamental present and for nonsuccessive partials (as well as inharmonic partials), and it is "phase insensitive." Prior knowledge regarding the source of the signal (e.g., the musical instrument being played) is not required.
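An illustrative sketch of the pairwise-ratio idea in the abstract above: for each pair of measured component frequencies, look for a reduced low-integer ratio m:n that closely matches their ratio; the matched integers act as trial harmonic numbers, and the fundamental follows by dividing each frequency by its harmonic number. The scoring and consistency checks of the published algorithm are simplified to a median of the resulting F0 votes, and the partial frequencies are toy values.

```python
import numpy as np
from math import gcd
from itertools import combinations

def estimate_f0(freqs, max_harm=10, tol=0.02):
    f0_votes = []
    for fa, fb in combinations(sorted(freqs), 2):
        for n in range(1, max_harm + 1):          # trial harmonic number of the lower frequency
            for m in range(n + 1, max_harm + 1):  # trial harmonic number of the higher one
                if gcd(m, n) == 1 and abs(fb / fa - m / n) < tol * (m / n):
                    f0_votes.append(fa / n)
                    f0_votes.append(fb / m)
    return np.median(f0_votes) if f0_votes else None

# partials of a ~196 Hz tone with the fundamental missing (toy values)
print(estimate_f0([392.5, 588.0, 981.0]))         # expected near 196 Hz
```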
Article
Spectral modeling synthesis (SMS) is a spectrum modeling technique that models time-varying spectra as (1) a collection of sinusoids controlled through time by piecewise linear amplitude and frequency envelopes (the "deterministic" part), and (2) a time-varying filtered noise component (the "stochastic" part). This technique has proved to give general, high-quality transformations for a wide variety of musical signals. As part of the analysis process of SMS, the deterministic component is subtracted in the time domain from the original sound, resulting in a residual signal. For a good deterministic-stochastic decomposition this residual should be free of the partials of the analyzed sound, and a minimization criterion on the residual can be established to measure the quality of the decomposition. There are many control parameters in the analysis process, and their impact on the synthesized sound can be measured numerically by studying their effect on the residual. This paper will discuss the effect that parameters like size and type of analysis window, hop size, or FFT size have on the residual, and analytical ways to get the best analysis for a given sound.
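The abstract above uses the energy of the time-domain residual as a quality criterion for different analysis settings. The sketch below illustrates that criterion only: the deterministic part is estimated here by a per-frame least-squares sinusoid fit at a known frequency (a simplification, not the SMS analysis), subtracted in the time domain, and the relative residual energy is compared across frame lengths.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t + 0.3) + 0.1 * np.random.randn(len(t))  # tone plus noise

def residual_db(x, frame_len, f=440.0):
    """Relative residual energy (dB) after subtracting a per-frame sinusoid fit."""
    det = np.zeros_like(x)
    for i in range(0, len(x) - frame_len + 1, frame_len):
        seg_t = t[i:i + frame_len]
        A = np.column_stack([np.cos(2 * np.pi * f * seg_t),
                             np.sin(2 * np.pi * f * seg_t)])
        coeffs, *_ = np.linalg.lstsq(A, x[i:i + frame_len], rcond=None)
        det[i:i + frame_len] = A @ coeffs
    r = x - det
    return 10 * np.log10(np.sum(r ** 2) / np.sum(x ** 2))

for n in (256, 512, 1024):                  # compare analysis frame lengths via the residual
    print(n, residual_db(x, n))
```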