Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019
STATISTICAL SINUSOIDAL MODELING FOR EXPRESSIVE SOUND SYNTHESIS
Henrik von Coler
Audio Communication Group
TU Berlin
Berlin, Germany
voncoler@tu-berlin.de
ABSTRACT
Statistical sinusoidal modeling represents a method for transferring a sample library of instrument sounds into a database of sinusoidal parameters for use in real-time additive synthesis. For this purpose, single sounds, capturing a musical instrument in combinations of pitch and intensity, are segmented into attack, sustain and release. Partial amplitudes, frequencies and Bark band energies are calculated for all sounds and segments. For the sustain part, all partial and noise parameters are transformed to probabilistic distributions. Interpolated inverse transform sampling is introduced for generating parameter trajectories during synthesis in real time, allowing the creation of sounds located at pitches and intensities between the actual support points of the sample library. Evaluation is performed by qualitative analysis of the system response to sweeps of the control parameters pitch and intensity. Results for a set of violin samples demonstrate the ability of the approach to model dynamic timbre changes, which is crucial for the perceived quality of expressive sound synthesis.
1. INTRODUCTION
A system capable of expressive sound synthesis reacts to dynamic control input with the desired or appropriate changes in sound. In analysis-synthesis systems this means that the perceived timbral qualities of the synthesized sound emulate the behavior of the analyzed instrument as closely as possible. Such systems thus need to capture the individual sound of an instrument and allow manipulations based on a limited set of control parameters. To achieve this, the synthesis approach presented in this paper, entitled statistical sinusoidal modeling, combines a sample-based approach with a novel method for sinusoidal modeling.
Sample-based synthesis, in its basic form, is able to capture individual sounds very accurately but does not offer the manipulation techniques necessary for an expressive synthesis [1]. Sinusoidal modeling, on the other hand, offers wide-ranging means for sound manipulation. A key problem of sinusoidal modeling approaches, however, is the mapping of control parameters to the large number of synthesis parameters. Statistical sinusoidal modeling can be regarded as a way of mapping the control parameters pitch and intensity to the parameters of a sinusoidal model. This reduced set of control parameters is often considered the central input for similar sound synthesis systems.
Different approaches aim at improving sample-based sound synthesis. Among them are granular synthesis and corpus-based concatenative synthesis [2]. Combined with spectral manipulation techniques, the flexibility of these approaches is further increased. Such combinations have proven to be effective for expressive sound synthesis. Examples include spectral concatenative synthesis [3] and reconstructive phrase modeling [4].

Copyright: © 2019 Henrik von Coler. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
An extended source-filter model has been proposed by Hahn et al. [5, 6]. Partial and noise parameters are modeled as functions of the control parameters pitch and intensity, by means of tensor-product B-splines. Two separate filters are used, one representing the instrument-specific features by partial index and another one capturing the frequency-dependent partial characteristics. Wessel et al. [7] present a system for removing the temporal axis from the analysis data of sinusoidal models by the use of neural networks and memory-based machine learning. These methods are used to learn mappings of the three control parameters pitch, intensity and brightness to partial parameters. A system combining corpus-based concatenative synthesis with audio mosaicing [8] has been proposed by Wager et al. [9]. This approach is able to synthesize an expressive target melody with arbitrary sound material by target-to-source mapping, using the features pitch, RMS energy and the modulus of the windowed Short-Time Fourier Transform.
A key feature in an expressive re-synthesis of many instruments, especially of bowed strings, is the so-called spectral envelope modulation (SEM) [10]. The amplitude of each partial is modulated by its frequency in relation to the underlying frequency response of the instrument's resonant body. A vibrato in string instruments thus creates a periodic change in the relative partial amplitudes. At the typical vibrato frequencies of 5-9 Hz this effect is perceived as a timbral quality rather than a rhythmical feature. This phenomenon, perceptually also referred to as Sizzle [11], contributes to the individual sound of instruments to a great extent. When spectral modeling techniques are used for manipulating instrument sounds, this effect is also considered essential for improving the quality [12]. Glissandi result in spectral envelope modulations in the same way.
Another important aspect for an expressive re-synthesis is the connection between intensity and the spectral features of the instrument's sound. Increases in intensity usually cause significant changes in the spectral distribution, specifically in spectral skewness and spectral flatness [13], as well as in the tonal-to-noise ratio.
The proposed system is designed to encompass the above-mentioned effects with simple means, enabling an efficient real-time implementation. Details on the analysis, sinusoidal modeling and statistical modeling are presented in Section 2. Section 3 explains the statistical sinusoidal synthesis process in detail, followed by the evaluation of synthesis results in Section 4. The conclusion in Section 5 summarizes the findings and lists perspectives for further development.
Figure 1: Exemplary probability mass function (PMF) with derived cumulative mass function (CMF) and inverted CMF (ICMF): (a) PMF, (b) CMF, (c) ICMF.
2. ANALYSIS AND MODELING
2.1. Sample Library
The focus of the presented synthesis system rests on excitation-continuous melody instruments, using a violin as source material for the analysis stage. The TU-Note Violin Sample Library [14] is used for generating the statistical model. Featuring 336 single sounds and 344 two-note sequences, it has been specifically designed for this purpose. For the use in this project, the single sounds are reduced to a total of 204, consisting of 51 unique, equally spaced pitches, each captured at four dynamic levels. In the remainder, this two-dimensional space will be referred to as the timbre plane. It must be noted that, depending on the instrument, additional dimensions for timbre control need to be added to this space. For the violin, limited to standard techniques, the proposed reduction is acceptable. MIDI values from 0 to 127 are used to organize the dimensions pitch and velocity. Pitches range from the lowest violin note at MIDI 55 (G3 = 197.33 Hz at a tuning frequency of 443 Hz) to the note at MIDI 105 (A7 = 3544 Hz). The dynamic levels pp, mp, mf and ff are captured in the timbre plane. The material has been recorded at 96 kHz with 24 bit resolution and can be downloaded from a static repository [15].
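As a reference for the pitch dimension, the mapping from MIDI pitch to fundamental frequency at the 443 Hz tuning follows the standard equal-temperament relation f = 443 · 2^((m-69)/12). A minimal sketch of this mapping (the function name is illustrative, not part of the described system):

#include <cmath>
#include <cstdio>

// Equal-temperament MIDI-to-frequency conversion with adjustable
// tuning reference (A4 = MIDI 69). The sample library uses 443 Hz.
double midiToHz(int midiPitch, double a4Hz = 443.0)
{
    return a4Hz * std::pow(2.0, (midiPitch - 69) / 12.0);
}

int main()
{
    std::printf("MIDI 55 (G3): %.2f Hz\n", midiToHz(55));   // ~197.33 Hz
    std::printf("MIDI 105 (A7): %.0f Hz\n", midiToHz(105)); // 3544 Hz
    return 0;
}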
2.2. Sinusoidal Analysis
The sinusoids+noise model [16] is used for extracting the tonal and noise parameters for each single sound. Analysis and modeling are carried out offline, prior to the synthesis stage. Monophonic pitch tracking is performed using the YIN [17] and SWIPE [18] algorithms. Tests with more recent, real-time capable approaches [19] did not improve the performance. Based on the extracted fundamental frequency trajectories, partial trajectories are obtained by peak picking in the short-time Fourier transform (STFT), using a hop size of 256 samples (2.67 ms) and a window size of 4096 samples, zero-padded to 8192 samples.
Quadratic interpolation (QIFFT), as presented by Smith et al. [20], is applied for estimating amplitude and frequency of up to 80 partials in each frame. The partial phases φ_i are obtained by finding the argument of the minimum when subtracting each partial with the individual amplitude a_i and frequency f_i from the complete frame x at different phases φ:

\phi_i = \arg\min_{\phi \in [-\pi,\, \pi]} \left[ \sum_{n=1}^{L} \big( x(n) - a_i \sin(2\pi f_i t(n) + \phi) \big) \right]   (1)
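A straightforward way to implement Equation (1) is a grid search over candidate phases; the following minimal sketch assumes a squared-error criterion and an illustrative grid resolution, rather than reproducing the actual analysis code:

#include <cmath>
#include <vector>

// Estimate the phase of one partial (Eq. 1) by grid search: subtract a
// candidate sinusoid from the frame for each phase and keep the phase
// with the smallest residual. x and t must have the same length; a
// squared-error criterion is assumed here. numSteps sets the resolution.
double estimatePartialPhase(const std::vector<double>& x,
                            const std::vector<double>& t,
                            double ai, double fi,
                            int numSteps = 360)
{
    const double pi = 3.14159265358979323846;
    double bestPhase = -pi;
    double bestError = 1e300;
    for (int k = 0; k < numSteps; ++k) {
        const double phase = -pi + 2.0 * pi * k / numSteps;
        double error = 0.0;
        for (size_t n = 0; n < x.size(); ++n) {
            const double d = x[n] - ai * std::sin(2.0 * pi * fi * t[n] + phase);
            error += d * d;  // accumulate squared residual
        }
        if (error < bestError) { bestError = error; bestPhase = phase; }
    }
    return bestPhase;
}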
After all partial parameters are extracted, the residual is created by subtracting the tonal part with original phases from the complete sound in the time domain. Modeling of the residual component is performed with a filter bank based on the Bark scale, as proposed by Levine et al. [21]. The instantaneous energy trajectories of all bands are calculated using a sliding RMS with the hop size of the sinusoidal analysis (2.67 ms) and a window length of 21.33 ms.
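The band energy computation can be sketched as a sliding RMS over each band-filtered residual signal; a minimal example, assuming the band signal is already available and using the stated hop and window sizes at 96 kHz (256 and 2048 samples):

#include <cmath>
#include <vector>

// Sliding RMS over one Bark-band signal: hop of 256 samples (2.67 ms)
// and a window of 2048 samples (21.33 ms) at 96 kHz, as in the analysis.
std::vector<double> slidingRms(const std::vector<double>& band,
                               size_t hop = 256, size_t win = 2048)
{
    std::vector<double> rms;
    for (size_t start = 0; start + win <= band.size(); start += hop) {
        double sum = 0.0;
        for (size_t n = start; n < start + win; ++n)
            sum += band[n] * band[n];
        rms.push_back(std::sqrt(sum / win));
    }
    return rms;
}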
For each single sound, the analysis results in up to 80 partial amplitude trajectories, 80 partial frequency trajectories, 80 partial phase trajectories and 24 noise band energy trajectories. Since the original phases are not relevant for the proposed synthesis algorithm, they are not used in the further modeling steps.
2.3. Segmentation
The TU-Note Violin Sample Library includes manually annotated segmentation labels, based on a transient/steady-state discrimination. For the single sounds they define the attack, sustain and release portions of each sound. Trajectories during attack and release segments are stored completely and additionally modeled as parametric linear and exponential trajectories. Details on the modeling and synthesis of attack and release segments are beyond the scope of this paper. The sustain part is synthesized with the statistical sinusoidal modeling approach, explained in detail in the following section.
2.4. Statistical Modeling
After the segmentation, the trajectories of the partials and noise bands obtained above for the sustain portion of the sound are transformed into statistical distributions. Probability mass functions (PMF) with 50 equally spaced bins are created and transformed to cumulative mass functions (CMF):

CMF(x) = \sum_{x_i \le x} PMF(x_i)   (2)
Inverse transform sampling relies on the inverted cumulative mass function (ICMF), also referred to as the quantile function, for generating random number sequences with a given distribution. Figure 1 shows an exemplary PMF with the derived CMF and ICMF. For the synthesis algorithm, CMFs and their inversions are calculated and stored for all partial and noise trajectories during the sustain parts. CMFs for the first five partials' amplitudes and frequencies are shown in Figures 2 and 3, respectively. CMFs for the first five noise band energies are shown in Figure 4. Additionally, the mean, median and standard deviation of all distributions are stored with the model.
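A histogram-based construction of the PMF and its cumulative sum, as in Equation (2), might look as follows; this is a minimal sketch with 50 equally spaced bins, and the structure and function names are illustrative assumptions:

#include <algorithm>
#include <vector>

// Distribution model of one sustain-part trajectory: a 50-bin PMF,
// accumulated into a CMF, plus the value range covered by the bins.
struct TrajectoryDistribution {
    std::vector<double> cmf;    // CMF per Eq. (2); last entry == 1
    double lo = 0.0, hi = 0.0;  // value range of the histogram bins
};

// Assumes a non-empty trajectory.
TrajectoryDistribution buildDistribution(const std::vector<double>& traj,
                                         int numBins = 50)
{
    TrajectoryDistribution d;
    auto [mn, mx] = std::minmax_element(traj.begin(), traj.end());
    d.lo = *mn; d.hi = *mx;
    std::vector<double> pmf(numBins, 0.0);
    const double width = (d.hi - d.lo) / numBins;
    for (double v : traj) {
        int bin = (width > 0) ? int((v - d.lo) / width) : 0;
        bin = std::min(bin, numBins - 1);  // clamp top edge into last bin
        pmf[bin] += 1.0 / traj.size();     // normalized counts -> PMF
    }
    double cum = 0.0;
    for (double p : pmf) { cum += p; d.cmf.push_back(cum); }  // Eq. (2)
    return d;
}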
Figure 2: CMFs for the first five partial amplitudes.
Figure 3: CMFs for the first five partial frequencies.
3. SYNTHESIS
3.1. Algorithm Overview
The implementation of the synthesis algorithm is included in a C++ based framework [22], using the JACK API. Synthesis is performed in the time domain, with a non-overlapping approach and a frame size related to the buffer size of the audio interface. On the test system, a buffer size of 128 samples was used at a sampling rate of f_s = 48 kHz, which allows a responsive use of the synthesizer. For generating a single sound, a maximum of 160 partial parameters and 24 Bark band energies have to be generated per synthesis frame. The full number of 80 partials, however, is only synthesized for pitches below 600 Hz at a sampling rate of 96 kHz, respectively below 300 Hz at 48 kHz. Figure 5 shows the number of synthesized partials depending on sampling rate and fundamental frequency.
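The partial counts in Figure 5 follow from the Nyquist limit: only harmonics below f_s/2 are synthesized, capped at the 80 partials stored in the model. A minimal sketch of this rule, assuming strictly harmonic partial frequencies:

#include <algorithm>

// Number of partials to synthesize: harmonics i*f0 must stay below the
// Nyquist frequency fs/2; the model stores at most 80 partials.
int numPartials(double f0, double fs, int maxPartials = 80)
{
    const int nyquistLimited = static_cast<int>((fs / 2.0) / f0);
    return std::min(maxPartials, nyquistLimited);
}
// e.g. numPartials(300.0, 48000.0) == 80, numPartials(600.0, 48000.0) == 40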
Listing 1: Pseudo code for the synthesis algorithm.

for each frame:
    get control inputs
    for all partials:
        generate random frequency value
        generate random amplitude value
        generate linear amplitude ramp
        synthesize sinusoid
        add sinusoid to output
    for all Bark bands:
        generate random band energy
        generate linear energy ramp
        apply band filter to noise signal
        add band signal to output
Figure 4: CMFs for the first five Bark band energy trajectories.
Figure 5: Number of partials synthesized, depending on sampling rate (44.1 kHz, 48 kHz, 96 kHz) and fundamental frequency.
For each frame of the synthesis output, a new set of support points is generated, as shown in Listing 1. Interpolation trajectories are generated for the connection to the preceding values of partial amplitudes and noise band energies. Partial frequencies are piece-wise constant.
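One frame of the partial synthesis could be sketched as follows: the amplitude is ramped linearly from the previous support point to the new one, while the frequency remains constant within the frame. This is a minimal illustration under these assumptions, not the actual implementation [22]; names and state handling are hypothetical.

#include <cmath>
#include <vector>

// One synthesis frame for a single partial: linear amplitude ramp from
// the previous support point to the new one, piece-wise constant
// frequency. The running phase is kept across frames to avoid
// discontinuities at the frame boundaries.
void synthesizePartialFrame(std::vector<float>& out, double fs,
                            double freq, double ampPrev, double ampNew,
                            double& phase)
{
    const double twoPi = 2.0 * 3.14159265358979323846;
    const size_t L = out.size();  // frame size, e.g. 128 samples
    for (size_t n = 0; n < L; ++n) {
        const double a = ampPrev + (ampNew - ampPrev) * n / double(L);
        out[n] += float(a * std::sin(phase));  // add partial to output
        phase = std::fmod(phase + twoPi * freq / fs, twoPi);
    }
}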
3.2. Statistical Value Generation
At present, the statistical sinusoidal synthesis offers two different modes for generating parameter trajectories. The mode can be selected individually for each of the three synthesis parameter types (partial amplitude, partial frequency and noise band energy).
3.2.1. Mean/Median Mode
In the mean/median mode, the individual distribution functions are not used. Support points of the parameter trajectories are generated using the mean or median values stored in the model. For a constant control input, the resulting parameter trajectories remain constant, too. Variations in parameters are thus induced only through modulations of the input parameters.
3.2.2. Inverse Transform Sampling
Inverse transform sampling is a method for generating random number sequences with desired distributions from uniformly distributed random processes [23]. The inverted CMF, as shown in Figure 1c, maps the uniform distribution U(0,1) to the target distribution. The method can be implemented using a sequential search method [24, p. 85], without actually inverting the distribution functions in advance. For a random value 0 ≤ r ≤ 1 from the uniform distribution, the corresponding value r̃ from the target distribution can be obtained as the argument of the minimum of
the difference to the relevant cumulative mass function, as shown
in Figure 1b:
\tilde{r} = \arg\min_x \left[ CMF(x) - r \right]   (3)
In the implementation, this is realized using a vector search for Equation 3. Binary search trees can increase the efficiency of this approach, and lookup tables or guide tables for the individual distributions are even more efficient [24]. For the chosen number of parameters, the sequential search proved efficient enough to run the synthesis smoothly with 80 partials on an Intel Core i7-5500U CPU at 2.40 GHz.
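The sequential search for Equation (3) can be sketched as a linear scan over the stored CMF, mapping the winning bin back to a parameter value via the bin layout from the analysis stage; names and the bin-center mapping are illustrative assumptions:

#include <random>
#include <vector>

// Inverse transform sampling by sequential search (Eq. 3): find the
// first bin whose cumulative probability reaches r, then map the bin
// index back to a parameter value. cmf is monotonically increasing;
// lo and hi are the value range covered by the histogram bins.
double sampleFromCmf(const std::vector<double>& cmf,
                     double lo, double hi, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double r = uniform(rng);
    size_t bin = 0;
    while (bin + 1 < cmf.size() && cmf[bin] < r)
        ++bin;  // sequential search; a binary search would also work
    const double width = (hi - lo) / cmf.size();
    return lo + (bin + 0.5) * width;  // bin center as sampled value
}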
3.3. Timbre Plane Interpolation
Figure 6: Interpolation in the timbre plane: a point P inside the square spanned by the support points A, B, C and D.
For the use of expressive input streams for pitch and intensity, arbitrary points in the timbre plane need to be synthesized. Interpolation between the support points generated by analyzing the sample library is possible for the mean/median mode as well as for the inverse transform sampling mode. Figure 6 shows a point P located in the square ABCD between four support points. The weights for the parameters at each support point are calculated by the following distance-based formulae:

w_A = (1 - x)(1 - y)   (4)
w_B = x(1 - y)   (5)
w_C = x \cdot y   (6)
w_D = (1 - x)y   (7)
In the mean/median mode, the weights w_i can be directly applied to the mean or median values m_i corresponding to the parameter values at the given points A, B, C and D for obtaining the interpolated average m̃:

\tilde{m} = w_A m_A + w_B m_B + w_C m_C + w_D m_D   (8)
In the case of inverse transform sampling, the interpolation is performed as presented in Figure 7. A single random value r is generated from a uniformly distributed random process U(0,1). This value is then used to generate four random values r̃_i using the CMFs at the four support points. These resulting values are finally multiplied by the weights from Equations 4-7 and summed to obtain the interpolated random value r̃:

\tilde{r} = w_A \tilde{r}(CMF_A) + w_B \tilde{r}(CMF_B) + w_C \tilde{r}(CMF_C) + w_D \tilde{r}(CMF_D)   (9)
Figure 7: Interpolated inverse transform sampling.
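Combining the weights of Equations (4)-(7) with Equation (9), the interpolated sampling of Figure 7 can be sketched as follows; a single uniform value r drives the CMF lookups of all four support points. The CMF lookup is re-sketched here with an explicit r argument, and all names are illustrative:

#include <random>
#include <vector>

// Map a given uniform value r through one stored CMF (Eq. 3).
double valueFromCmf(const std::vector<double>& cmf,
                    double lo, double hi, double r)
{
    size_t bin = 0;
    while (bin + 1 < cmf.size() && cmf[bin] < r) ++bin;
    const double width = (hi - lo) / cmf.size();
    return lo + (bin + 0.5) * width;
}

struct Support { std::vector<double> cmf; double lo, hi; };

// Interpolated inverse transform sampling (Eq. 9): one uniform value r
// is mapped through the CMFs of the four surrounding support points
// A, B, C, D and blended with the bilinear weights of Eqs. (4)-(7).
double interpolatedSample(const Support& A, const Support& B,
                          const Support& C, const Support& D,
                          double x, double y, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double r = uniform(rng);
    const double wA = (1 - x) * (1 - y), wB = x * (1 - y);
    const double wC = x * y,             wD = (1 - x) * y;
    return wA * valueFromCmf(A.cmf, A.lo, A.hi, r)
         + wB * valueFromCmf(B.cmf, B.lo, B.hi, r)
         + wC * valueFromCmf(C.cmf, C.lo, C.hi, r)
         + wD * valueFromCmf(D.cmf, D.lo, D.hi, r);
}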
3.4. Smoothing
The inverse transform sampling method as presented above does not consider the recent state when generating new support points. Hence, it does not capture the frequency characteristics of the analyzed trajectories. As a result, rapid changes may occur in the synthesized trajectories which are not present in the original signals, although the resulting distribution functions are correct. For that reason, an adjustable low-pass filter is inserted after the random number generators for smoothing the trajectories. It should be noted that this filtering process narrows the distribution functions.
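Such a smoothing stage can be as simple as an adjustable one-pole low-pass applied to each generated parameter sequence; a minimal sketch, where the coefficient is the adjustable parameter (larger values smooth more strongly, but narrow the resulting distribution, as noted above):

// Adjustable one-pole low-pass smoothing of generated support points:
// y[n] = (1 - c) * x[n] + c * y[n-1], with 0 <= c < 1.
struct ParameterSmoother {
    double coeff = 0.9;   // adjustable smoothing coefficient
    double state = 0.0;   // previous output value

    double process(double input)
    {
        state = (1.0 - coeff) * input + coeff * state;
        return state;
    }
};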
4. MEASUREMENTS
For evaluating the ability of the proposed synthesis algorithm to react to expressive control streams, the responses to sweeps in the frequency and intensity dimensions are captured and analyzed by qualitative means. Only the deterministic component is used for this evaluation, discarding the noise.
4.1. Frequency Sweeps
Figure 8: SEM for an octave sweep of the first partial.
For analyzing the effect of the spectral envelope modulations, a frequency sweep of one octave is sent to the synthesis system at four different intensities. The sweep ranges from the lowest tone at MIDI 55 (G3, 197.33 Hz) to MIDI 67 (G4, 394.67 Hz). The responses of all active partials to the frequency sweeps are recorded as separate signals for analysis.
Figure 9: Representations of the frequency response through spectral envelope modulations of 30 partials by a one octave sweep: (a) low intensity (pp), (b) medium-low intensity (mp), (c) medium-high intensity (mf), (d) high intensity (ff).
Figure 8 shows the amplitude of the first partial as a function of the fundamental frequency. The resulting trajectory shows no discontinuities, validating the interpolation process. It further shows a prominent peak at approximately 257 Hz, caused by the spectral envelope modulations. Joining the amplitude-over-frequency trajectories of the first 30 partials, the frequency response of the instrument can be visualized through SEM. Results are shown for MIDI intensities 20 (pp), 50 (mp), 80 (mf) and 120 (ff) in Figure 9. With increasing partial index, the overlap with the neighboring partial trajectories increases. The approximated frequency responses are thus blurred at higher frequencies.
Figure 10: Input admittance at the bass bar side of an Andrea Guarneri violin [25].
All four representations of the frequency responses in Figure 9 show the same prominent peaks. These peaks correspond to the formants typical for violins. For comparison, the input admittance of a Guarneri violin is shown in Figure 10. Characteristic resonances of violin bodies have been labeled inconsistently by different researchers. However, following Curtin et al. [11], the prominent resonances for Figure 10 are listed in Table 1. Plots 9a-9d all show the f-hole resonance at 284 Hz and the main wood resonance, respectively the lowest corpus mode, at 415 Hz. At higher intensities, the plots show peaks at 709 Hz, 872 Hz and 1170 Hz, related to the upper wood resonances and the lateral air motion. The so-called violin formant is represented by a region of increased energy between 2000 Hz and 3000 Hz.
Table 1: Main resonances of a violin body [25, 11].

Label     Frequency       Description
A0        275 Hz          f-hole resonance
C2 (T1)   460 Hz          main wood
C3        530 Hz          second wood
C4        700 Hz          third wood
F         1000 Hz         lateral air motion
-         2000–3000 Hz    violin formant, bridge hill
4.2. Intensity Modulations
The response of the synthesis system to changes in intensity is captured at four different pitches. Intensity sweeps from 0 to 127 are used at MIDI pitches 55 (G3, 197.33 Hz), 67 (G4, 394.67 Hz), 79 (G5, 789.33 Hz) and 93 (A6, 1772.00 Hz). The plots in Figure 11 show the spectrum of the harmonic signal as a function of intensity, sampled at the partial frequencies.
Figure 11: Amplitudes of the first 50 partials as a function of intensity, captured for four different pitches: (a) low pitch (MIDI 55), (b) medium-low pitch (MIDI 67), (c) medium-high pitch (MIDI 79), (d) high pitch (MIDI 93).
For higher pitches, the number of partials is reduced, resulting in a lower frequency resolution. An increase in high-frequency content is indicated for higher intensities at all pitches.
Figure 12: Harmonic spectral centroid as a function of intensity for four different MIDI pitches (55, 67, 79, 93).
The harmonic spectral centroid (HSC) is calculated at 50 equally spaced points within all intensity sweeps for analyzing the influence of the intensity on the harmonic component of the signal. Based on the spectral centroid, the HSC regards only the amplitudes a_i of the partials, resulting in a pitch-independent measure for the spectral distribution of the partials:

HSC = \frac{\sum_{i=1}^{N} i \, a_i}{\sum_{i=1}^{N} a_i}   (10)
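A direct implementation of Equation (10) accumulates index-weighted partial amplitudes; a minimal sketch:

#include <vector>

// Harmonic spectral centroid (Eq. 10): amplitude-weighted mean of the
// partial indices, independent of the actual pitch.
double harmonicSpectralCentroid(const std::vector<double>& amps)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < amps.size(); ++i) {
        num += (i + 1) * amps[i];  // partial indices start at 1
        den += amps[i];
    }
    return (den > 0.0) ? num / den : 0.0;
}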
Figure 12 shows the HSC as a function of the intensity for four different pitches. All trajectories show a quasi-monotonic increase of the HSC with increasing intensity. Changes in intensity thus result in changes in timbre, specifically in brightness.
5. CONCLUSION
The proposed statistical sinusoidal modeling system is capable of reacting to expressive gestures, using the input parameters pitch and intensity. Evaluations of frequency and intensity sweeps show the desired responses in timbral qualities, validating the interpolated inverse transform sampling. The next important step for improving the algorithm is the implementation of a Markovian inverse transform sampling, considering past values in the random sequence generation and thus preserving the frequency characteristics of the synthesis parameters.
Using the actual inverse cumulative mass functions during runtime could further improve the performance of the algorithm. At the current state, the inverse transform sampling requires a search within an unsorted vector, whereas actual inverted functions can be used by simple indexing. The flexibility and compression rate of the model could be increased by using parametric distributions instead of stored distribution functions.
Since the presented approach aims at the synthesis of sustained signals, an integration of parametric transition models [26] and trajectory models for attack and release segments is necessary for completing the synthesis system. Future experiments aim at a perceptual evaluation of synthesized sounds and expressive phrases from the full system. User studies are planned for assessing the applicability of the synthesis approach in a real-time scenario.
6. REFERENCES
[1] Roger B. Dannenberg and Istvan Derenyi, “Combining Instrument and Performance Models for High-Quality Music Synthesis,” Journal of New Music Research, pp. 211–238, 1998.
[2] Diemo Schwarz, “A System for Data-Driven Concatenative Sound Synthesis,” in Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx-00), Verona, Italy, 2000.
[3] Alfonso Perez, Jordi Bonada, Esteban Maestre, Enric Guaus, and Merlijn Blaauw, “Combining Performance Actions with Spectral Models for Violin Sound Transformation,” in International Congress on Acoustics, Madrid, Spain, 2007.
[4] Eric Lindemann, “Music Synthesis with Reconstructive Phrase Modeling,” IEEE Signal Processing Magazine, March 2007.
[5] Henrik Hahn and Axel Röbel, “Extended Source-Filter Model of Quasi-Harmonic Instruments for Sound Synthesis, Transformation and Interpolation,” in Proceedings of the 9th Sound and Music Computing Conference (SMC), Copenhagen, Denmark, 2012, pp. 434–441.
[6] Henrik Hahn and Axel Röbel, “Extended Source-Filter Model for Harmonic Instruments for Expressive Control of Sound Synthesis and Transformation,” in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, 2013.
[7] David Wessel, Cyril Drame, and Matthew Wright, “Removing the Time Axis from Spectral Model Analysis-Based Additive Synthesis: Neural Networks versus Memory-Based Machine Learning,” in Proceedings of the International Computer Music Conference (ICMC), Ann Arbor, Michigan, 1998, pp. 62–65.
[8] Jonathan Driedger, Thomas Prätzlich, and Meinard Müller, “Let it Bee – Towards NMF-Inspired Audio Mosaicing,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR), Malaga, Spain, 2015, pp. 350–356.
[9] Sanna Wager, Liang Chen, Minje Kim, and Christopher Raphael, “Towards Expressive Instrument Synthesis Through Smooth Frame-by-Frame Reconstruction: From String to Woodwind,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 391–395.
[10] Ixone Arroabarren, Miroslav Zivanovic, Xavier Rodet, and Alfonso Carlosena, “Instantaneous Frequency and Amplitude of Vibrato in Singing Voice,” in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, China, 2003, pp. 537–540.
[11] Joseph Curtin and Thomas D. Rossing, “Violin,” in The Science of String Instruments, Thomas Rossing, Ed., pp. 209–244. Springer, 2010.
[12] Axel Roebel, Simon Maller, and Javier Contreras, “Transforming Vibrato Extent in Monophonic Sounds,” in Proceedings of the 14th International Conference on Digital Audio Effects (DAFx), Paris, France, 2011.
[13] Stefan Weinzierl, Steffen Lepa, Frank Schultz, Erik Detzner, Henrik von Coler, and Gottfried Behler, “Sound Power and Timbre as Cues for the Dynamic Strength of Orchestral Instruments,” The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1347–1355, 2018.
[14] Henrik von Coler, Jonas Margraf, and Paul Schuladen, “TU-Note Violin Sample Library,” TU Berlin, http://dx.doi.org/10.14279/depositonce-6747, 2018, Data set.
[15] Henrik von Coler, “TU-Note Violin Sample Library – A Database of Violin Sounds with Segmentation Ground Truth,” in Proceedings of the 21st International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, 2018.
[16] Xavier Serra and Julius Smith, “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition,” Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.
[17] Alain de Cheveigné and Hideki Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[18] Arturo Camacho, SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music, Ph.D. thesis, Gainesville, FL, USA, 2007.
[19] Orchisama Das, Julius Smith, and Chris Chafe, “Real-time Pitch Tracking in Audio Signals with the Extended Complex Kalman Filter,” in Proceedings of the 20th International Conference on Digital Audio Effects (DAFx), Edinburgh, UK, 2017.
[20] Julius O. Smith and Xavier Serra, “PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation,” in Proceedings of the International Computer Music Conference (ICMC), Barcelona, Spain, 2005.
[21] Scott Levine and Julius Smith, “A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale Modifications,” in Proceedings of the 105th Audio Engineering Society Convention, San Francisco, CA, 1998.
[22] Henrik von Coler, “A Jack-based Application for Spectro-Spatial Additive Synthesis,” in Proceedings of the 17th Linux Audio Conference (LAC-19), Stanford University, USA, 2019.
[23] Luc Devroye, “Sample-based Non-uniform Random Variate Generation,” in Winter Simulation Conference, Washington, D.C., USA, Dec. 1986, pp. 260–265.
[24] Luc Devroye, Non-Uniform Random Variate Generation, Springer, 1986.
[25] J. Alonso Moral and E. Jansson, “Input Admittance, Eigenmodes and Quality of Violins,” in STL-QPSR, vol. 23, pp. 60–75. KTH Royal Institute of Technology, 1982.
[26] Henrik von Coler, Moritz Götz, and Steffen Lepa, “Parametric Synthesis of Glissando Note Transitions – A User Study in a Real-Time Application,” in Proceedings of the 21st International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, 2018.