
Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019

STATISTICAL SINUSOIDAL MODELING FOR EXPRESSIVE SOUND SYNTHESIS

Henrik von Coler

Audio Communication Group

TU Berlin

Berlin, Germany

voncoler@tu-berlin.de

ABSTRACT

Statistical sinusoidal modeling is a method for transferring a sample library of instrument sounds into a database of sinusoidal parameters for use in real-time additive synthesis. Single sounds, capturing a musical instrument in combinations of pitch and intensity, are therefore segmented into attack, sustain and release. Partial amplitudes, frequencies and Bark band energies are calculated for all sounds and segments. For the sustain part, all partial and noise parameters are transformed into probabilistic distributions. Interpolated inverse transform sampling is introduced for generating parameter trajectories during synthesis in real time, allowing the creation of sounds located at pitches and intensities between the actual support points of the sample library. Evaluation is performed by qualitative analysis of the system response to sweeps of the control parameters pitch and intensity. Results for a set of violin samples demonstrate the ability of the approach to model dynamic timbre changes, which is crucial for the perceived quality of expressive sound synthesis.

1. INTRODUCTION

A system capable of expressive sound synthesis reacts to dynamic control input with the desired or appropriate changes in sound. In analysis-synthesis systems this means that the perceived timbral qualities of the synthesized sound emulate the behavior of the analyzed instrument as closely as possible. Such systems thus need to capture the individual sound of an instrument and allow manipulations based on a limited set of control parameters. To achieve this, the synthesis approach presented in this paper, entitled statistical sinusoidal modeling, combines a sample-based approach with a novel method for sinusoidal modeling.

Sample-based synthesis, in its basic form, is able to capture individual sounds very accurately but does not offer the manipulation techniques necessary for expressive synthesis [1]. Sinusoidal modeling, on the other hand, offers wide-ranging means of sound manipulation. A key problem of sinusoidal modeling approaches, however, is the mapping of control parameters to the large number of synthesis parameters. Statistical sinusoidal modeling can be regarded as a way of mapping the control parameters pitch and intensity to the parameters of a sinusoidal model. This reduced set of control parameters is often considered the central input for similar sound synthesis systems.

Different approaches aim at improving sample-based sound synthesis. Among them are granular synthesis and corpus-based concatenative synthesis [2]. Combined with spectral manipulation techniques, the flexibility of these approaches is further increased. Such combinations have proven effective for expressive sound synthesis. Examples include spectral concatenative synthesis [3] and reconstructive phrase modeling [4].

Copyright: © 2019 Henrik von Coler. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

An extended source-filter model has been proposed by Hahn et al. [5, 6]. Partial and noise parameters are modeled in dependency of the control parameters pitch and intensity by means of tensor-product B-splines. Two separate filters are used, one representing the instrument-specific features by partial index and another capturing the frequency-dependent partial characteristics. Wessel et al. [7] present a system for removing the temporal axis from the analysis data of sinusoidal models by the use of neural networks and memory-based machine learning. These methods are used to learn mappings of the three control parameters pitch, intensity and brightness to partial parameters. A system combining corpus-based concatenative synthesis with audio mosaicing [8] has been proposed by Wager et al. [9]. This approach is able to synthesize an expressive target melody with arbitrary sound material by target-to-source mapping, using the features pitch, RMS energy and the modulus of the windowed Short-Time Fourier Transform.

A key feature in an expressive re-synthesis of many instruments, especially of bowed strings, are the so-called spectral envelope modulations (SEM) [10]. The amplitude of each partial is modulated by its frequency in relation to the underlying frequency response of the instrument's resonant body. A vibrato in string instruments thus creates a periodic change in the relative partial amplitudes. At the typical vibrato frequencies of 5-9 Hz this effect is perceived as a timbral quality rather than a rhythmical feature. This phenomenon, perceptually also referred to as Sizzle [11], contributes to the individual sound of instruments to a great extent. When spectral modeling techniques are used for manipulations of instrument sounds, this effect is also considered essential for improving the quality [12]. Glissandi result in spectral envelope modulations in the same way.

Another important aspect of expressive re-synthesis is the connection between intensity and the spectral features of the instrument's sound. Increases in intensity usually cause significant changes in the spectral distribution, namely in spectral skewness and spectral flatness [13], as well as in the tonal/noise ratio.

The proposed system is designed to encompass the above-mentioned effects with simple means, enabling an efficient real-time implementation. Details on the analysis, sinusoidal modeling and statistical modeling are presented in Section 2. Section 3 explains the statistical sinusoidal synthesis process in detail, followed by the evaluation of synthesis results in Section 4. The conclusion in Section 5 summarizes the findings and lists perspectives for further development.


Figure 1: Exemplary probability mass function (PMF) with derived cumulative mass function (CMF) and inverted CMF. Panels: (a) PMF, (b) CMF, (c) ICMF.

2. ANALYSIS AND MODELING

2.1. Sample Library

The focus of the presented synthesis system rests on excitation-continuous melody instruments, using a violin as source material for the analysis stage. The TU-Note Violin Sample Library [14] is used for generating the statistical model. Featuring 336 single sounds and 344 two-note sequences, it has been specifically designed for this purpose. For use in this project, the single sounds are reduced to a total of 204, consisting of 51 unique, equally spaced pitches, each captured at four dynamic levels. In the remainder, this two-dimensional space will be referred to as the timbre plane. It must be noted that, depending on the instrument, additional dimensions for timbre control need to be added to this space. For the violin, limited to standard techniques, the proposed reduction is acceptable. MIDI values from 0 to 127 are used to organize the dimensions pitch and velocity. Pitches range from the lowest violin note at MIDI 55 (G3 = 197.3341 Hz at 443 Hz tuning frequency) to the note at MIDI 105 (A7 = 3544 Hz). The dynamic levels pp, mp, mf and ff are captured in the timbre plane. The material has been recorded at 96 kHz with 24 bit resolution and can be downloaded from a static repository [15].

2.2. Sinusoidal Analysis

The sinusoids+noise model [16] is used for extracting the tonal and noise parameters for each single sound. Analysis and modeling are carried out offline, prior to the synthesis stage. Monophonic pitch tracking is performed using the YIN [17] and SWIPE [18] algorithms. Tests with more recent, real-time capable approaches [19] did not improve the performance. Based on the extracted fundamental frequency trajectories, partial trajectories are obtained by peak picking in the short-time Fourier transform (STFT), using a hop size of 256 samples (2.67 ms) and a window size of 4096 samples, zero-padded to 8192 samples.

Quadratic interpolation (QIFFT), as presented by Smith et al. [20], is applied for estimating amplitude and frequency of up to 80 partials in each frame. The partial phases φ_i are obtained by finding the argument of the minimum when subtracting each partial with the individual amplitude a_i and frequency f_i from the complete frame x at different phases φ*:

    φ_i = arg min_{φ*} [ Σ_{n=1}^{L} ( x(n) − a_i sin(2π f_i t(n) + φ*) ) ],  φ* = −π … +π   (1)
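The phase search above can be sketched as a grid search over candidate phases. This is a minimal illustrative sketch, not the paper's C++ implementation; it assumes NumPy, the helper name is hypothetical, and it minimizes a squared residual, a common variant of the criterion in Eq. (1):

```python
import numpy as np

def estimate_phase(x, a_i, f_i, t, n_steps=64):
    # Grid of candidate phases in [-pi, +pi), as in Eq. (1).
    phases = np.linspace(-np.pi, np.pi, n_steps, endpoint=False)
    # Residual energy after subtracting the partial at each candidate phase.
    residuals = [np.sum((x - a_i * np.sin(2 * np.pi * f_i * t + p)) ** 2)
                 for p in phases]
    # Argument of the minimum is the estimated partial phase.
    return phases[int(np.argmin(residuals))]
```

The grid resolution (here 2π/64) bounds the phase error; a finer grid or a local refinement step would tighten the estimate.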

After all partial parameters are extracted, the residual is created by subtracting the tonal part with original phases from the complete sound in the time domain. Modeling of the residual component is performed with a filter bank based on the Bark scale, as proposed by Levine et al. [21]. The instantaneous energy trajectories of all bands are calculated using a sliding RMS with the hop size of the sinusoidal analysis (2.67 ms) and a window length of 21.33 ms.

For each single sound, the analysis results in up to 80 partial amplitude trajectories, 80 partial frequency trajectories, 80 partial phase trajectories and 24 noise band energy trajectories. Since the original phases are not relevant for the proposed synthesis algorithm, they are not used in the further modeling steps.

2.3. Segmentation

The TU-Note Violin Sample Library includes manually annotated segmentation labels, based on a transient/steady-state discrimination. For the single sounds they define the attack, sustain and release portion of each sound. Trajectories during attack and release segments are stored completely and additionally modeled as parametric linear and exponential trajectories. Details on the modeling and synthesis of attack and release segments are not the subject of this paper. The sustain part is synthesized with the statistical sinusoidal modeling approach, explained in detail in the following section.

2.4. Statistical Modeling

After the segmentation, the obtained trajectories of the partials and noise bands during the sustain portion of the sound are transformed into statistical distributions. Probability mass functions (PMF) with 50 equally spaced bins are created and transformed to cumulative mass functions (CMF):

    CMF(x) = Σ_{x_i ≤ x} PMF(x_i)   (2)

Inverse transform sampling relies on the inverted cumulative mass function (ICMF), also referred to as the quantile function, for generating random number sequences with a given distribution. Figure 1 shows an exemplary PMF with the derived CMF and ICMF. For the synthesis algorithm, CMFs and their inversions are calculated and stored for all partial and noise trajectories during the sustain parts. CMFs for the first five partials' amplitudes and frequencies are shown in Figure 2 and Figure 3, respectively. CMFs for the first five noise band energies are shown in Figure 4. Additionally, the mean, median and standard deviation of all distributions are stored with the model.
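The PMF-to-CMF transformation can be sketched as a histogram followed by a running sum. A minimal sketch assuming NumPy; the function name and the choice of bin centers as representative values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def make_cmf(trajectory, n_bins=50):
    # 50-bin probability mass function over the sustain-segment values.
    counts, edges = np.histogram(trajectory, bins=n_bins)
    pmf = counts / counts.sum()
    # Eq. (2): cumulative mass function as the running sum of the PMF.
    cmf = np.cumsum(pmf)
    # Bin centers serve as the representative parameter values.
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, cmf
```

The stored model would hold such a (centers, cmf) pair per partial amplitude, partial frequency and noise band trajectory, plus the mean, median and standard deviation.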


Figure 2: CMFs for the first five partial amplitudes (P(X ≤ x) over log(a_i)).

Figure 3: CMFs for the first five partial frequencies (P(X ≤ x) over f_i).

3. SYNTHESIS

3.1. Algorithm Overview

The implementation of the synthesis algorithm is included in a C++ based framework [22], using the JACK API. Synthesis is performed in the time domain, with a non-overlapping approach and a frame size related to the buffer size of the audio interface. On the test system, a buffer size of 128 samples was used at a sampling rate of fs = 48 kHz, which allows a responsive use of the synthesizer. For generating a single sound, a maximum of 160 partial parameters and 24 Bark band energies have to be generated in each synthesis frame. The full number of 80 partials, however, is only synthesized for pitches below 600 Hz at a sampling rate of 96 kHz, respectively below 300 Hz at 48 kHz. Figure 5 shows the number of synthesized partials in dependency of sampling rate and fundamental frequency.
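The partial count implied by the figures above follows from keeping all harmonics below the Nyquist frequency, capped at 80. A one-line sketch of this relation (the function name is hypothetical):

```python
def n_partials(f0, fs, max_partials=80):
    # All harmonic frequencies k*f0 must stay below Nyquist (fs/2);
    # the count is capped at the model's maximum of 80 partials.
    return min(max_partials, int((fs / 2) // f0))
```

This reproduces the stated limits: the full 80 partials fit below 600 Hz at 96 kHz and below 300 Hz at 48 kHz.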

Listing 1: Pseudo code for the synthesis algorithm.

    for each frame:
        get control inputs
        for all partials:
            generate random frequency value
            generate random amplitude value
            generate linear amplitude ramp
            synthesize sinusoid
            add sinusoid to output
        for all Bark bands:
            generate random band energy
            generate linear energy ramp
            apply band filter to noise signal
            add band signal to output

Figure 4: CMFs for the first five Bark band energy trajectories (P(X ≤ x) over log(rms_i)).

Figure 5: Number of partials synthesized, depending on sampling rate (44.1, 48 and 96 kHz) and fundamental frequency.

For each frame of the synthesis output, a new set of support points is generated, as shown in Listing 1. Interpolation trajectories are generated for the connection to the preceding values of partial amplitudes and noise band energies. Partial frequencies are piece-wise constant.

3.2. Statistical Value Generation

At present, the statistical sinusoidal synthesis offers two different modes for generating parameter trajectories. The mode can be selected individually for each of the three synthesis parameter types (partial amplitude, partial frequency and noise band energy).

3.2.1. Mean/Median Mode

In the mean/median mode, the individual distribution functions are not used. Support points of the parameter trajectories are generated using the mean or median values stored in the model. For a constant control input, the resulting parameter trajectories remain constant, too. Variations in parameters are thus induced only through modulations of the input parameters.

3.2.2. Inverse Transform Sampling

Inverse transform sampling is a method for generating random number sequences with desired distributions from uniformly distributed random processes [23]. The inverted CMF, as shown in Figure 1c, maps the uniform distribution U(0,1) to the target distribution. The method can be implemented using a sequential search [24, p. 85], without actually inverting the distribution functions in advance. For a random value 0 ≤ r ≤ 1 from the uniform distribution, the corresponding value r̃ from the target distribution can be obtained as the argument of the minimum of the difference to the relevant cumulative mass function, as shown in Figure 1b:

    r̃ = arg min_x [ CMF(x) − r ]   (3)

In the implementation, this is realized using a vector search for Equation 3. Binary search trees can increase the efficiency of this approach, and lookup tables or guide tables for the individual distributions are even more efficient [24]. For the chosen number of parameters, the sequential search proved efficient enough to run the synthesis smoothly with 80 partials on an Intel Core i7-5500U CPU at 2.40 GHz.
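The search for the smallest bin whose cumulative mass reaches r can be sketched with a binary search over the stored CMF, one of the efficiency improvements mentioned above. A sketch assuming NumPy and the (centers, cmf) representation; the function name is hypothetical:

```python
import numpy as np

def sample_from_cmf(centers, cmf, r):
    # Binary search for the first index where cmf >= r; this replaces
    # the sequential scan implied by Eq. (3).
    idx = min(int(np.searchsorted(cmf, r)), len(centers) - 1)
    return centers[idx]
```

Feeding uniform draws r into this function yields a random sequence distributed according to the stored PMF.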

3.3. Timbre Plane Interpolation

Figure 6: Interpolation in the timbre plane: a point P inside the square spanned by the support points A, B, C and D on the intensity and pitch axes.

For the use of expressive input streams for pitch and intensity, arbitrary points in the timbre plane need to be synthesized. Interpolation between the support points generated by analyzing the sample library is possible for the mean/median mode as well as for the inverse transform sampling mode. Figure 6 shows a point P located in the square ABCD between four support points. The weights for the parameters at each support point are calculated by the following distance-based formulae:

    w_A = (1 − x)(1 − y)   (4)
    w_B = x (1 − y)   (5)
    w_C = x · y   (6)
    w_D = (1 − x) y   (7)

In the mean/median mode, the weights w_i can be directly applied to the mean or median values m_i corresponding to the parameter values at the given points A, B, C and D for obtaining the interpolated average m̃:

    m̃ = w_A m_A + w_B m_B + w_C m_C + w_D m_D   (8)
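Equations (4)-(8) amount to bilinear interpolation inside the unit square. A minimal sketch (function names are illustrative):

```python
def bilinear_weights(x, y):
    # Eqs. (4)-(7): distance-based weights for a point (x, y) in the
    # unit square spanned by support points A, B, C, D.
    return ((1 - x) * (1 - y), x * (1 - y), x * y, (1 - x) * y)

def interp_mean(x, y, m):
    # Eq. (8): weighted sum of the four support-point means/medians,
    # given as a tuple m = (m_A, m_B, m_C, m_D).
    return sum(w * mi for w, mi in zip(bilinear_weights(x, y), m))
```

The weights always sum to one, and at each corner the interpolation reproduces the corresponding support-point value exactly.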

In the case of inverse transform sampling, the interpolation is performed as presented in Figure 7. A single random value r is generated from a uniformly distributed random process U(0,1). This value is then used to generate four random values r̃_i using the CMFs at the four support points. These resulting values are finally multiplied by the weights from Equations 4–7 and summed to obtain the interpolated random value r̃*:

    r̃* = w_A r̃(CMF_A) + w_B r̃(CMF_B) + w_C r̃(CMF_C) + w_D r̃(CMF_D)   (9)
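The interpolated inverse transform sampling of Eq. (9) can be sketched as follows, assuming NumPy and per-corner (centers, cmf) pairs as above; the function name is hypothetical:

```python
import numpy as np

def interp_its(r, corner_cmfs, corner_centers, weights):
    # Eq. (9): one shared uniform draw r is mapped through the CMF of
    # each of the four support points ...
    vals = []
    for cmf, centers in zip(corner_cmfs, corner_centers):
        idx = min(int(np.searchsorted(cmf, r)), len(centers) - 1)
        vals.append(centers[idx])
    # ... and the four results are blended with the bilinear weights.
    return float(np.dot(weights, vals))
```

Using a single shared draw r, rather than four independent draws, keeps the four corner samples coherent, so the weighted sum interpolates smoothly between the corner distributions.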

Figure 7: Interpolated inverse transform sampling: a single uniform draw r is mapped through CMF_A, CMF_B, CMF_C and CMF_D, the results r̃_A … r̃_D are weighted by w_A … w_D and summed to r̃*.

3.4. Smoothing

The inverse transform sampling method as presented above does not consider the recent state when generating new support points. Hence, it does not capture the frequency characteristics of the analyzed trajectories. As a result, rapid changes may occur in the synthesized trajectories which are not included in the original signals, although the resulting distribution functions are correct. For that reason, an adjustable low-pass filter is inserted after the random number generators for smoothing the trajectories. It should be noted that this filtering process narrows the distribution functions.
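The paper does not specify the filter structure; a one-pole low-pass is a plausible minimal sketch of such an adjustable smoother (class name and coefficient convention are assumptions):

```python
class OnePoleSmoother:
    # Adjustable one-pole low-pass inserted after the random number
    # generator: coeff = 0 passes values through unchanged,
    # coeff -> 1 smooths heavily (and narrows the value distribution).
    def __init__(self, coeff):
        self.coeff = coeff
        self.state = None

    def process(self, x):
        if self.state is None:
            self.state = x  # initialize on the first support point
        else:
            self.state = self.coeff * self.state + (1.0 - self.coeff) * x
        return self.state
```

With a constant input the output converges to that input; with random support points the output trajectory varies more slowly, at the cost of a reduced spread, matching the narrowing effect noted above.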

4. MEASUREMENTS

For evaluating the ability of the proposed synthesis algorithm to react to expressive control streams, the responses to sweeps in the frequency and intensity dimension are captured and analyzed by qualitative means. Only the deterministic component is used for this evaluation, discarding the noise.

4.1. Frequency Sweeps

Figure 8: SEM for an octave sweep of the first partial (amplitude a_1 over frequency f).

For analyzing the effect of the spectral envelope modulations, a frequency sweep of one octave is sent to the synthesis system at four different intensities. The sweep ranges from the lowest tone at MIDI 55 (G3, 197.33 Hz) to MIDI 67 (G4, 394.67 Hz). The responses of all active partials to the frequency sweeps are recorded as separate signals for analysis.

Figure 9: Representations of the frequency response through spectral envelope modulations of 30 partials by a one-octave sweep, at four intensities: (a) low (pp), (b) medium-low (mp), (c) medium-high (mf), (d) high (ff).

Figure 8 shows the amplitude of the first partial as a function of the fundamental frequency. The resulting trajectory shows no discontinuities, validating the interpolation process. It further shows a prominent peak at approximately 257 Hz, caused by the spectral envelope modulations. By joining the amplitude-over-frequency trajectories of the first 30 partials, the frequency response of the instrument can be visualized through SEM. Results are shown for MIDI intensities 20 (pp), 50 (mp), 80 (mf) and 120 (ff) in Figure 9. With increasing partial index, the overlap with the neighboring partial trajectories increases. The approximated frequency responses are thus blurred at higher frequencies.

Figure 10: Input admittance at the bass bar side of an Andrea Guarneri violin [25].

All four representations of the frequency responses in Figure 9 show the same prominent peaks. These peaks correspond to the formants typical for violins. For comparison, the input admittance of a Guarneri violin is shown in Figure 10.

Characteristic resonances of violin bodies have been labeled inconsistently by different researchers. However, referring to Curtin et al. [11], the prominent resonances for Figure 10 are listed in Table 1. Plots 9a–9d all show the f-hole resonance at 284 Hz and the main wood resonance, respectively the lowest corpus mode, at 415 Hz. At higher intensities, the plots show peaks at 709 Hz, 872 Hz and 1170 Hz, related to the upper wood resonances and the lateral air motion. The so-called violin formant is represented by a region of increased energy between 2000 Hz and 3000 Hz.

Table 1: Main resonances of a violin body [25, 11].

    Label     Frequency        Description
    A0        275 Hz           f-hole resonance
    C2 (T1)   460 Hz           main wood
    C3        530 Hz           second wood
    C4        700 Hz           third wood
    F         1000 Hz          lateral air motion
              2000–3000 Hz     violin formant, bridge hill

4.2. Intensity Modulations

The response of the synthesis system to changes in intensity is captured at four different pitches. Intensity sweeps from 0 to 127 are used at MIDI pitches 55 (G3, 197.33 Hz), 67 (G4, 394.67 Hz), 79 (G5, 789.33 Hz) and 93 (A6, 1772.00 Hz). The plots in Figure 11 show the spectrum of the harmonic signal in dependency of the intensity, sampled at the partial frequencies. For higher pitches, the number of partials is reduced, resulting in a lower frequency resolution. An increase in high-frequency content is indicated for higher intensities at all pitches.

Figure 11: Amplitudes of the first 50 partials in dependency of intensity, captured for four different pitches: (a) low (MIDI 55), (b) medium-low (MIDI 67), (c) medium-high (MIDI 79), (d) high (MIDI 93).

Figure 12: Harmonic spectral centroid as a function of intensity for four different MIDI pitches (55, 67, 79, 93).

The harmonic spectral centroid (HSC) is calculated at 50 equally spaced points within all intensity sweeps for analyzing the influence of the intensity on the harmonic component of the signal. Based on the spectral centroid, the HSC regards only the amplitudes a_i of the partials, resulting in a pitch-independent measure for the spectral distribution of the partials:

    HSC = ( Σ_{i=1}^{N} i · a_i ) / ( Σ_{i=1}^{N} a_i )   (10)
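Eq. (10) is an amplitude-weighted mean of the partial index. A minimal sketch (function name is illustrative):

```python
def harmonic_spectral_centroid(amps):
    # Eq. (10): amplitude-weighted mean of the partial index i.
    # Using the index instead of the partial frequency makes the
    # measure independent of pitch.
    num = sum(i * a for i, a in enumerate(amps, start=1))
    return num / sum(amps)
```

For example, equal amplitudes over N partials give a centroid of (N + 1) / 2, while energy concentrated in the first partial pulls the centroid toward 1.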

Figure 12 shows the HSC as a function of the intensity for four different pitches. All trajectories show a quasi-monotonic increase of the HSC with increasing intensity. Changes in intensity thus result in changes in timbre, respectively in brightness.

5. CONCLUSION

The proposed statistical sinusoidal modeling system is capable of reacting to expressive gestures, using the input parameters pitch and intensity. Evaluations of frequency and intensity sweeps show the desired responses in timbral qualities, validating the interpolated inverse transform sampling. The next important step for improving the algorithm is the implementation of a Markovian inverse transform sampling, considering past values for the random sequence generation and thus preserving the frequency characteristics of the synthesis parameters.

Using the actual inverse cumulative mass functions during runtime could further improve the performance of the algorithm. At the current state, the inverse transform sampling requires a search within an unsorted vector, whereas actually inverted functions could be used by simple indexing. The flexibility and compression rate of the model could be increased by using parametric distributions instead of stored distribution functions.

Since the presented approach aims at the synthesis of sustained signals, an integration of parametric transition models [26] and trajectory models for attack and release segments is necessary for completing the synthesis system. Future experiments aim at a perceptual evaluation of synthesized sounds and expressive phrases from the full system. User studies are planned for assessing the applicability of the synthesis approach in a real-time scenario.


6. REFERENCES

[1] Roger B. Dannenberg and Istvan Derenyi, "Combining Instrument and Performance Models for High-Quality Music Synthesis," Journal of New Music Research, pp. 211–238, 1998.

[2] Diemo Schwarz, "A System for Data-Driven Concatenative Sound Synthesis," in Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx-00), Verona, Italy, 2000.

[3] Alfonso Perez, Jordi Bonada, Esteban Maestre, Enric Guaus, and Merlijn Blaauw, "Combining Performance Actions with Spectral Models for Violin Sound Transformation," in International Congress on Acoustics, Madrid, Spain, 2007.

[4] Eric Lindemann, "Music Synthesis with Reconstructive Phrase Modeling," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 80–91, March 2007.

[5] Henrik Hahn and Axel Röbel, "Extended Source-Filter Model of Quasi-Harmonic Instruments for Sound Synthesis, Transformation and Interpolation," in Proceedings of the 9th Sound and Music Computing Conference (SMC), Copenhagen, Denmark, 2012, pp. 434–441.

[6] Henrik Hahn and Axel Röbel, "Extended Source-Filter Model for Harmonic Instruments for Expressive Control of Sound Synthesis and Transformation," in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, 2013.

[7] David Wessel, Cyril Drame, and Matthew Wright, "Removing the Time Axis from Spectral Model Analysis-Based Additive Synthesis: Neural Networks versus Memory-Based Machine Learning," in Proceedings of the International Computer Music Conference (ICMC), Ann Arbor, Michigan, 1998, pp. 62–65.

[8] Jonathan Driedger, Thomas Prätzlich, and Meinard Müller, "Let it Bee – Towards NMF-Inspired Audio Mosaicing," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), Malaga, Spain, 2015, pp. 350–356.

[9] Sanna Wager, Liang Chen, Minje Kim, and Christopher Raphael, "Towards Expressive Instrument Synthesis Through Smooth Frame-by-Frame Reconstruction: From String to Woodwind," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 391–395.

[10] Ixone Arroabarren, Miroslav Zivanovic, Xavier Rodet, and Alfonso Carlosena, "Instantaneous Frequency and Amplitude of Vibrato in Singing Voice," in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, China, 2003, pp. 537–540.

[11] Joseph Curtin and Thomas D. Rossing, "Violin," in The Science of String Instruments, Thomas Rossing, Ed., pp. 209–244. Springer, 2010.

[12] Axel Roebel, Simon Maller, and Javier Contreras, "Transforming Vibrato Extent in Monophonic Sounds," in Proceedings of the 14th International Conference on Digital Audio Effects (DAFx), Paris, France, 2011.

[13] Stefan Weinzierl, Steffen Lepa, Frank Schultz, Erik Detzner, Henrik von Coler, and Gottfried Behler, "Sound Power and Timbre as Cues for the Dynamic Strength of Orchestral Instruments," The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1347–1355, 2018.

[14] Henrik von Coler, Jonas Margraf, and Paul Schuladen, "TU-Note Violin Sample Library," TU Berlin, http://dx.doi.org/10.14279/depositonce-6747, 2018, Data set.

[15] Henrik von Coler, "TU-Note Violin Sample Library – A Database of Violin Sounds with Segmentation Ground Truth," in Proceedings of the 21st International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, 2018.

[16] Xavier Serra and Julius Smith, "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.

[17] Alain de Cheveigné and Hideki Kawahara, "YIN, a Fundamental Frequency Estimator for Speech and Music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.

[18] Arturo Camacho, Swipe: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music, Ph.D. thesis, University of Florida, Gainesville, FL, USA, 2007.

[19] Orchisama Das, Julius Smith, and Chris Chafe, "Real-time Pitch Tracking in Audio Signals with the Extended Complex Kalman Filter," in Proceedings of the 20th International Conference on Digital Audio Effects (DAFx), Edinburgh, UK, 2017.

[20] Julius O. Smith and Xavier Serra, "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation," in Proceedings of the International Computer Music Conference (ICMC), Barcelona, Spain, 2005.

[21] Scott Levine and Julius Smith, "A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale Modifications," in Proceedings of the 105th Audio Engineering Society Convention, San Francisco, CA, 1998.

[22] Henrik von Coler, "A Jack-based Application for Spectro-Spatial Additive Synthesis," in Proceedings of the 17th Linux Audio Conference (LAC-19), Stanford University, USA, 2019.

[23] Luc Devroye, "Sample-based Non-uniform Random Variate Generation," in Winter Simulation Conference, Washington, D.C., USA, Dec. 1986, pp. 260–265.

[24] Luc Devroye, Non-Uniform Random Variate Generation, Springer, New York, 1986.

[25] J. Alonso Moral and E. Jansson, "Input Admittance, Eigenmodes and Quality of Violins," in STL-QPSR, vol. 23, pp. 60–75. KTH Royal Institute of Technology, 1982.

[26] Henrik von Coler, Moritz Götz, and Steffen Lepa, "Parametric Synthesis of Glissando Note Transitions – A User Study in a Real-Time Application," in Proceedings of the 21st International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, 2018.
