Proc. of the 16th Int. Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2-6, 2013
EXTENDED SOURCE-FILTER MODEL FOR HARMONIC INSTRUMENTS FOR
EXPRESSIVE CONTROL OF SOUND SYNTHESIS AND TRANSFORMATION
Henrik Hahn,
IRCAM-CNRS UMR 9912-STMS
Paris, France
henrik.hahn@ircam.fr
Axel Röbel
IRCAM-CNRS UMR 9912-STMS
Paris, France
axel.roebel@ircam.fr
ABSTRACT
In this paper we present a revised and improved version of a recently proposed extended source-filter model for sound synthesis, transformation and hybridization of harmonic instruments. This extension focuses mainly on the application to impulsively excited instruments like piano or guitar, but also improves synthesis results for continuously driven instruments, including their hybrids. The technique comprises an extensive analysis of an instrument's sound database, followed by the estimation of a generalized instrument model reflecting timbre variations according to selected control parameters. Such an instrument model allows for natural sounding transformations and expressive control of instrument sounds with respect to these control parameters.
1. INTRODUCTION
Digital music synthesizers based on prerecorded sound samples of an instrument (samplers) only represent a discretized version of the instrument's possible timbre space. Creating such a database of sound examples always involves a compromise between memory usage, recording effort and the granularity of the timbre space. This usually leads either to audible jumps during playback of notes with similar intensity or to the so-called machine-gun effect: if one note is synthesized with equal intensity and pitch several times, the same sound sample is played back each time, making the sound synthesis static.
There are contextual systems using small musical units [1] or phrases [2] that also try to establish a less static sample-based sound synthesis. These systems, however, are not based on single note recordings of musical instruments but on more complex sound data sets and therefore cannot be built upon existing libraries like RWC [3], the Vienna Symphonic Library or IRCAM Solo Instruments. Current musical samplers therefore rely on sound transformations to interpolate the discretized timbre space using the available sound samples. Such transformations are usually based on the source-filter model of sound production [4] [5] with an assumed white source and a coloring filter. This model is still being used for transformations of sounds [6] [7], even though the assumption of a white source and a single filter does not match the physical phenomena. The glottal source in voice signals as well as the modes of excitation of musical instruments (strings, air pipes, etc.) do not exhibit white distributions of harmonics, but highly pronounced spectral envelopes. For voice processing, glottal source estimation was first addressed with the Liljencrants-Fant model [8], and non-white source assumptions are now widely accepted for all kinds of voice signal transformations [9] [10] [11] [12]. For instrument sounds, this is much more an emerging topic. Models exploiting the idea of non-white sources and control of gestural parameters like pitch and intensity have recently been addressed for instrumental sound synthesis [13], but the use of several separate filters with different parametrizations has also been proposed for instrument recognition [14] and source separation [15]. An extension using an additional sinusoids plus noise model [16] [17] for sound morphing has been presented in [18].
This research has been financed by the French ANR project Sample Orchestrator 2.
We recently proposed a generalized instrument model [19] for driven instruments like bowed strings and blown instruments, which extends the basic source-filter approach by treating sinusoidal and noise signal components separately with individually established filter modules. For the sinusoidal component, a white source with two filter functions controlled by gestural parameters has been proposed. The noise component has been introduced in analogy to its sinusoidal counterpart, but with a single filter function sharing the same control parameters. In our approach, model parameters are learned from an instrument's sound database using a gradient descent method. In this paper we present a revised mathematical formalism and a generalized constraint framework for an improved modeling of driven instruments, together with enhancements dedicated to impulsively excited instruments, to improve synthesis results in general. We further introduce a method for gradient normalization to significantly decrease the computational effort for model adaptation, and we describe how to use the proposed model for sound synthesis with expressive control parameters.
In section 2 we give a summary of our proposed instrument model with a detailed discussion of issues arising from the different modes of excitation of musical instruments, followed by the introduction of our new constraint framework in section 3 and the description of the gradient normalization in section 4. Section 5 depicts visual results of our instrument modeling, whereas section 6 describes our synthesis model and links to a number of sound examples available online. The conclusion is given in section 7.
2. GENERALIZED INSTRUMENT MODEL
2.1. System Overview
The initial version of our source-filter instrument model was described in [20] and extended in [19] for sound transformation and interpolation. The model is learned from an instrument's database of recorded sounds and represents the timbre variations of an instrument according to certain selected control parameters. In [19] we proposed the global note intensity, the temporal intensity envelope and the pitch as these parameters, as we assumed them to be the most influential ones for the sound variations of a musical instrument.
We established an instrument model reflecting these variations separately for a signal's harmonic and noise content. With a model reflecting instrument timbre variations due to certain gestural control parameters, filter envelopes can be generated to transform any selected sound sample according to manipulations of these control parameters. Transformations therefore include changes of a note's intensity, pitch transpositions and changes to the temporal intensity envelope. And since the instrument model is learned from recorded sounds, these transformations are expected to generate synthesis results which are indistinguishable from their natural recordings.
Hence, this approach comprises a sound analysis, a model adaptation and a source signal creation stage to generate the desired components for our purpose of an expressive sound synthesis with control of gestural parameters.
2.2. Sound Datasets
To establish our synthesis model we use several instruments from the IRCAM Solo Instruments library. Each data set consists of single note recordings with constant pitch and intensity. They always cover the whole pitch range of the instrument in chromatic steps, whereas three intensity levels (pp, mf, ff) have been recorded for the driven instruments and up to eight different intensity values for the piano. All sounds have been recorded at 24 bit and 44.1 kHz.
2.3. Sound Analysis
The target of our sound analysis is to transform the given input signals into a domain which can be used for modeling as well as for sound synthesis. Since our aim is to treat the harmonic and noise components of instrument signals separately, the first step in sound analysis is their segregation. We therefore employ a sinusoidal analysis by means of partial tracking [16]. Following eq. (1), an input signal x^{(α)}(t) can be expressed as a sum of K sinusoids with time-varying amplitudes a_k^{(α)}(t) and phases φ_k^{(α)}(t) and some residual noise ν^{(α)}(t), with (α) denoting a certain input signal from the data set.
x^{(\alpha)}(t) = \sum_{k=1}^{K} a_k^{(\alpha)}(t)\,\cos\!\big(\phi_k^{(\alpha)}(t)\big) + \nu^{(\alpha)}(t)    (1)
Since all samples exhibit a constant pitch, the time-varying amplitude and frequency of the harmonic partials k can be estimated for each time frame n of a signal, using a filterbank with center frequencies according to f_k = k · f_0. However, for the piano as well as for all impulsively excited string-based instruments, we need to take their inharmonicity into account, since the frequencies of the harmonics are not exact multiples of the fundamental but lie at increased positions. We are using a novel algorithm [21] to reliably estimate the positions of the harmonics, which we use for the partial tracking of the piano. These estimated sinusoids with time-varying amplitude and frequency therefore represent our harmonic signal component. Considering the use of single note recordings, we assume the sounds to be stationary in pitch and harmonic contour, and we therefore approximate the frequencies of the harmonics by their assumed optimal positions according to the rule in eq. (2), with β = 0 for driven instruments and an estimated β > 0 for the piano.
f_k = k f_0 \sqrt{1 + k^2 \beta}    (2)
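As a minimal illustration of eq. (2), the following Python sketch computes the approximated partial frequencies; the β value in the usage line is purely illustrative and not taken from the paper.

```python
import numpy as np

def partial_frequencies(f0, K, beta=0.0):
    """Partial frequencies per eq. (2): f_k = k * f0 * sqrt(1 + k^2 * beta).

    beta = 0 gives the purely harmonic case (driven instruments); a small
    positive beta models the stretched partials of stiff strings (piano).
    """
    k = np.arange(1, K + 1)
    return k * f0 * np.sqrt(1.0 + beta * k ** 2)

# Illustrative only: first 10 partials of a 110 Hz tone with an assumed beta.
print(partial_frequencies(110.0, 10, beta=1e-4))
```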
As our instrument model shall reflect timbre variations of a musical instrument according to certain gestural control parameters, we define these parameters to be the global note intensity I, the pitch P and the time-varying temporal intensity envelope E(n). The global intensity reflects the note's overall loudness and is denoted in musical terms by pp (pianissimo) for a very quiet and by ff (fortissimo) for a very loud playing style, which usually come with highly different spectral shapes. The pitch encodes the position of the fundamental frequency and therefore of all higher harmonics, making it a decisive parameter as well. All global intensity as well as pitch values are organized in equidistant steps for our model; that is, the pitch values are scaled on a logarithmic frequency axis using their MIDI values. Both the intensity and the pitch values are taken from the meta information belonging to each sound data set.
These parameters, however, do not account for temporal timbre changes due to the attack or release of a signal. As we aim for an instrument model reflecting these temporal changes as well, we assume them to be associated with the temporal intensity envelope. Therefore, we represent timbre changes due to the attack, sustain and release phases of a signal by means of level changes of the temporal intensity envelope.
With this assumption, timbre variations during the attack or release are reflected by lower values of the intensity envelope, whereas an approximately steady timbre is assumed during the signal's sustain phase, which shows a more or less constant temporal signal intensity over time at the maximum value. The temporal intensity envelope of the sinusoidal signal is computed by summing the energy of all detected harmonic sinusoids and is represented on a decibel scale normalized to 0 dB. As the attack and release phases share values below the maximum, we employ a threshold-based temporal segmentation of the intensity envelope to distinguish between the attack, sustain and release signal components, as shown in fig. 1. Since impulsively excited signals do not contain a steady state, the end of the attack n_A and the beginning of the release n_R become equal. The segmentation is done by creating two sets n_s, s ∈ {1, 2}, of frames associated with the attack-to-sustain and sustain-to-release regions respectively. These regions are overlapping for driven instrument signals, but distinct for impulsive ones.
Figure 1: Temporal segmentation scheme for sustained (left) and impulsive (right) instrument signals using a threshold on the intensity envelope E(n) in dB. Signal regions attack, sustain and release are indicated (top) as well as the temporal segmentation into the frame sets n_1 and n_2 (bottom), using overlapping segments for driven instruments and distinct segments for impulsively excited signals.
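To make the segmentation concrete, here is a small sketch of one plausible reading of the threshold rule above; the threshold value and the handling of edge cases are our assumptions.

```python
import numpy as np

def segment_envelope(E_dB, threshold_dB=-20.0):
    """Split frame indices into the sets n_1 (attack to sustain) and n_2
    (sustain to release) using a threshold on the normalized envelope E(n).

    E_dB is assumed to be normalized so that max(E_dB) == 0. n_A is taken as the
    first frame reaching the threshold and n_R as the last one; for impulsive
    sounds the two nearly coincide, so n_1 and n_2 become (almost) distinct.
    """
    above = np.flatnonzero(E_dB >= threshold_dB)
    if above.size == 0:                      # envelope never reaches the threshold
        mid = len(E_dB) // 2
        return np.arange(mid), np.arange(mid, len(E_dB))
    n_A, n_R = above[0], above[-1]
    n1 = np.arange(0, n_R + 1)               # attack + sustain frames
    n2 = np.arange(n_A, len(E_dB))           # sustain + release frames
    return n1, n2
```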
As shown in def. (3), the harmonic signal component X_h^{(α)} is represented by the partial amplitudes A_k^{(α)} on a dB scale, their approximated partial frequencies f_k^{(α)}, the triplet of gestural control parameters denoted Θ^{(α)}(n) and both temporal segments n_1 and n_2. Note that the gestural parameters are a function of time, since the temporal intensity envelope E_h^{(α)}(n) is time-varying, whereas f_k^{(α)} is not, since we use the ideal frequency values of the partials for modeling.
X_h^{(\alpha)}: \begin{cases} A_k^{(\alpha)}(n), & k = 1 \dots K,\ n = 1 \dots N \\ \Theta^{(\alpha)}(n) = \{I, E_h(n), P\}^{(\alpha)} \\ f_k^{(\alpha)}, & k = 1 \dots K \\ n_s = \{n_s\}_{s \in \{1,2\}} \end{cases}    (3)
The residual noise signal x_n^{(α)}(t) is generated by subtracting the synthesized harmonic signal x_h^{(α)}(t), built from the estimated partials, from the original input signal x^{(α)}(t). The noise signal is then analyzed by means of time-varying cepstral envelopes with a small number L of cepstral coefficients l for a smooth modeling of the noise envelopes. These time-varying cepstral coefficients are written as C_l^{(α)}(n), and again we establish a triplet of gestural control parameters Θ^{(α)}(n) with its own temporal intensity envelope E_r^{(α)}(n) and the according temporal segmentation into n_1 and n_2 to obtain the noise signal representation shown in def. (4), similar to its harmonic counterpart. It technically only differs in the missing partial frequencies.
X_r^{(\alpha)}: \begin{cases} C_l^{(\alpha)}(n) \\ \Theta^{(\alpha)}(n) = \{I, E_r(n), P\}^{(\alpha)} \\ n_s = \{n_s\}_{s \in \{1,2\}} \end{cases}    (4)
These representations of the harmonic and noise signal components of an input signal x^{(α)}(t) are used to create the desired models, and since they share a similar design, both models can be established in an analogous manner.
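For illustration, the two representations of defs. (3) and (4) can be held in simple containers like the following; the class and field names are ours and purely hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HarmonicObservation:      # X_h^(alpha) of def. (3)
    A: np.ndarray               # partial amplitudes in dB, shape (K, N)
    f: np.ndarray               # approximated partial frequencies, shape (K,)
    I: float                    # global note intensity
    P: float                    # pitch (MIDI)
    E: np.ndarray               # temporal intensity envelope E_h(n), shape (N,)
    n_s: tuple                  # (n_1, n_2) frame-index sets from the segmentation

@dataclass
class NoiseObservation:         # X_r^(alpha) of def. (4)
    C: np.ndarray               # cepstral coefficients, shape (L, N)
    I: float
    P: float
    E: np.ndarray               # temporal intensity envelope E_r(n), shape (N,)
    n_s: tuple
```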
2.4. Harmonic Model
Since we represent the harmonic signal component by means of an additive model, we are able to formulate a harmonic instrument model with respect to both the partial index and the partial's frequency to represent timbre variations according to the gestural control parameters. This allows us to incorporate the idea that the harmonics of an instrument sound exhibit certain features which are related to the signal's fundamental frequency and other features which are not. In other words, there exist features which are best described by the partial index and features which are best described by the partial's frequency. The former may refer to known effects like the missing even partials in clarinet sounds and therefore mainly represents the characteristics of the modes of excitation of an instrument signal, whereas the latter features characterize an instrument's resonance behavior at certain fixed frequencies and may accordingly be interpreted as the instrument's corpus. Both interpretations are only roughly consistent with real physical phenomena, but as we try to establish a generalized instrument model, certain specific features and characteristics need to be neglected for the sake of universality.
Following the distinction above, we introduce two separate filter functions. The first filter is established to reflect instrumental sound features by partial index and is created from separate amplitude functions for each possible partial k of that instrument and both temporal segments s. We further assume the amplitude of a partial k to be mainly characterized by our gestural control parameters: global note intensity I, temporal intensity envelope E(n) and the signal's pitch P. This allows us to model the characteristics of each partial independently of its frequency and neighbors, and also independently for its attack and release phases. We denote it the source filter S_{k,s}(Θ). The second filter shall reflect the instrument's features by frequency and will consequently be a function of the partial frequencies. We assume it to be independent of the gestural control parameters intensity, temporal intensity envelope and pitch, hence denoting it the resonance filter F(f_k). This leads to eq. (5), where the log-amplitude Â of a partial k within a certain temporal signal segment s depends on the parameters {Θ, f_k} and is obtained as the sum of the two filters.
\hat{A}_{k,s}(\Theta, f_k) = S_{k,s}(\Theta) + F(f_k)    (5)
In [19] we introduced the use of B-splines [22] to model both filters as continuous trajectories according to their parameters. While for the second filter a univariate B-spline B_v with weights δ_v is used, we proposed the use of a tensor-product B-spline B_u [23] to model the source function of the partials with respect to the gestural control parameters, using the weights γ_{k,s}^u. We also added an additional offset parameter Φ_h, set to a desired amplitude minimum, to obtain eq. (6). Fig. 2 shows three exemplarily chosen B-spline models which are used to generate the tensor-product B-splines B_u used to model the partial functions.
\hat{A}_{k,s}(\Theta, f_k) = \sum_{u}^{U} B_u(\Theta)\,\gamma_{k,s}^{u} + \sum_{v}^{V} B_v(f_k)\,\delta_v + \Phi_h    (6)
Figure 2: Three B-spline models, defined over the global intensity I (pp to ff), the temporal intensity envelope E(n) in dB and the pitch P in MIDI, used to create the tensor-product B-spline model for B_u(Θ).
Cutouts of the tensor-product B-splines are shown in fig. 3 to illustrate how a hyperplane can be established in a higher dimensional space using a tensor product of several univariate B-splines. For the resonance filter we employ a univariate B-spline model, as shown in fig. 4, to continuously interconnect the positions of the frequencies of all partials present in an instrument's data set. This allows for a smooth modeling of the filter function, while only observing a sampled version of its characteristic.
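The following sketch shows how eq. (6) can be evaluated with standard B-spline tools; the knot vectors, spline degrees and the layout of the weight arrays are our assumptions, not taken from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(knots, degree):
    """Return a function evaluating all B-spline basis functions of a knot vector."""
    n = len(knots) - degree - 1
    splines = [BSpline(knots, np.eye(n)[i], degree, extrapolate=False) for i in range(n)]
    return lambda x: np.nan_to_num(np.array([s(x) for s in splines]))

def a_hat(theta, f_k, gamma_ks, delta, phi_h, basis_I, basis_E, basis_P, basis_F):
    """Evaluate Â_{k,s}(Θ, f_k) of eq. (6) for one partial k and segment s.

    gamma_ks : source-filter weights γ^u_{k,s}, shape (nI, nE, nP)
    delta    : resonance-filter weights δ_v, shape (nF,)
    """
    bI, bE, bP = basis_I(theta[0]), basis_E(theta[1]), basis_P(theta[2])
    # Tensor-product B-spline B_u(Θ) contracted with the weights γ^u_{k,s}.
    source = np.einsum('i,j,l,ijl->', bI, bE, bP, gamma_ks)     # S_{k,s}(Θ)
    resonance = basis_F(f_k) @ delta                            # F(f_k)
    return source + resonance + phi_h
```

In use, basis_I, basis_E, basis_P and basis_F would be built with bspline_basis from knot vectors spanning the pp-ff range, the dB range of E(n), the MIDI pitch range and the frequency axis respectively.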
2.5. Noise Model
The noise model has been proposed in a similar manner as its harmonic counterpart, but with only one filter. This filter is established by separate functions for each cepstral coefficient l and each temporal segment s, each with respect to the control parameters Θ, leading to the model in eq. (7).
Figure 3: Cutouts along two dimensions of the 3-dimensional tensor-product B-spline B_u(Θ).
Figure 4: A B-spline model used for B_v(f_k), defined over the frequency axis in kHz.
Again, we utilize a tensor-product B-spline to represent the behavior of each cepstral coefficient according to the gestural control parameters Θ, together with an additional offset parameter Φ_r set to some desired noise envelope minimum. Note that this offset parameter has to be given in the cepstral domain.
\hat{C}_{l,s}(\Theta) = S_{l,s}(\Theta) = \sum_{w}^{W} B_w(\Theta)\,\epsilon_{l,s}^{w} + \Phi_r    (7)
2.6. Parameter Estimation
To adapt the harmonic as well as the noise model to a selected data set of recordings of an instrument, the parameters to estimate are the B-spline weight parameters γ_{k,s}^u and δ_v together with the corresponding noise-model weights of eq. (7). Even though both models are linear, a closed-form solution is impractical, as a single model can have up to 100k parameters and even small data sets exhibit several million data points, since every single partial in every frame represents one data point of its own; this leads to unmanageable matrix sizes even with single precision. Therefore, we utilize the scaled conjugate gradient method [24] (SCG) and define an error function in a least-squares sense (8). As all conjugate gradient techniques are offline methods, the error R_h^{(α)} as well as its gradient have to be accumulated over all data samples (α). The gradient is easy to determine, and the functions for the noise model are easy to derive from the harmonic case.
R_h^{(\alpha)} = \frac{1}{2} \sum_{s} \sum_{n \in n_s} \sum_{k=1}^{K} \left( A_k^{(\alpha)}(n) - \hat{A}_{k,s}\!\left(\Theta^{(\alpha)}(n), f_k^{(\alpha)}\right) \right)^2    (8)
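As a sketch of how the data term of eq. (8) is accumulated over the whole data set (the gradient is accumulated analogously before each SCG step), using the hypothetical observation container introduced above:

```python
def harmonic_error(observations, predict):
    """Accumulate R_h = sum over alpha of R_h^(alpha) as in eq. (8).

    observations : iterable of HarmonicObservation-like objects
    predict      : callable (theta, f_k, k, s) -> model amplitude Â_{k,s}(Θ, f_k)
                   in dB, e.g. built from the eq. (6) sketch above
    """
    total = 0.0
    for obs in observations:
        for s, frames in enumerate(obs.n_s):          # the two temporal segments
            for n in frames:
                theta = (obs.I, obs.E[n], obs.P)
                for k in range(obs.A.shape[0]):
                    residual = obs.A[k, n] - predict(theta, obs.f[k], k, s)
                    total += 0.5 * residual ** 2
    return total
```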
2.7. Discussion
The most obvious issue of the harmonic model is the summation of two filters, which admits an unlimited number of optimal solutions, as any constant can always be added to one filter and subtracted from the other. But there are also ambiguous solutions due to the pitch dependency of the source filter function, which can also be seen as a log-frequency dependency and therefore might lead to ambiguous results with the frequency dependent resonance filter. More ambiguous solutions may become possible if more or different gestural control parameters are introduced.
Another major issue which needs to be addressed is the unequal distribution of the data with respect to the control parameters, that is, the existence of large areas in the model without any or with only few data points. This has to be assumed for all instruments, as it can be caused by partials only being present in fortissimo, but never within pianissimo signals, as well as by partials only present at lower pitches, but, due to sample rate limits, never at higher pitches. Another reason for such data gaps arises with impulsively excited instruments, where, due to the shorter decay times of higher partials, no partial amplitudes can be found for regions of lower temporal intensity. Such data gaps can cause almost random behavior within these regions during the adaptation process. This leads to highly malformed shapes within the model and may therefore result in erroneous synthesis when transforming partial amplitudes.
Consequently, plausible shapes are needed to enable reasonable transformations.
3. CONSTRAINTS
To solve the issues which arise from the ambiguities and data gaps, we propose the use of constraints as additions to the model error function (8). These constraints are meant to measure deviations from desired characteristics, such that by minimizing the resulting objective function, not only the model error is minimized, but at the same time the deviations from the desired behavior are minimized as well.
All these constraints can be formulated independently of the B-spline representation, and an advantage of using B-splines is that derivatives of arbitrary order can be taken easily. All constraints have to be weighted by means of a weighting factor balancing the importance of the constraint. The choice of the weighting factor is a problem of its own, as it needs to balance the impact of the constraint against the model error and the other constraints.
3.1. Constraint I
The first type of constraint samples a filter function at an arbitrary but sufficiently high rate and squares the sum of all filter values, such that it measures the offset of the function. We apply this constraint only to the resonance filter, as shown in eq. (9), with f being some frequency sampling vector. This constraint increases the value of the objective function for any offset of the resonance filter from zero; its minimization therefore solves the ambiguity problem of the summation of the two filters by fixing the second filter around 0 dB.
C_I = \frac{1}{2} \lambda_j \left( \sum_{f} F(f) \right)^2    (9)
The constraint's weighting parameter λ_j has to be adjusted according to the amount of data f_k^{(α)} and sampling points f, so as to obtain a tradeoff that keeps the constraint effective while still minimizing the error function properly. In eq. (10) we propose to use the ratio of the l1-norm of the sum of the squared B-spline values at all data points f_k^{(α)} and the l1-norm of the sum of the squared B-spline values over the sampling series f.
\lambda_j = \lambda_0^{j} \, \frac{\left\| \sum_{(\alpha)} \sum_{s} \sum_{n_s} \sum_{k}^{K} B_v\!\left(f_k^{(\alpha)}\right)^2 \right\|_1}{\left\| \sum_{f} \left( B_v(f) \right)^2 \right\|_1}    (10)
This ratio is then adjusted with an initial weighting parameter λ_0^j which is independent of the actual amount of data and sampling points. It scales between 0, reflecting no constraint, and 1, making the constraint as strong as all data points together. We are using λ_0^j = 0.001, which works well for all our instrument sets.
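A small sketch of this constraint and its data-dependent weight, following the "squared sum" reading of eq. (9); the basis evaluator is assumed to be the one from the eq. (6) sketch.

```python
import numpy as np

def constraint_I(delta, basis_F, f_grid, f_data, lambda0=0.001):
    """Constraint C_I of eq. (9) with the data-dependent weight λ_j of eq. (10).

    delta   : resonance-filter weights δ_v
    basis_F : callable returning all B_v basis values at a frequency
    f_grid  : dense frequency sampling vector f
    f_data  : frequencies f_k^(α) of every partial in every frame, flattened
    """
    B_data = np.array([basis_F(f) for f in f_data])     # basis values at the data
    B_grid = np.array([basis_F(f) for f in f_grid])     # basis values on the grid
    lam = lambda0 * np.sum(B_data ** 2) / np.sum(B_grid ** 2)    # eq. (10)
    F = B_grid @ delta                                   # sampled resonance filter F(f)
    return 0.5 * lam * np.sum(F) ** 2                    # eq. (9)
```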
3.2. Constraint II
The second class of constraints is designed to solve the data gap issues mentioned in section 2.7 by extrapolating from regions containing meaningful data into sparsely filled or even empty areas. It further resolves the ambiguity problem of the frequency dependency of both filter functions. We establish this constraint for the source filter function only.
Again, the constraint samples the filter function on an arbitrary grid Θ, but sums the squares of the z-th order partial derivatives of the filter function with respect to one of its dimensions, as denoted in eq. (11). With z = 1 or z = 2 the constraint's value therefore increases with increasing slope or curvature respectively, in both cases along some selected dimension i ∈ {P, I, E(n)}. Hence, minimizing the slope of the surface along one of its dimensions means flattening its shape and extrapolating constantly, whereas minimizing its curvature can be used to extrapolate linearly along one dimension. Moreover, specifically targeting impulsively excited signals, we introduce a function η^{i,z}(Θ_i) to locally emphasize a constraint's weighting parameter λ_{k,s}^{i,z}, which allows us not only to adjust a weight constantly for a whole constraint, but also according to a certain control parameter i. This allows adjusting the impact of a certain constraint depending on the value of a gestural control parameter, since these instruments show rapid spectral and intensity changes. In other words, we would like to have stronger constraints for lower values of the local intensity envelope than for higher values, to smoothly fade out the amplitudes of partials of higher index. In our case, η^{i,z}(Θ_i) is linear along i and constant along all other dimensions, and it is only used for impulsively excited instrument signals.
We established this constraint generically for all orders of partial derivatives and all dimensions of the filter function.
C_{II}^{i,z} = \frac{1}{2} \sum_{s} \sum_{k=1}^{K} \lambda_{k,s}^{i,z} \sum_{\Theta} \left( \eta^{i,z}(\Theta_i) \, \frac{\partial^z}{\partial \Theta_i^z} S_{k,s}(\Theta) \right)^2    (11)
Again, the weighting parameters of the constraints need to be processed in a manner similar to the first constraint. But as we apply this constraint to the source filter function only, the corresponding weight has to be summed separately for each temporal segment s and partial index k, and also independently for all selected values of dimension i and derivative order z, establishing a weight λ_{k,s}^{i,z} with respect to these parameters.
Eq. (12) shows how an initial weighting parameter λ_0^{i,z} is scaled with the ratio of the l1-norm of the data-dependent sum and the l1-norm of the sum over the sampled filter function, hence making the initial constraint weight λ_0^{i,z} independent of the data and dependent only on the selected dimension and derivative order.
\lambda_{k,s}^{i,z} = \lambda_0^{i,z} \, \frac{\left\| \sum_{(\alpha)} \sum_{n_s} B_u\!\left(\Theta^{(\alpha)}(n_s)\right)^2 \right\|_1}{\left\| \sum_{\Theta} \left( \frac{\partial^z}{\partial \Theta_i^z} B_u(\Theta) \right)^2 \right\|_1}    (12)
The ambiguity due to the log-frequency dependency of the source filter and the frequency dependency of the resonance filter can be resolved by the assumption that a source varies more slowly with log-frequency than a resonance filter varies with frequency. This follows the idea that the excitation signal is similar for different pitches, whereas a resonance body may exhibit closely spaced resonances and anti-resonances. Such a constraint can be created by using i = P and z = 1, forcing the surface of the source to be rather constant along the pitch dimension.
Extrapolation as described above is carried out for the global intensity I as well as the local intensity E, since the pitch is already constrained with z = 1 for constant behavior. For the global intensity we choose z = 2 to obtain some fade-out, rather than constant behavior, when extrapolating from ff to pp. The local intensity envelope is always constrained using a trade-off between z = 1 and z = 2 to achieve a smooth fade-out of the partial amplitude values. Additionally, for impulsive signals, these constraints are given boosted weights for lower local intensity values using η^{i,z}(Θ_i), for improved modeling of their rapid changes and the large data gaps at lower values of E.
The initialization of the lambda parameters is crucial for all training results, and since we do not yet have an automatic selection of suitable parameter configurations, several manual step-by-step adjustments are required to achieve training results which are good enough to be used for sound synthesis.
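For illustration, one smoothness term of eq. (11) can be approximated on a sampled grid with finite differences; the paper evaluates the derivatives analytically via the B-spline representation, so this is only a simplified sketch.

```python
import numpy as np

def constraint_II_term(S_grid, axis, z, lam, eta=None, spacing=1.0):
    """Approximate one term of C_II^{i,z} (eq. (11)) for a single partial k and
    segment s, with S_{k,s}(Θ) sampled on a regular grid over (I, E, P).

    axis : control-parameter dimension i to penalize (0: I, 1: E, 2: P)
    z    : derivative order (1 penalizes slope, 2 penalizes curvature)
    lam  : weight λ^{i,z}_{k,s} (cf. eq. (12))
    eta  : optional local emphasis η^{i,z}(Θ_i), broadcastable to the
           differentiated grid (used to boost low-E regions for impulsive sounds)
    """
    d = np.diff(S_grid, n=z, axis=axis) / spacing ** z   # z-th partial derivative
    if eta is not None:
        d = eta * d
    return 0.5 * lam * np.sum(d ** 2)
```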
Finally, the resulting objective function to optimize the harmonic model according to the data samples (α) and the constraints C_I and C_{II}^{i,z} is denoted as eq. (13).

O_h = \sum_{(\alpha)} R_h^{(\alpha)} + \sum_{i,z} C_{II}^{i,z} + C_I    (13)
4. GRADIENT NORMALIZATION
As we have already stated, the harmonic as well as the noise model are both linear models; their objective functions therefore exhibit a single global minimum and level surfaces shaped like ellipsoids. SCG is a conjugate gradient optimization method designed for complex and high dimensional non-linear problems. In the present case the problem is linear, but very high-dimensional, such that an analytic solution is not feasible. The convergence of conjugate gradient methods is related to the condition number of the Hessian that characterizes the problem. The condition number of the present problem is unfortunately rather high, causing SCG to take several thousand iterations until convergence for a linear problem with up to 100k parameters. The high condition number is related to the uneven distribution of the data points across our model space of control parameters and partial frequencies, which has already been discussed in section 2.7.
A well known solution for such cases is preconditioning of the problem such that the condition number is reduced [25]. For the present problem, we propose a rather simple preconditioning approach that consists of scaling all equations such that the diagonal elements of the Hessian matrix are all identical and equal to 1. Our practical experience with this preconditioning has shown a significant increase in convergence speed, reducing the required number of iterations from a few thousand to 25 or 50 for the noise and harmonic parameter optimization respectively. To normalize all second derivatives on the diagonal of the Hessian, we introduce a substitution of all B-spline parameters of the harmonic model, as shown in eqs. (14) and (15), into new B-spline parameters γ̄_{k,s}^u and additional data-dependent weighting parameters c_{k,s}^u. The equations for the noise case can easily be derived from the harmonic case.
\gamma_{k,s}^{u} = \bar{\gamma}_{k,s}^{u} \cdot c_{k,s}^{u}    (14)
\delta_v = \bar{\delta}_v \cdot c_v    (15)
Now, the new B-spline parameters γ̄_{k,s}^u need to be estimated from the input data with respect to all constraints, and therefore all gradients of the objective function need to be reformulated using the new weighting parameters, which is left to the reader. The introduced parameters c_{k,s}^u and c_v are calculated from eqs. (16) and (17) by solving all equations for the new c parameters.
\frac{\partial^2 O_h}{\partial \left(\bar{\gamma}_{k,s}^{u}\right)^2} = \sum_{(\alpha)} \frac{\partial^2 R_h^{(\alpha)}}{\partial \left(\bar{\gamma}_{k,s}^{u}\right)^2} + \sum_{i,z} \frac{\partial^2 C_{II}^{i,z}}{\partial \left(\bar{\gamma}_{k,s}^{u}\right)^2} = 1    (16)

\frac{\partial^2 O_h}{\partial \bar{\delta}_v^{\,2}} = \sum_{(\alpha)} \frac{\partial^2 R_h^{(\alpha)}}{\partial \bar{\delta}_v^{\,2}} + \frac{\partial^2 C_I}{\partial \bar{\delta}_v^{\,2}} = 1    (17)
Solving equations (16) and (17) for c_{k,s}^u and c_v gives the requested normalization of the diameters of the error ellipsoid.
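As a minimal sketch of this preconditioning step: with the substitution γ = c · γ̄, the diagonal Hessian entry transforms as ∂²O/∂γ̄² = c² ∂²O/∂γ², so setting it to 1 yields the scale factors below.

```python
import numpy as np

def precondition_scales(diag_hessian, eps=1e-12):
    """Per-parameter scale factors c (eqs. (14)-(17)).

    diag_hessian : accumulated second derivatives of the objective with respect
                   to each original B-spline weight (one value per parameter)
    Returns c such that the substituted parameters have a unit diagonal Hessian.
    """
    return 1.0 / np.sqrt(np.maximum(diag_hessian, eps))
```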
5. MODELING RESULTS
To demonstrate the universality of our approach to sound transformation using a generalized instrument model, we selected several different musical instruments covering various modes of excitation and gestural control. These instruments are trumpet, clarinet, violin and a grand piano. In the figures, we present cutouts of the estimated hyperplanes of the source filter S_{k,s}(Θ) for a certain partial k and temporal segment s with respect to two out of the three control parameters, as well as the estimated univariate resonance filters F(f).
Additionally, all figures depict data points from the input signals according to their mapping onto the filters, to show the fit of the B-spline models to their respective data. This means that the displayed amplitudes of the partial data are processed by subtracting the filter that is not displayed from the input data, as shown in eq. (18) for the source and in eq. (19) for the resonance filter figures.
\tilde{A}_{k,s}^{(\alpha)}(n_s) = A_k^{(\alpha)}(n_s) - F(f_k)    (18)
\tilde{A}_{k,s}^{(\alpha)}(n_s) = A_k^{(\alpha)}(n_s) - S_{k,s}(\Theta^{(\alpha)})    (19)
Note that for the source filter only partial amplitudes of a fixed partial index k and temporal segment s are shown, since the source hyperplane is specific for each k and s but defined with respect to the control parameters. On the contrary, in the figures of the resonance filters, the amplitudes of all partials k from both temporal segments are displayed according to their frequency.
Figure 5: Estimated filter F(f) of the trumpet (solid) and according data points Ã_{k,s}^{(α)}(n_s) (grey).
Figure 6: 3-dimensional cutouts of the four-dimensional source filter function for the 10th partial and 2nd temporal segment of the trumpet. The plane represents the source model S_{k,s}(Θ), with the left showing the plane for mf and the right showing the plane for medium pitches. Data points fade from black to white, indicating their decreasing influence on the shown plane with regard to the B-splines of the left-out dimension.
Figure 7: Estimated filter F(f) of the clarinet (solid) and according data points Ã_{k,s}^{(α)}(n_s) (grey).
It can be observed from the figures that all models exhibit a rather strong variance of the data around the filter functions. This is related to timbre variations of each instrument which are not covered by our selected control parameters. As these variances are not modeled, they will remain in the signals for all transformations.
One major limitation of our method is the still missing automatic adjustment of the configuration of the B-spline models as well as of the initialization of the constraint weighting parameters. We therefore need to optimize these parameters manually by judging the training results visually.
Figure 8: Estimated filter F(f) of the violin (solid) and according data points Ã_{k,s}^{(α)}(n_s) (grey).
Figure 9: Estimated filter F(f) of the piano (solid) and according data points Ã_{k,s}^{(α)}(n_s) (grey).
6. SYNTHESIS
6.1. Filter Envelope Processing
Since we use spectral domain filtering for our sound synthesis, the estimated partial amplitudes Â_{k,s} as well as the cepstral coefficients Ĉ_{l,s} need to be transformed into spectral envelopes with equidistant points ranging from 0 Hz to half the sampling rate. In both cases we employ the cepstral smoothing method. The estimated noise coefficients already represent a smoothed envelope, so we only need a single DFT of the cepstral coefficients to obtain the desired filter envelope. For the harmonic envelope, we need to interpolate smoothly between the partial amplitudes, but as they are already in the log domain, a small IDFT with a cepstral domain filtering, followed by another small DFT, is sufficient to generate a smoothed envelope passing through the partial amplitudes.
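A sketch of the noise-envelope case (cepstral coefficients to a spectral envelope via one DFT); the bin count and normalization conventions are our assumptions, and whether the result is in dB depends on how the cepstral coefficients were computed.

```python
import numpy as np

def cepstral_envelope(c, n_bins, sr):
    """Spectral envelope on n_bins equidistant points from 0 Hz to sr/2,
    obtained from L real cepstral coefficients by a zero-padded DFT
    (cepstral smoothing).
    """
    L = len(c)
    n_fft = 2 * (n_bins - 1)                 # requires n_fft >= 2 * L - 1
    cep = np.zeros(n_fft)
    cep[:L] = c
    if L > 1:
        cep[-(L - 1):] = c[1:][::-1]         # mirror for a real, symmetric spectrum
    env = np.real(np.fft.rfft(cep))          # log-amplitude envelope, length n_bins
    freqs = np.linspace(0.0, sr / 2, n_bins)
    return freqs, env
```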
6.2. Source Signal Creation
Before we are able to utilize our models for sound transformations, we filter all segregated harmonic input sounds x_h(t) and noise input sounds x_n(t) of the instrument database with the inverse of the estimated envelopes, using their associated gestural control parameters Θ. This removes the timbre estimated by the models from each signal and creates a harmonic signal x̄_h(t) with almost white partial amplitudes and an almost white noise signal x̄_n(t). This procedure removes all sound features from the input signals which are covered by the instrument models, but leaves everything that is not modeled in the created signals.
These source signals can then be transformed with filter envelopes generated with the model from manipulated gestural control parameters, to obtain signals with altered gestures. They will exhibit the estimated target timbre, but with preserved variations which are not covered by the models. This makes the synthesis sound natural rather than static.
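A per-frame sketch of this inverse filtering and of re-applying a target envelope; the actual system presumably performs this within a phase-vocoder style spectral processing chain, so frame handling, windowing and overlap-add are omitted here.

```python
import numpy as np

def whiten_frame(X_frame, model_env_dB, floor=1e-6):
    """Remove the model timbre from one complex spectral frame of x_h(t) or x_n(t)
    by dividing out the envelope estimated at the frame's own control parameters Θ."""
    gain = np.maximum(10.0 ** (model_env_dB / 20.0), floor)
    return X_frame / gain

def transform_frame(X_white, target_env_dB):
    """Impose an envelope generated from manipulated control parameters on a
    whitened source frame."""
    return X_white * 10.0 ** (target_env_dB / 20.0)
```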
This concludes our proposed model, as we have now generated all the components needed for our proposed sound synthesis.
Figure 10: 3-dimensional cutouts of the four-dimensional source filter function for the 5th partial and 2nd temporal segment of the piano. The plane represents the source model S_{k,s}(Θ), with the left showing the plane for ff and the right showing the plane for the lowest pitches. Data points fade from black to white, indicating their decreasing influence on the shown plane with regard to the B-splines of the left-out dimension.
These components are the harmonic and noise source signals, the two harmonic filter functions and the noise filter function.
6.3. Sound Transformation
The transformation of sounds using our proposed extended source-filter model first has to accomplish the resynthesis of the original sounds without introducing artifacts. We call this the neutral synthesis; it is achieved by applying filter envelopes to source signals generated with the associated gestural control parameters without any manipulation. This has to deliver sounds which are indistinguishable from the originals.
Sound synthesis exhibiting transformations requires manipulation of the gestural control parameters, for example pitch or global note intensity changes. Transformations due to manipulations of the note intensity result in altered spectral envelopes according to the envelopes learned from the corresponding sound samples. Changes of the pitch, however, require an additional transposition step for the harmonic source signal, as the envelopes only serve for filtering the spectral shape; the desired pitch shift therefore has to be carried out by some additional algorithm.
In addition to sound transformations using the instrument model belonging to the source signals, all components can be exchanged with components from other instruments to create new hybrid instruments with a huge variety of component combinations.
The proposed instrument model still has some limitations, though. First, as we do not add artificial partials to the source signals during transformation, sound synthesis is rather limited for transformations from lower to higher global note intensities and for pitch transpositions from higher to lower notes, as such transformations would require the addition of partials. Another limitation results from the still missing automatic adjustment of the temporal intensity envelope, which leads to less natural synthesis results, especially when pitch shifting impulsive signals by more than an octave or changing their note intensity significantly.
6.4. Synthesis Results
Sound examples demonstrating the capabilities of our proposed instrument model are being made available through our demo web page http://anasynth.ircam.fr/home/media/sor2-instrument-model-demo.
7. CONCLUSION
We have presented a deeply revised version of our recently proposed instrument model for sound transformation and hybridization using sample-based synthesis. This includes an improved mathematical formalism and the introduction of generalized classes of constraints for enhanced modeling of an instrument's timbre variations due to selected gestural control parameters. We further revised the adaptation algorithm with a preconditioning technique to significantly reduce the number of iterations needed by the gradient descent method. Finally, a synthesis scheme has been described to demonstrate the application of our proposed instrument model to sample-based sound synthesis with independent control of the gestural parameters, including natural sounding transformations as well as the creation of new hybrid instruments.
8. ACKNOWLEDGMENTS
Many thanks to the anonymous reviewers for their valuable remarks and suggestions.
9. REFERENCES
[1] Diemo Schwarz, "A system for data-driven concatenative sound synthesis," in Digital Audio Effects (DAFx), Verona, Italy, December 2000, pp. 97–102.
[2] E. Lindemann, "Music synthesis with reconstructive phrase modeling," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 80–91, 2007.
[3] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC Music Database: Music genre database and musical instrument sound database," in 4th International Society for Music Information Retrieval Conference, October 2003, pp. 229–230.
[4] Homer Dudley, "Remaking speech," J. Acoust. Soc. Am., vol. 11, no. 2, pp. 169–177, 1939.
[5] Wayne Slawson, Sound Color, University of California Press, Berkeley, 1985.
[6] A. Röbel and X. Rodet, "Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation," in 8th International Conference on Digital Audio Effects, Madrid, Spain, September 2005, pp. 30–35.
[7] D. Arfib, F. Keiler, U. Zölzer, and V. Verfaille, Digital Audio Effects (ed. U. Zölzer), chapter 8 - Source-Filter Processing, pp. 279–320, John Wiley & Sons, 2nd edition, 2011.
[8] G. Fant, J. Liljencrants, and Q. Lin, "A four-parameter model of glottal flow," STL-QPSR, vol. 4, pp. 1–13, 1985.
[9] D. G. Childers, "Glottal source modeling for voice conversion," Speech Communication, vol. 16, no. 2, pp. 127–138, 1995.
[10] Arantza del Pozo, Voice Source and Duration Modelling for Voice Conversion and Speech Repair, Ph.D. thesis, Cambridge University, Engineering Department, April 2008.
[11] Javier Perez Mayos, Voice Source Characterization for Prosodic and Spectral Manipulation, Ph.D. thesis, Universitat Politecnica de Catalunya, July 2012.
[12] Gilles Degottex, Pierre Lanchantin, Axel Roebel, and Xavier Rodet, "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis," Speech Communication, vol. 55, no. 2, pp. 278–294, 2012.
[13] Sean O'Leary, Physically Informed Spectral Modelling of Musical Instrument Tones, Ph.D. thesis, The University of Limerick, 2009.
[14] A. Klapuri, "Analysis of musical instrument sounds by source-filter-decay model," in 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2007, vol. 1, pp. I-53–I-56.
[15] T. Heittola, A. Klapuri, and T. Virtanen, "Musical instrument recognition in polyphonic audio using source-filter model for sound separation," in 10th International Society for Music Information Retrieval Conference, October 2009, pp. 327–332.
[16] X. Serra, A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition, Ph.D. thesis, Stanford University, 1989.
[17] X. Amatriain, J. Bonada, A. Loscos, and X. Serra, Digital Audio Effects (ed. U. Zölzer), chapter 10 - Spectral Processing, pp. 393–446, John Wiley & Sons, 2nd edition, 2011.
[18] Marcelo Caetano, Morphing Isolated Quasi-Harmonic Acoustic Musical Instrument Sounds Guided by Perceptually Motivated Features, Ph.D. thesis, Universite Pierre et Marie Curie, UPMC, Universite Paris VI, 2011.
[19] H. Hahn and A. Röbel, "Extended source-filter model of quasi-harmonic instruments for sound synthesis, transformation and interpolation," in Sound and Music Computing Conference (SMC) 2012, Copenhagen, Denmark, July 2012.
[20] H. Hahn, A. Röbel, J. J. Burred, and S. Weinzierl, "Source-filter model for quasi-harmonic instruments," in 13th International Conference on Digital Audio Effects, September 2010.
[21] H. Hahn and A. Röbel, "Joint f0 and inharmonicity estimation using second order optimization," in Sound and Music Computing Conference (SMC) 2013, August 2013.
[22] C. de Boor, A Practical Guide to Splines, Springer, 1978.
[23] K. Höllig, Finite Element Methods with B-Splines, Society for Industrial and Applied Mathematics, SIAM, 2003.
[24] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525–533, 1993.
[25] Jonathan R. Shewchuk, "An introduction to the conjugate gradient method without the agonizing pain," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1994. Available at http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf.