TSAM: A TOOL FOR ANALYZING, MODELING, AND
MAPPING THE TIMBRE OF SOUND SYNTHESIZERS
Stefano Fasciani
Faculty of Engineering and Information Sciences
University of Wollongong in Dubai
stefanofasciani@stefanofasciani.com
ABSTRACT
Synthesis algorithms often have a large number of adjust-
able parameters that determine the generated sound and
its resultant psychoacoustic features. The relationship
between parameters and timbre is important for end us-
ers, but it is generally unknown, complex, and difficult to
analytically derive. In this paper we introduce a strategy
for the analysis of the sonic response of synthesizers sub-
ject to the variation of an arbitrary set of parameters. We
use an extensive set of sound descriptors which are
ranked using a novel metric based on statistical analysis.
This enables the study of how changes to a synthesis pa-
rameter affect timbral descriptors, and provides a multi-
dimensional model for the mapping of the synthesis con-
trol through specific timbre spaces. The analysis, model-
ing and mapping are integrated in the Timbre Space Ana-
lyzer & Mapper (TSAM) tool, which enables further in-
vestigation into synthesis sonic response and on percep-
tually related sonic interactions.
1. INTRODUCTION
The timbre generated by a sound synthesis algorithm de-
pends on the values assigned to the variable parameters,
typically user configurable. Regardless of the synthesis
method, the relationship between control and perceptual
features of the resultant sound is generally weak [1] and
difficult to determine. Modern synthesis algorithms pre-
sent a wide timbre range and a high dimensional control
space. The timbre, which is central in modern sonic arts,
has high dimensionality as well [2] and a blurry scientific
definition [3]. For designers of sonic interactive systems
and of musical instruments, knowing the parameter-to-
timbre relationship supports the implementation of the
intended sonic response. For sound designers and per-
formers this knowledge eases the development of control
intimacy [4]. Also, this insight can help in improving the
expressivity of musical instruments by reducing the con-
trol dimensionality while broadening the timbral re-
sponse. The heuristic estimation of the parameter-to-
timbre causality is workable, but is subjective and inaccu-
rate. This task is challenging due to nonlinearities and
correlations in the synthesis process, especially when a
large set of variable parameters are involved.
We address this issue by proposing a systematic and
generic method to analyze the timbre in relation to the
synthesis variables. The collected data is then processed
by computing a quality metric for each sound descriptor,
composed of four weighted components, each represent-
ing a specific statistical characteristic. Additionally, qual-
ity metrics for synthesis parameters are provided as well.
This information can be used in designing the mapping of
musical gestures to the synthesis control, providing a
tighter causal link with the timbral response of the sys-
tem. The tool we present here, the Timbre Space Analyz-
er & Mapper (TSAM), integrates these functionalities and
supports implementation of few-to-many lossless map-
pings [5], through an intermediate timbre-related layer
[6]. The tool, after analyzing the sonic response of the
synthesizer, computes a reduced timbre-to-parameter
model, which supports real-time interaction with the
sound synthesizer. In particular, we integrate an exten-
sion of the modeling and mapping strategy we introduced
in [7], highlighting the enhancement achieved when con-
sidering the quality metric for selecting the descriptors used for mapping purposes.
The TSAM is a flexible tool, exposing internal compu-
tation settings and options on a Graphical User Interface
(GUI), which supports a range of applications and aims.
The perceptual characteristics of a synthesis method can be studied, characterized, and compared numerically or graphically. The relationship between timbre, spectrum, and different musical scales can be investigated [8]. Different mapping approaches for musical instruments can be explored and compared. The rest of this paper is organized as follows. In Section 2 we describe the synthesis
analysis procedure and present the quality metric for de-
scriptors and parameters. Section 3 provides a summary
of the timbre space mapping strategy. The TSAM imple-
mentation is detailed in Section 4. Finally, Section 5 concludes with a discussion and future work.
2. TIMBRE RESPONSE ANALYSIS
Understanding the sonic variation resulting from tweaking parameters is a common part of getting familiar with a sound synthesizer. Different users may have distinct intents.
Sound designers aim at synthesizer configurations generating their desired sound, whereas performers and instrument builders look for a mapping that yields
sonic expressivity. Synthesizers generally feature a large
number of controllable parameters, representing the syn-
thesis algorithm variables. In analog synthesizers, each parameter can theoretically assume an infinite number of values, while in digital (or software) synthesizers each parameter can take more than 4 billion distinct values in single-precision (32-bit) implementations. Synthesizers interfaced using the MIDI protocol allow only up to 128 distinct values per parameter (7 bit), regardless of the resolution of the internal circuitry. However, with only three MIDI-controlled parameters we already have more than 2 million (2^21) different parameter permutations, or unique synthesis states. This combinatorial explosion limits the feasibility of a comprehensive analysis of all the timbres resulting from each of these states.
Limiting the dimensionality of the parameter space allows coping with the large number of synthesis states to analyze, leaving only a few variable parameters and fixing the remaining ones to specific values. In this case the timbre analysis is limited to a subset of the entire parameter space, a scenario equivalent to users tweaking only a few parameters of a synthesis configuration (or preset). To further reduce the number of states to analyze
we use the principle of spatial locality: states close in the
parameter space generate similar timbres. Therefore we
can sample the parameter space with a larger step size,
and eventually interpolate at a later stage. This principle
is generally true if we exclude synthesis algorithms fea-
turing stochastic components, and parameters with strong
nonlinearities (e.g. binary switches). The converse of this principle generally does not hold: proximity in the timbre space does not necessarily imply similar parameter configurations. The TSAM itself can be used to verify these principles. A further reduction can be achieved by limiting the individual range of interest of each parameter.
Given k variable synthesis parameters, the synthesis state space I (the set of unique parameter permutations) is given by Equations (1)-(3) [9].
$$\mathbf{I} = [\mathbf{i}_1, \mathbf{i}_2, \ldots, \mathbf{i}_n] \quad (1)$$

$$\mathbf{i} = [i_1, i_2, \ldots, i_k] \quad (2)$$

$$n = \prod_{j=1}^{k} \frac{\max(i_j) - \min(i_j)}{\operatorname{step}(i_j)} \quad (3)$$
Each synthesis state is represented with a vector i with
dimensionality k, as in Equation (2), while n, the number
of vectors in I, depends on the individual range and step
size of the k parameters, as in (3). I is the synthesis state
space we consider for the timbre analysis, presenting di-
mensionality k and cardinality n.
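To make Equations (1)-(3) concrete, here is a minimal Python sketch that enumerates a synthesis state space from per-parameter ranges and step sizes. The parameter ranges, step sizes, and inclusive endpoint handling are illustrative assumptions, not TSAM internals.

```python
import itertools
import numpy as np

def state_space(ranges, steps):
    """ranges: list of (min, max) per parameter; steps: matching step sizes.
    Returns an (n, k) array whose rows are the unique parameter vectors i."""
    axes = [np.arange(lo, hi + step / 2, step)
            for (lo, hi), step in zip(ranges, steps)]
    return np.array(list(itertools.product(*axes)))

# Example: k = 3 parameters sampled coarsely; spatial locality justifies a
# large step size, with interpolation recovering intermediate states later.
I = state_space([(0.0, 1.0)] * 3, [0.25] * 3)
print(I.shape)  # (125, 3): n = 5**3 states, dimensionality k = 3
```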
2.1 Descriptors Set and Computation
For each state i of the sound synthesizer we compute a set of audio descriptors, indicated with d, representing the timbral characteristics of the resulting synthetic sound. A large set of low-level computational descriptors, including possible redundancies, is essential for the detailed timbre analysis we require in this context. A few higher-level timbre descriptors (e.g. brightness, noisiness, coloration), whose semantics are often subjective and language dependent [10], are suitable for discriminating sounds with major timbral differences, but in this context they fail to capture the subtle sonic nuances determined by small variations of the synthesis parameters.
A posteriori descriptor selection is possible using the quality metric we present in this paper. The method is independent of the specific descriptor set. In the TSAM
we use the CUIDADO features set [11] implemented in
the IRCAM descriptors object for Max/MSP. The set
includes spectral and perceptual features listed in Table 1.
It includes 24 scalar and 7 vectorial descriptors, as speci-
fied in the dimensionality column, resulting in a dimen-
sionality q of d equal to 108, as in (4). Some of the scalar
descriptors in the set are closely related to traditional
timbre labels (e.g. spectral centroid to brightness).
$$\mathbf{d} = [d_1, d_2, \ldots, d_q] \quad (4)$$
Descriptor Name                   Dimensionality
Total Energy                      1
Signal Zero Crossing Rate         1
Spectral Centroid                 1
Spectral Crest                    4
Spectral Decrease                 1
Spectral Flatness                 4
Spectral Kurtosis                 1
Spectral Rolloff                  1
Spectral Skewness                 1
Spectral Slope                    1
Spectral Spread                   1
Spectral Variation                1
Perceptual Odd To Even Ratio      1
Perceptual Spectral Centroid      1
Perceptual Spectral Decrease      1
Perceptual Spectral Deviation     1
Perceptual Spectral Kurtosis      1
Perceptual Spectral Rolloff       1
Perceptual Spectral Skewness      1
Perceptual Spectral Slope         1
Perceptual Spectral Spread        1
Perceptual Spectral Variation     1
Perceptual Tristimulus            3
Sharpness                         1
Spread                            1
Noise Energy                      1
Noisiness                         1
Chroma                            12
MFCC                              13
Relative Specific Loudness        24
Perceptual Model                  24
Table 1. List of descriptors used in the TSAM.
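To give a flavor of the low-level features in Table 1, the sketch below computes one of them, the spectral centroid, from a single windowed frame. The TSAM relies on the IRCAM descriptors object for Max/MSP; this NumPy stand-in only illustrates the kind of computation involved.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of a windowed audio frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

sr = 44100
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440 * t)          # a 440 Hz test tone
print(round(spectral_centroid(frame, sr)))   # approximately 440
```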
The descriptors listed above are computed on a short temporal window, typically in the range 2 ms to 200 ms. They provide an instantaneous sonic representation sufficient to characterize only strictly periodic sounds. In synthesis states we may observe and hear low-rate timbre variations, spanning beyond the largest temporal window we consider for the descriptors. Hence an appropriate characterization of the timbre requires computing and merging descriptors from multiple short time windows. We propose two analysis modes, named 'sustain' and 'envelope'. In the first, given a synthesis state i, we compute m descriptor vectors and combine these taking their mean and optionally their range, as in Equation (5), doubling the dimensionality of the descriptor set. The second approach simply concatenates the m descriptor vectors into a single vector, as in Equation (6), increasing the dimensionality by m times.
$$\mathbf{d_i} = \left[\, \operatorname{mean}(\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_m) \,;\; \max(\mathbf{d}_1, \ldots, \mathbf{d}_m) - \min(\mathbf{d}_1, \ldots, \mathbf{d}_m) \,\right] \quad (5)$$

$$\mathbf{d_i} = \left[\, \mathbf{d}_1 \,;\; \mathbf{d}_2 \,;\; \ldots \,;\; \mathbf{d}_m \,\right] \quad (6)$$
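A minimal sketch of the two merging modes of Equations (5) and (6), assuming the m descriptor vectors of one state are stacked in an (m, q) array; shapes and values are invented for illustration.

```python
import numpy as np

def merge_sustain(D, include_range=True):
    """Equation (5): mean of the m vectors, optionally stacked with range."""
    parts = [D.mean(axis=0)]
    if include_range:                  # the optional range doubles the size
        parts.append(D.max(axis=0) - D.min(axis=0))
    return np.concatenate(parts)       # length q or 2q

def merge_envelope(D):
    """Equation (6): plain concatenation of the m vectors."""
    return D.reshape(-1)               # length m * q

D = np.random.rand(8, 108)             # m = 8 windows, q = 108 descriptors
print(merge_sustain(D).shape, merge_envelope(D).shape)  # (216,) (864,)
```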
Considering the synthesis as a stationary process, and the sound generated as almost periodic, the first approach provides a sufficient approximation of the timbre. When the synthesis produces dynamic timbres, such as texture-like sounds, or when ADSR envelopes are applied to amplitude and other parameters, the second approach is preferred. However, even in the presence of ADSR envelopes we can still use the first approach, analyzing only the sustain phase of the synthesis, either intentionally discarding the attack, decay, and release phases, or because these do not significantly change within the parameter space I we analyze.
The concatenation of short-term static descriptors to
analyze timbral dynamics is a simplification with respect
to the use of dynamic descriptors computed on longer
temporal windows. However, this approach reduces the
time needed to execute the timbre analysis and allows
users to change the merging mode from ‘sustain’ to ‘en-
velope’ and vice versa without repeating the analysis.
In the TSAM implementation, presented in Section 4,
the computation of the descriptors is completely automat-
ed. Users are only required to identify the k variable pa-
rameters of the synthesizer, their range, step size, number
of descriptor vectors per state 𝑚, and analysis mode. The
tool computes I and drives the synthesizer with one i at a
time, computing and storing 𝑚 vectors d. For analysis in
envelope mode, the tool also manages the triggering of
the synthesizer at every i. Users can further specify the
temporal unfolding of the analysis, selecting only a sub-
set of the ADSR envelope. Advanced options related to
the descriptor computation, such as window size, hop
size, sampling rate, are exposed as well.
2.2 Descriptor Quality Metric
The quality metric we compute for each descriptor is aimed at capturing the four characteristics listed below.

- Noisiness: deviation of the descriptor from its mean given a synthesis state i.
- Variance: spread of the descriptor value across the synthesis state space I.
- Independence: uniqueness of the descriptor variation pattern across the synthesis state space I.
- Correlation: coherence of the descriptor variation with the synthesis parameters across the synthesis state space I.
Ideally, a descriptor representative of I should present low noisiness, high variance, high independence, and high correlation. High noisiness indicates that a particular descriptor, and the associated timbral characteristic, varies even when the synthesis parameters are fixed, and therefore its variance across I may not be significant. A descriptor with low variance reveals that the related timbral characteristic does not change significantly when varying the synthesis parameters. Descriptors varying with a similar trend are redundant, and thus less significant, when computing a dimensionality-reduced timbre space modeling I; more independent descriptors instead carry a larger amount of information. However, descriptors can also be highly independent when varying randomly across I. We address this by also including the correlation between descriptors and parameters in the metric, as we expect representative descriptors to change according to one or more synthesis parameters.
For each descriptor, we compute the noisiness $N_{x,\mathbf{i}}$ from the m descriptor vectors in synthesis state i before these are merged, as per Equations (5) and (6). The subscript x is the index identifying the descriptor across the set of q computed in the TSAM. For 'sustain' mode, we measure the deviation of the descriptor x in the state i using the Relative Mean absolute Difference (RMD), as in Equation (7). The RMD is a scale-invariant measure of statistical dispersion, hence it allows the comparison of heterogeneous descriptors. For 'envelope' mode, $N_{x,\mathbf{i}}$ is estimated as the zero crossing rate, as in Equation (8), of the forward second-order finite difference (a discrete approximation of the second-order derivative) of the series of m descriptors, as in (9). This represents the rate at which a descriptor inverts its trend (from increasing to decreasing and vice versa) in the analyzed envelope. Noisy descriptors invert their trends at higher rates.
$$N_{x,\mathbf{i}} = \frac{\displaystyle\sum_{j=1}^{m} \sum_{k=1}^{m} \left| d_{x,j} - d_{x,k} \right|}{(m-1) \displaystyle\sum_{j=1}^{m} d_{x,j}} \quad (7)$$

$$N_{x,\mathbf{i}} = \frac{1}{m-2} \sum_{j=1}^{m-3} \mathbb{I}\!\left[ \Delta^{2} d_{x,j} \cdot \Delta^{2} d_{x,j+1} < 0 \right] \quad (8)$$

$$\Delta^{2} d_{x,j} = \sum_{k=0}^{2} (-1)^{k} \binom{2}{k} d_{x,j+2-k} = d_{x,j+2} - 2\,d_{x,j+1} + d_{x,j} \quad (9)$$
In Equations (7)-(9), $d_{x,j}$ represents the x-th descriptor in the set of q, from the j-th vector d out of the m computed for each state i. The indicator function $\mathbb{I}[\cdot]$ is equal to 1 if its argument is true, and 0 otherwise. $\Delta^{2}$ is the forward second-order finite difference function. The overall noisiness $N_x$ of each descriptor is computed by taking the average over the set of unique synthesis states I we analyze.
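The two noisiness estimates of Equations (7)-(9) can be sketched as follows; the input series is invented, and the RMD form assumes positive-valued descriptors.

```python
import numpy as np

def noisiness_sustain(d):
    """Equation (7): Relative Mean absolute Difference, a scale-invariant
    dispersion measure (assumes a positive-valued descriptor series)."""
    m = len(d)
    return np.abs(d[:, None] - d[None, :]).sum() / ((m - 1) * d.sum())

def noisiness_envelope(d):
    """Equations (8)-(9): rate of trend inversions, detected as sign changes
    of the forward second-order finite difference."""
    m = len(d)
    dd2 = np.diff(d, n=2)                        # m - 2 second differences
    return np.sum(dd2[:-1] * dd2[1:] < 0) / (m - 2)

d = np.array([1.0, 1.2, 1.1, 1.4, 1.3, 1.6])     # m = 6 values, one descriptor
print(noisiness_sustain(d), noisiness_envelope(d))
```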
Variance, independence, and correlation are computed across I, after the m descriptors are merged as in (5)-(6). The same method is used for both 'sustain' and 'envelope' modes. The variance $V_x$ is computed as the RMD over the n synthesis states i. We use the same expression as in (7), replacing m with n, but in this case $d_{x,j}$ is the x-th descriptor in the set of q, from the j-th vector d out of the n we compute across I.
We assume that descriptors are independent if poorly correlated, therefore we compute $I_x$ by taking the complement of the averaged absolute value of the correlation coefficient between the descriptor x and the other q-1 descriptors over I, as in Equation (10). Both positive and negative correlations indicate dependence, therefore we take the absolute value of the correlation coefficient $\operatorname{corr}(\cdot,\cdot)$. We subtract 1 from the summation to remove the correlation coefficient of the descriptor with itself, when j=x. Finally, the correlation $C_x$ between descriptors and parameters is computed by taking the average correlation coefficient between the x-th descriptor and the k variable synthesis parameters, as in Equation (11).
$$I_{x} = 1 - \frac{1}{q-1} \left( \sum_{j=1}^{q} \left| \operatorname{corr}\!\left( \mathbf{d}_{x,\mathbf{I}}, \mathbf{d}_{j,\mathbf{I}} \right) \right| - 1 \right) \quad (10)$$

$$C_{x} = \frac{1}{k} \sum_{j=1}^{k} \left| \operatorname{corr}\!\left( \mathbf{d}_{x,\mathbf{I}}, \mathbf{i}_{j,\mathbf{I}} \right) \right| \quad (11)$$
In (10) and (11), $\mathbf{d}_{x,\mathbf{I}}$ represents the vector containing the n values of the x-th descriptor computed over the synthesis state space I, while $\mathbf{i}_{x,\mathbf{I}}$ represents the vector containing the n values of the x-th synthesis parameter over I. Note that according to (5) and (6) each descriptor may contribute more than one component to each vector d. In particular, for 'sustain' mode we have two components per descriptor if the range is included in the analysis, whereas for 'envelope' mode we have m components per descriptor. Therefore we compute multiple $V_x$, $I_x$ and $C_x$ for each x-th descriptor, and use their average in the quality metric we introduce next.
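A sketch of the variance, independence, and correlation computations across I, following Equations (7), (10), and (11); the random matrices stand in for the merged descriptor data D and the parameter data P.

```python
import numpy as np

def rmd(v):
    """Variance V_x: the RMD of Equation (7) applied across the n states."""
    n = len(v)
    return np.abs(v[:, None] - v[None, :]).sum() / ((n - 1) * v.sum())

def independence(D, x):
    """Equation (10): 1 minus the mean |corr| with the other q - 1 descriptors."""
    q = D.shape[1]
    r = np.corrcoef(D, rowvar=False)[x]              # corr of x with all q
    return 1.0 - (np.abs(r).sum() - 1.0) / (q - 1)   # drop self-correlation

def correlation(D, P, x):
    """Equation (11): mean |corr| of descriptor x with the k parameters."""
    return np.mean([abs(np.corrcoef(D[:, x], P[:, j])[0, 1])
                    for j in range(P.shape[1])])

D, P = np.random.rand(100, 10), np.random.rand(100, 3)   # n=100, q=10, k=3
print(rmd(D[:, 0]), independence(D, 0), correlation(D, P, 0))
```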
The quality metric $S_x$ of each descriptor is computed from the individual noisiness, variance, independence, and correlation, as in Equation (12). The noisiness, being an undesirable feature, lowers the value of $S_x$. The four components are combined using individual weights w.
$$S_{x} = w_{V} V_{x} + w_{I} I_{x} + w_{C} C_{x} - w_{N} N_{x} \quad (12)$$
The selection of the weight values depends on the aim and context of the timbre analysis, and also on individual preferences. For instance, when analyzing a synthesizer configuration with a texture-like timbre, we expect considerable sonic variation within each synthesis state i; the noisiness is therefore not significant and $w_N$ should be close to zero. If the purpose of the analysis is the sole study of the synthesizer timbre through the descriptors, their independence has little relevance; when descriptors are used for mapping purposes instead, as in Section 3, independence has higher significance. In the TSAM, the default values of the weights are 0.33 for variance, independence, and correlation, and 0.66 for noisiness. Users can change these in the unitary range. The four components of the quality metric have different ranges: $I_x$ and $C_x$ lie in [0,1], while $N_x$ and $V_x$ can be zero but have no theoretical maximum. In the TSAM we include the option to normalize these to the unitary range, easing their balancing through the individual weights. However, when comparing the quality metrics $S_x$ across different synthesizers, or between different state spaces I of the same synthesizer, normalization should not be used. In the TSAM we also rank the k synthesis parameters by their average correlation with the q descriptors, computed as in (11) but replacing k with q and taking x as the summation index. Furthermore, for each parameter the TSAM displays the two descriptors with the highest and lowest associated correlation, and vice versa.
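The combination and ranking step of Equation (12) can be sketched as follows, using the default weights quoted above; the normalization policy shown (division by the maximum) is an assumption about one reasonable way to map $N_x$ and $V_x$ to the unitary range.

```python
import numpy as np

def quality(N, V, I, C, wV=0.33, wI=0.33, wC=0.33, wN=0.66, normalize=True):
    """Equation (12); N, V, I, C are length-q arrays of the four components."""
    if normalize and N.max() > 0:      # assumed normalization: divide by max
        N = N / N.max()
    if normalize and V.max() > 0:
        V = V / V.max()
    return wV * V + wI * I + wC * C - wN * N

q = 108
N, V, I, C = (np.random.rand(q) for _ in range(4))   # placeholder components
S = quality(N, V, I, C)
ranking = np.argsort(S)[::-1]          # descriptor indices, best score first
print(ranking[:5])
```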
3. TIMBRE SPACE MODELING AND
MAPPING
Audio descriptors have been extensively used for visuali-
zation, measurement, classification, and recognition of
sounds. Works proposing the timbre as a control structure
for sound synthesis [12] or for interactive sonic systems
have recently proliferated [13][24]. These allow for ex-
plicit control of psychoacoustic characteristics of the
generated sound, hiding synthesis parameters from users,
simplifying the user interaction, facilitating the search for
specific timbres, and enhancing the expressivity of the
system. Similar benefits are provided by synthesis meth-
ods using a timbre representation derived from a prior analysis stage of the target sound [25], [26]. A model relating
parameters to sonic response of the sound synthesizer is
necessary to implement explicit timbre control. Our ge-
neric approach, introduced in [7] and extended here, de-
rives a model from the prior analysis stage, and therefore
it is independent of the specific synthesis method and
implementation.
The generative mapping is based on unsupervised ma-
chine learning techniques, and it provides a low dimen-
sional and perceptually related synthesis control. The
mapping maximizes the breadth of the explorable sonic
space covered by the synthesis space I, and minimizes
possible timbre losses due to the reduced dimensionality
of the control space (i.e. few-to-many mapping). The
timbre response analysis described in the previous section
returns a synthesis space I, with dimensionality k, and a
descriptor space D, with dimensionality q. Both spaces
present n entries i and d, which are pairwise associated,
representing a basic model relating parameters and tim-
bre. Hence we can explicitly express a timbre through the
q descriptors (e.g. mapped on a large bank of faders), find
the closest entry in D, and drive the synthesizer with
the associated parameter set i. Such control is affected by
several drawbacks: the high dimensionality of the timbre-
based control, with q generally much greater than k; the
lack of accuracy due to the large parameter step size we
use in the analysis stage (3); entries in the timbre space D
are not evenly distributed as in I, hence regions of D with
low density determine a poor system response.
The real dimensionality of D is usually much less than
q. Generally the data of interest lies on an embedded non-
linear manifold within the q-dimensional space. There-
fore we reduce the dimensionality of D, using Isomap,
down to two or three dimensions, which are easy to map
to general-purpose controllers with low cognitive com-
plexity. In the TSAM users can explore the application of
34 different dimensionality reduction methods [27].
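A minimal sketch of the reduction step using scikit-learn's Isomap; the TSAM uses the toolbox from [27], so the library, neighborhood size, and data here are stand-ins.

```python
import numpy as np
from sklearn.manifold import Isomap

D = np.random.rand(500, 108)           # n descriptor vectors, q = 108
D_star = Isomap(n_neighbors=10, n_components=3).fit_transform(D)
print(D_star.shape)                    # (500, 3): the reduced timbre space D*
```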
Before reducing the dimensionality of D, we use the quality metric $S_x$ to discard those descriptors with a low score. Particularly noisy or poorly correlated descriptors present a large variance that has a significant impact in the dimensionality reduction stage, but this variance is not representative of the parameter-to-timbre relationship, and it corrupts the timbre space mapping. The selection of descriptors based on the quality metric yields improvements in accuracy and usability over our previous approach. Alternatively, users can bypass the dimensionality reduction stage and explicitly specify the two or three descriptors composing the low-dimensional timbre space we use for the mapping to synthesis parameters.
To address the issue of the possible unresponsiveness of the timbre space due to the arbitrary distribution of entries in D, we apply an iterative algorithm based on Voronoi tessellation, derived from [28], that redistributes the n entries d into a uniformly distributed square or cube, while preserving the local neighborhood relationships (a homomorphic transformation). The inverse of this transformation represents the mapping required to project a generic multidimensional control space C onto the case-specific timbre space. Hence we use an Artificial Neural Network (ANN) to learn a function m(·) approximating the inverse of the redistribution process. We use m(·) to project the generic multidimensional control vector c onto the dimensionally reduced timbre space D*. The ANN includes a single hidden layer and therefore can be trained efficiently using a non-iterative algorithm [29]. In Figure 1 we show an example of a highly clustered timbre space reduced to three dimensions, and its transformation to a uniform cube. The side arrows identify the two stages of the mapping computation. In the TSAM we also provide an alternative mapping, skipping the ANN and computing the synthesis parameters directly from the uniformly distributed timbre space.
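The inverse-of-redistribution regression could be sketched as below. scikit-learn's iteratively trained MLPRegressor stands in for the single-hidden-layer network with non-iterative training [29] used in the TSAM, and both point sets are random placeholders for the reduced timbre space and its uniform redistribution.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

D_star = np.random.rand(500, 3)        # reduced timbre space (placeholder)
U = np.random.rand(500, 3)             # its uniform redistribution (placeholder)

# Learn m(.): from the uniform control-side space back into D*.
m = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(U, D_star)
c = np.array([[0.5, 0.5, 0.5]])        # a generic low-dimensional control vector
print(m.predict(c))                    # m(c): the point in D* to interpolate around
```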
In the final stage of the mapping we compute the parameters to interact with the sound synthesizer. We use d* to represent a descriptor vector in the dimensionality-reduced timbre space D*. Driving the synthesis with the parameters i associated with the d* closest to m(c) may lead to discontinuities, which in turn may generate glitches in the sonic output. These are due to the coarse parameter step size used in the analysis stage, and to the relationship between parameters and sound not being one-to-one. Two synthesis states i, far apart in the synthesis state space I, may be associated with identical or similar descriptor vectors d, and hence be close in D. The latter is an implicit drawback of any method for controlling sound synthesis from a representation of the generated signal.
Figure 1. Example of a timbre space reduced to three
dimensions, and related transformation to a uniform cube.
We address these issues by computing the synthesis parameters through spatial interpolation, including only entries of D* from the neighborhood of the current state i. The set of parameters driving the synthesizer, $\mathbf{i}_{synth}$, is computed by Inverse Distance Weighting (IDW) as in Equations (13) and (14), where $\lVert\cdot\rVert$ represents the Euclidean distance.
$$\mathbf{i}_{synth} = \frac{\displaystyle\sum_{j=1}^{N} q_{j}\!\left(m(\mathbf{c})\right) \mathbf{i}_{j}}{\displaystyle\sum_{j=1}^{N} q_{j}\!\left(m(\mathbf{c})\right)} \quad (13)$$

$$q_{j}\!\left(m(\mathbf{c})\right) = \frac{1}{\left\lVert m(\mathbf{c}) - \mathbf{d}^{*}_{j} \right\rVert^{p}} \quad (14)$$
In (13) and (14), N represents the total number of points considered in the interpolation, and the $\mathbf{i}_j$ in (13) are those pairwise associated with the $\mathbf{d}^{*}_{j}$ in (14). In the TSAM, instead of using the N points $\mathbf{d}^{*}_{j}$ closest in D*, we select those $\mathbf{d}^{*}_{j}$ that limit the maximum variation of $\mathbf{i}_{synth}$ between two consecutive iterations, that is, the set of $\mathbf{d}^{*}_{j}$ associated with the $\mathbf{i}_j$ close to the current $\mathbf{i}_{synth}$ (within a user-defined distance). In Figure 2 we show an example of this selection of interpolation points, where the green entries are the $\mathbf{d}^{*}_{j}$ related to $\mathbf{i}_j$ close to the current $\mathbf{i}_{synth}$, which is in turn associated with the yellow entry in the figure. The set of $\mathbf{d}^{*}_{j}$ used for the IDW interpolation may include entries distant from m(c), but these contribute little in (13). In the IDW, p represents the power parameter, which determines the influence of each point based on its distance. This value should be larger than the dimensionality of the reduced timbre space D*; increasing p gives closer points larger weight. In the TSAM, the maximum instantaneous distance of $\mathbf{i}_{synth}$ and the interpolation power parameter p are among the options exposed to users to tune the timbre mapping response in real time. The TSAM provides interactive timbre space visualizations, such as those in Figures 1 and 2.
Figure 2. Detail of a timbre space reduced to three dimensions. The green entries are those used in the interpolation to compute the synthesis parameters, because they are close to the yellow current entry in the synthesis state space.
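A sketch of the IDW computation of Equations (13) and (14); for simplicity it selects the N nearest entries in D*, rather than the TSAM's selection based on parameter-space distance described above.

```python
import numpy as np

def idw_parameters(mc, D_star, I, N=8, p=4):
    """Equations (13)-(14): IDW average of the parameter vectors i_j paired
    with the selected entries d*_j; here the N entries nearest to m(c)."""
    dist = np.linalg.norm(D_star - mc, axis=1)
    nearest = np.argsort(dist)[:N]
    w = 1.0 / (dist[nearest] ** p + 1e-12)     # Equation (14)
    return w @ I[nearest] / w.sum()            # Equation (13)

D_star, I = np.random.rand(500, 3), np.random.rand(500, 5)  # n=500, k=5
print(idw_parameters(np.array([0.5, 0.5, 0.5]), D_star, I))
```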
4. IMPLEMENTATION AND USAGE
The TSAM (http://stefanofasciani.com/tsam.html) is open-source software implemented in Max/MSP using the FTM extension (http://ftm.ircam.fr/) [30], supported by a background engine written and compiled in MATLAB. The analysis of the synthesis timbre, the real-time timbre space mapping, and the visualizations are computed in Max/MSP, whereas the background engine computes the descriptor quality and the timbre space mapping (dimensionality reduction, redistribution, ANN training), taking as input the outcome of the analysis stage. The two components of the system communicate via the Open Sound Control (OSC) protocol, and large matrices are exchanged using files. The TSAM can host software synthesizers developed using Steinberg's Virtual Studio Technology (VST): it acts as a wrapper for the VST synth, providing a fully integrated environment. The TSAM allows full control of all parameters for analysis and mapping purposes. It captures the synthesized signal for descriptor computation and playback, and manages the global state of the synthesizer when saving and restoring presets. Figure 3 shows a screenshot of the main TSAM GUI, which exposes a large number of options for further exploration of the mapping method we propose, and also for customizing analysis, mapping computation, real-time control, and
visualization. Default settings are provided for basic use.
Users can load a VST synth and select up to 10 variable
parameters, their range, analysis step size, and the num-
ber of vectors m per state i. Advanced analysis options
include digital signal processing settings and analysis
timing with respect to the synthesis triggering (note-on
and note-off messages). The TSAM estimates and shows
the total analysis time, and users may opt to reduce the
parameter step sizes, in (3), when this is excessive.
Thereafter the analysis is carried out automatically. In
Section 2 we discussed two analysis modes, 'sustain' and 'envelope'. Besides the automatic mode, these can also be carried out manually: users arbitrarily tune the synthesizer to a specific state i and request the descriptor analysis of the related sonic response (both modes are supported). Furthermore, we included the interactive 'sustain' analysis mode [7], where descriptor vectors d are computed while users vary the MIDI-mapped synthesis parameters in real time, dynamically generating a stream of i. The latter analysis mode does not guarantee an identical number of descriptor vectors d per state i, hence the noisiness component of the quality metric may be inconsistent.
When the analysis stage is completed, users can request
the computation of the descriptor quality metric, which is
visualized in the TSAM as shown in Figure 4. In the de-
scriptors page, users can also specify the weights of
Equation (12), enable the normalization of its compo-
nents, find and rank the descriptors by highest score, ob-
serve the synthesis parameter ranking, and find the high-
est and lowest correlation between each parameter and
descriptor. Furthermore, users can specify which subset
of the 108 descriptors will be used for mapping purposes.
Options for the timbre space mapping computation in-
clude the dimensionality of the map, selection of the di-
mensionality reduction technique and the ANN activation
function. The mapping can be tuned at runtime using the
settings discussed in Section 3. The timbre analysis, qual-
ity metric, and mapping are saved into files that can be
individually recalled through the TSAM presets.
5. DISCUSSION AND FUTURE WORK
We presented a generic tool that integrates functionalities
to study and map the timbre of sound synthesizers. Pre-
liminary studies demonstrated that the adoption of large
sets of descriptors, and their selection based on the novel
quality metric, improves the accuracy of the timbre-based
interaction. The TSAM can be used for the study of the
sonic response of synthesizers, for an explicit control of
timbral character, or for a reduction of the synthesis con-
trol space, exposing only a few perceptually relevant con-
trol dimensions. Previous user studies on a system with a
similar mapping approach demonstrated that synthesis
parameters become transparent to users [31], who are then exclusively focused on the timbral interaction. Future
works include user studies with the TSAM to evaluate the
effectiveness of the timbre-based mapping, comparing it
against traditional and alternative approaches to sound
synthesis interaction, in performing and sound design
scenarios. Moreover we will investigate the relevance of
different descriptor categories for a more perceptually
related sonic control.
Figure 3. TSAM main page, including options for analysis, mapping computation, real-time control, and visualization.
Figure 4. TSAM descriptor page, providing an insight into the timbre response and parameter relationship of the synth.
6. REFERENCES
[1] T. Wishart, On Sonic Art. Harwood Academic Publishers, 1996.
[2] S. McAdams and A. Bregman, "Hearing musical streams," Comput. Music J., vol. 3, no. 4, pp. 26–43, 60, 1979.
[3] J. C. Risset and D. Wessel, "Exploration of timbre by analysis and synthesis," Psychol. Music, pp. 113–169, 1999.
[4] S. Fels, "Intimacy and embodiment: implications for art and technology," in Proc. of the 2000 ACM Workshops on Multimedia, 2000, pp. 13–16.
[5] E. R. Miranda and M. M. Wanderley, New Digital Musical Instruments: Control and Interaction Beyond the Keyboard. A-R Editions, Inc., 2006.
[6] D. Arfib, J. M. Couturier, L. Kessous, and V. Verfaille, "Strategies of mapping between gesture data and synthesis model parameters using perceptual spaces," Organised Sound, vol. 7, no. 2, pp. 127–144, Aug. 2002.
[7] S. Fasciani, "Interactive computation of timbre spaces for sound synthesis control," in Proc. of the 2nd Int. Symposium on Sound and Interactivity, Singapore, 2015.
[8] W. A. Sethares, Tuning, Timbre, Spectrum, Scale. Springer Science & Business Media, 2005.
[9] S. Fasciani and L. Wyse, "Adapting general purpose interfaces to synthesis engines using unsupervised dimensionality reduction techniques and inverse mapping from features to parameters," in Proc. of the 2012 Int. Computer Music Conf., Ljubljana, Slovenia, 2012.
[10] A. Zacharakis, K. Pastiadis, and J. D. Reiss, "An interlanguage unification of musical timbre," Music Percept., vol. 32, no. 4, pp. 394–412, Apr. 2015.
[11] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," IRCAM, 2004.
[12] D. Wessel, "Timbre space as a musical control structure," Comput. Music J., vol. 3, no. 2, pp. 45–52, 1979.
[13] A. Lazier and P. R. Cook, "Mosievius: feature driven interactive audio mosaicing," in Proc. of the 7th Int. Conf. on Digital Audio Effects, Napoli, Italy, 2003.
[14] M. Puckette, "Low-dimensional parameter mapping using spectral envelopes," in Proc. of the 2004 Int. Computer Music Conf., Miami, US, 2004.
[15] C. Nicol, S. A. Brewster, and P. D. Gray, "Designing sound: towards a system for designing audio interfaces using timbre spaces," in Proc. of the 10th Int. Conf. on Auditory Display, Sydney, Australia, 2004.
[16] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton, "Real-time corpus-based concatenative synthesis with CataRT," in Proc. of the 9th Int. Conf. on Digital Audio Effects, Montreal, Canada, 2006, pp. 279–282.
[17] M. Hoffman and P. R. Cook, "Feature-based synthesis: mapping acoustic and perceptual features onto synthesis parameters," in Proc. of the 2006 Int. Computer Music Conf., New Orleans, US, 2006.
[18] N. Schnell, M. A. S. Cifuentes, and J. P. Lambert, "First steps in relaxed real-time typo-morphological audio analysis/synthesis," in Proc. of the 7th Sound and Music Computing Int. Conf., Barcelona, Spain, 2010.
[19] T. Grill, "Constructing high-level perceptual audio descriptors for textural sounds," in Proc. of the 9th Sound and Music Computing Int. Conf., Copenhagen, Denmark, 2012.
[20] A. Seago, "A new interaction strategy for musical timbre design," in Music and Human-Computer Interaction, S. Holland, K. Wilkie, P. Mulholland, and A. Seago, Eds. Springer, 2013, pp. 153–169.
[21] A. Pošćić and G. Kreković, "Controlling a sound synthesizer using timbral attributes," in Proc. of the 10th Sound and Music Computing Int. Conf., Stockholm, Sweden, 2013.
[22] N. Klügel, T. Becker, and G. Groh, "Designing sound collaboratively: perceptually motivated audio synthesis," in Proc. of the 14th Int. Conf. on New Interfaces for Musical Expression, London, United Kingdom, 2014, pp. 327–330.
[23] S. Ferguson, "Using audio feature extraction for interactive feature-based sonification of sound," in Proc. of the 21st Int. Conf. on Auditory Display (ICAD 2015), Graz, Austria, 2015.
[24] S. Stasis, R. Stables, and J. Hockman, "A model for adaptive reduced-dimensionality equalisation," in Proc. of the 18th Int. Conf. on Digital Audio Effects (DAFx-15), Trondheim, Norway, 2015.
[25] X. Serra and J. Smith, "Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Comput. Music J., vol. 14, no. 4, pp. 12–24, 1990.
[26] T. Jehan and B. Schoner, "An audio-driven perceptually meaningful timbre synthesizer," in Proc. of the 2001 Int. Computer Music Conf., Havana, Cuba, 2001.
[27] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik, "Dimensionality reduction: a comparative review," Tilburg University Technical Report, 2009.
[28] H. Nguyen, J. Burkardt, M. Gunzburger, L. Ju, and Y. Saka, "Constrained CVT meshes and a comparison of triangular mesh generators," Comput. Geom., vol. 42, no. 1, pp. 1–19, Jan. 2009.
[29] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[30] N. Schnell, R. Borghesi, D. Schwarz, F. Bevilacqua, and R. Muller, "FTM - complex data structures for Max," in Proc. of the 2005 Int. Computer Music Conf., Barcelona, Spain, 2005.
[31] S. Fasciani, "Voice-controlled interface for digital musical instruments," Ph.D. Thesis, National University of Singapore, Singapore, 2014.