TSAM: A TOOL FOR ANALYZING, MODELING, AND

MAPPING THE TIMBRE OF SOUND SYNTHESIZERS

Stefano Fasciani

Faculty of Engineering and Information Sciences

University of Wollongong in Dubai

stefanofasciani@stefanofasciani.com

ABSTRACT

Synthesis algorithms often have a large number of adjust-

able parameters that determine the generated sound and

its resultant psychoacoustic features. The relationship

between parameters and timbre is important for end us-

ers, but it is generally unknown, complex, and difficult to

analytically derive. In this paper we introduce a strategy

for the analysis of the sonic response of synthesizers sub-

ject to the variation of an arbitrary set of parameters. We

use an extensive set of sound descriptors which are

ranked using a novel metric based on statistical analysis.

This enables the study of how changes to a synthesis pa-

rameter affect timbral descriptors, and provides a multi-

dimensional model for the mapping of the synthesis con-

trol through specific timbre spaces. The analysis, model-

ing and mapping are integrated in the Timbre Space Ana-

lyzer & Mapper (TSAM) tool, which enables further in-

vestigation into synthesis sonic response and on percep-

tually related sonic interactions.

1. INTRODUCTION

The timbre generated by a sound synthesis algorithm de-

pends on the values assigned to the variable parameters,

typically user configurable. Regardless of the synthesis

method, the relationship between control and perceptual

features of the resultant sound is generally weak [1] and

difficult to determine. Modern synthesis algorithms pre-

sent a wide timbre range and a high dimensional control

space. The timbre, which is central in modern sonic arts,

has high dimensionality as well [2] and a blurry scientific

definition [3]. For designers of sonic interactive systems

and of musical instruments, knowing the parameter-to-

timbre relationship supports the implementation of the

intended sonic response. For sound designers and per-

formers this knowledge eases the development of control

intimacy [4]. Also, this insight can help in improving the

expressivity of musical instruments by reducing the con-

trol dimensionality while broadening the timbral re-

sponse. The heuristic estimation of the parameter-to-

timbre causality is workable, but is subjective and inaccu-

rate. This task is challenging due to nonlinearities and

correlations in the synthesis process, especially when a

large set of variable parameters is involved.

We address this issue by proposing a systematic and

generic method to analyze the timbre in relation to the

synthesis variables. The collected data is then processed

by computing a quality metric for each sound descriptor,

composed of four weighted components, each represent-

ing a specific statistical characteristic. Additionally, qual-

ity metrics for synthesis parameters are provided as well.

This information can be used in designing the mapping of

musical gestures to the synthesis control, providing a

tighter causal link with the timbral response of the sys-

tem. The tool we present here, the Timbre Space Analyz-

er & Mapper (TSAM), integrates these functionalities and

supports implementation of few-to-many lossless map-

pings [5], through an intermediate timbre-related layer

[6]. The tool, after analyzing the sonic response of the

synthesizer, computes a reduced timbre-to-parameter

model, which supports real-time interaction with the

sound synthesizer. In particular, we integrate an exten-

sion of the modeling and mapping strategy we introduced

in [7], highlighting the enhancement achieved when con-

sidering the quality metric for selecting the descriptor for

mapping purposes.

The TSAM is a flexible tool, exposing internal compu-

tation settings and options on a Graphical User Interface

(GUI), which supports a range of applications and aims.

The perceptual characteristics of synthesis methods can be

studied, characterized, and compared numerically or

graphically. The relationship between timbre, spectrum

and different musical scales can be investigated [8]. Different mapping approaches for musical instruments can be

explored and compared. The rest of this paper is orga-

nized as follows. In Section 2 we describe the synthesis

analysis procedure and present the quality metric for de-

scriptors and parameters. Section 3 provides a summary

of the timbre space mapping strategy. The TSAM implementation is detailed in Section 4. Finally, Section 5 concludes with discussion and future work.

2. TIMBRE RESPONSE ANALYSIS

Understanding the sonic variation resulting from tweaking parameters is common when getting familiar with a sound synthesizer. Different users may have distinct intents. Sound designers aim at synthesizer configurations generating their desired sound, whereas performers and instrument builders look for a mapping that yields sonic expressivity. Synthesizers generally feature a large number of controllable parameters, representing the synthesis algorithm variables. In analog synthesizers, each parameter can theoretically assume an infinite number of values, while in digital (or software) synthesizers we have more than 4 billion possible values if considering single-precision implementations (32 bit). Synthesizers interfaced using the MIDI protocol allow only up to 128 distinct values per parameter (7 bit), regardless of the resolution of the internal circuitry. However, with only three MIDI-controlled parameters we have more than 2 million ($2^{21}$) different parameter permutations or unique synthesis states. This combinatorial explosion limits the feasibility of a comprehensive analysis of all the timbres resulting from each of these states.

Copyright: © 2016 Stefano Fasciani. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Limiting the dimensionality of the parameter space al-

lows coping with the large number of synthesis states to

analyze, leaving only a few variable parameters and fixing

the remaining to specific values. In this case the timbre

analysis is limited to a subset of the entire parameter

space, which is a scenario equivalent to users tweaking

only a few parameters of a synthesis configuration (or

preset). To further reduce the number of states to analyze

we use the principle of spatial locality: states close in the

parameter space generate similar timbres. Therefore we

can sample the parameter space with a larger step size,

and eventually interpolate at a later stage. This principle

is generally true if we exclude synthesis algorithms fea-

turing stochastic components, and parameters with strong

nonlinearities (e.g. binary switches). Generally, the oppo-

site of this principle does not hold. Proximity in the tim-

bre space does not necessarily imply similar parameter

configuration. The TSAM itself can be used to verify

these principles. A further reduction can be achieved lim-

iting the individual range of interest of each parameter.

Given k variable synthesis parameters, the synthesis state

space I (set of unique parameter permutations) is given

by the Equations (1)-(3) [9].

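$$\mathbf{I} = [\mathbf{i}_1, \mathbf{i}_2, \ldots, \mathbf{i}_n] \qquad (1)$$

$$\mathbf{i} = [i_1, i_2, \ldots, i_k] \qquad (2)$$

$$n = \prod_{j=1}^{k} \frac{\max(i_j) - \min(i_j)}{\operatorname{step}(i_j)} \qquad (3)$$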

Each synthesis state is represented with a vector i with

dimensionality k, as in Equation (2), while n, the number

of vectors in I, depends on the individual range and step

size of the k parameters, as in (3). I is the synthesis state

space we consider for the timbre analysis, presenting di-

mensionality k and cardinality n.
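As a minimal sketch of Equations (1)-(3), the state space can be enumerated as the Cartesian product of per-parameter grids. The helper names `count_states` and `build_state_space` and the `(minimum, maximum, step)` triples are illustrative assumptions, not part of the TSAM. Each range is treated as half-open so that a parameter contributes (max − min)/step values when the step divides the range exactly.

```python
from itertools import product
from math import prod


def count_states(param_specs):
    """Cardinality n of the state space, as in Equation (3).

    param_specs is a list of (minimum, maximum, step) integer triples;
    each range is assumed half-open: [minimum, maximum).
    """
    return prod(len(range(lo, hi, step)) for lo, hi, step in param_specs)


def build_state_space(param_specs):
    """Enumerate every state vector i of the state space I."""
    grids = [range(lo, hi, step) for lo, hi, step in param_specs]
    return [list(state) for state in product(*grids)]


# Three hypothetical MIDI-controlled parameters (values 0..127).
full_resolution = [(0, 128, 1)] * 3
coarse_grid = [(0, 128, 16)] * 3

print(count_states(full_resolution))  # 2097152, i.e. 2**21
coarse = build_state_space(coarse_grid)
print(len(coarse), coarse[0], coarse[-1])
```

Counting before enumerating is deliberate: the full-resolution space is only counted, while the coarse grid (8 values per parameter) is small enough to be materialized and analyzed.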

2.1 Descriptors Set and Computation

For each state i of the sound synthesizer we compute a set

of audio descriptors, that we indicate with d, representing

the timbral descriptors of the resulting synthetic sound. A

large set of low-level computational descriptors, includ-

ing eventual redundancies, is essential for the detailed

timbre analysis we require in this context. A few higher-

level timbre descriptors (e.g. brightness, noisiness, color-

ation), often subjective and language dependent semantic

[10], are suitable to discriminate sounds with major tim-

bral differences, but in this context they fail to capture the

subtle sonic nuances determined by small variations of

the synthesis parameters.

A posterior descriptor selection is possible considering

the quality metric we present in this paper. The method is

independent of the specific descriptors set. In the TSAM

we use the CUIDADO features set [11] implemented in

the IRCAM descriptors object for Max/MSP. The set

includes spectral and perceptual features listed in Table 1.

It includes 24 scalar and 7 vectorial descriptors, as speci-

fied in the dimensionality column, resulting in a dimen-

sionality q of d equal to 108, as in (4). Some of the scalar

descriptors in the set are closely related to traditional

timbre labels (e.g. spectral centroid to brightness).

$$\mathbf{d} = [d_1, d_2, \ldots, d_q] \qquad (4)$$

Descriptor Name                  Dimensionality
Total Energy                     1
Signal Zero Crossing Rate        1
Spectral Centroid                1
Spectral Crest                   4
Spectral Decrease                1
Spectral Flatness                4
Spectral Kurtosis                1
Spectral Rolloff                 1
Spectral Skewness                1
Spectral Slope                   1
Spectral Spread                  1
Spectral Variation               1
Perceptual Odd To Even Ratio     1
Perceptual Spectral Centroid     1
Perceptual Spectral Decrease     1
Perceptual Spectral Deviation    1
Perceptual Spectral Kurtosis     1
Perceptual Spectral Rolloff      1
Perceptual Spectral Skewness     1
Perceptual Spectral Slope        1
Perceptual Spectral Spread       1
Perceptual Spectral Variation    1
Perceptual Tristimulus           3
Sharpness                        1
Spread                           1
Noise Energy                     1
Noisiness                        1
Chroma                           12
MFCC                             13
Relative Specific Loudness       24
Perceptual Model                 24

Table 1. List of descriptors used in the TSAM.
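The dimensionality q = 108 quoted above can be checked directly from Table 1. The snippet below is only an arithmetic sketch; the dictionary and constant names are illustrative and do not correspond to identifiers in the TSAM or in the IRCAM descriptors object.

```python
# Dimensionality of each vectorial descriptor, copied from Table 1.
VECTOR_DESCRIPTORS = {
    "Spectral Crest": 4,
    "Spectral Flatness": 4,
    "Perceptual Tristimulus": 3,
    "Chroma": 12,
    "MFCC": 13,
    "Relative Specific Loudness": 24,
    "Perceptual Model": 24,
}

# Every remaining row of Table 1 has dimensionality 1.
SCALAR_DESCRIPTORS = 24

descriptor_count = SCALAR_DESCRIPTORS + len(VECTOR_DESCRIPTORS)
q = SCALAR_DESCRIPTORS + sum(VECTOR_DESCRIPTORS.values())

print(descriptor_count)  # 31 rows in Table 1
print(q)                 # 108 components in each vector d
```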

The descriptors listed above are computed on a short

temporal window, typically in the range 2 ms to 200 ms.

They provide an instantaneous sonic representation suffi-

cient to characterize only absolutely periodic sounds. In

synthesis states we may observe and hear low rate timbre

variations, spanning beyond the largest temporal window

we consider for the descriptors. Hence an appropriate

characterization of the timbre requires computation and

merging of descriptors computed from multiple short time windows. We propose two analysis modes, named ‘sustain’ and ‘envelope’. In the first, given a synthesis

state i, we compute m descriptor vectors and we combine

these taking their mean and optionally their range, as in

Equation (5), doubling the dimensionality of the de-

scriptor set. The second approach simply concatenates the

m descriptor vectors into a single vector, as in Equation

(6), increasing the dimensionality by m times.

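$$\mathbf{d}_{\mathbf{i}} = \begin{bmatrix} \operatorname{mean}(\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_m) \\ \max(\mathbf{d}_1, \ldots, \mathbf{d}_m) - \min(\mathbf{d}_1, \ldots, \mathbf{d}_m) \end{bmatrix} \qquad (5)$$

$$\mathbf{d}_{\mathbf{i}} = \begin{bmatrix} \mathbf{d}_1 \\ \vdots \\ \mathbf{d}_m \end{bmatrix} \qquad (6)$$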

Considering the synthesis as a binary process, and the

sound generated as almost periodic, the first approach

provides a sufficient approximation of the timbre. When the synthesis produces dynamic timbres, such as texture-like sounds, or when ADSR envelopes are applied to amplitude and other parameters, the second approach is preferred. However, even in the presence of ADSR envelopes we can still use the first approach and analyze only the sustain phase of the synthesis, either to intentionally discard the attack, decay and release phases, or because these do not significantly change within the parameter space I we analyze.

The concatenation of short-term static descriptors to

analyze timbral dynamics is a simplification with respect

to the use of dynamic descriptors computed on longer

temporal windows. However this approach reduces the

time needed to execute the timbre analysis and allows

users to change the merging mode from ‘sustain’ to ‘en-

velope’ and vice versa without repeating the analysis.
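The two merging modes of Equations (5) and (6) can be sketched as follows, assuming the m short-term descriptor vectors of a state are stored as a list of equal-length lists. The function names are hypothetical and not taken from the TSAM.

```python
def merge_sustain(vectors, include_range=True):
    """Equation (5): per-descriptor mean, optionally followed by the range."""
    columns = list(zip(*vectors))  # one tuple of m values per descriptor
    merged = [sum(col) / len(col) for col in columns]
    if include_range:
        merged += [max(col) - min(col) for col in columns]
    return merged


def merge_envelope(vectors):
    """Equation (6): concatenate the m descriptor vectors in time order."""
    return [value for vec in vectors for value in vec]


# m = 3 hypothetical descriptor vectors with q = 2 descriptors each.
frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 13.0]]

print(merge_sustain(frames))   # [2.0, 11.0, 2.0, 3.0]
print(merge_envelope(frames))  # [1.0, 10.0, 2.0, 10.0, 3.0, 13.0]
```

Because both functions start from the same stored per-window vectors, the merging mode can be changed after the fact without driving the synthesizer again, which is the property the paragraph above relies on.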

In the TSAM implementation, presented in Section 4,

the computation of the descriptors is completely automat-

ed. Users are only required to identify the k variable pa-

rameters of the synthesizer, their range, step size, number

of descriptor vectors per state 𝑚, and analysis mode. The

tool computes I and drives the synthesizer with one i at a

time, computing and storing 𝑚 vectors d. For analysis in

envelope mode, the tool also manages the triggering of

the synthesizer at every i. Users can further specify the

temporal unfolding of the analysis, selecting only a sub-

set of the ADSR envelope. Advanced options related to

the descriptor computation, such as window size, hop

size, sampling rate, are exposed as well.

2.2 Descriptor Quality Metric

The quality metric we compute for each descriptor is

aimed at capturing the four characteristics listed below.

• Noisiness: deviation of the descriptor from its

mean given a synthesis state i.

• Variance: spread of descriptor value across the syn-

thesis state space I.

• Independence: uniqueness of the descriptor varia-

tion pattern across the synthesis state space I.

• Correlation: coherence of the descriptor variation

with synthesis parameters across the synthesis state

space I.

Ideally, a descriptor representative of I should present

low noisiness, high variance, high independence, and

high correlation. High noisiness indicates that a particular

descriptor and the associated timbral characteristic also

varies when synthesis parameters are fixed, and therefore

its eventual variance across I may be not significant. A

descriptor with low variance reveals that the related tim-

bral characteristic does not change significantly when

varying the synthesis parameters. Descriptors varying

with a similar trend are redundant, and thus less signifi-

cant, when computing a dimensionality-reduced timbre

space modeling I, instead those more independent carry a

larger amount of information. Descriptors can also be

highly independent when varying randomly across I. We

address this by also including the correlation between

descriptor and parameters in the metric, as we expect

representative descriptors to change accordingly to one or

more synthesis parameter.

For each descriptor, we compute the noisiness $N_{x,\mathbf{i}}$ from the m descriptor vectors in synthesis state i before these are merged, as per Equations (5) and (6). The subscript x is the index identifying the descriptor across the set of q computed in the TSAM. For ‘sustain’ mode, we measure the deviation of the descriptor x in the state i using the Relative Mean absolute Difference (RMD), as in Equation (7). The RMD is a scale-invariant measure of statistical dispersion, hence it allows the comparison of heterogeneous descriptors. For ‘envelope’ mode, $N_{x,\mathbf{i}}$ is estimated as the zero crossing rate, as in Equation (8), of the forward second order finite difference (discrete approximation of the second order derivative) of the series of m descriptors, as in (9). This represents the rate at which a descriptor inverts its trend (from increasing to decreasing and vice versa) in the analyzed envelope. Noisy descriptors invert their trends at higher rates.

$$N_{x,\mathbf{i}} = \frac{\sum_{j=1}^{m} \sum_{k=1}^{m} \left| d_{x,j} - d_{x,k} \right|}{(m-1) \sum_{j=1}^{m} d_{x,j}} \qquad (7)$$

$$N_{x,\mathbf{i}} = \frac{1}{m-2} \sum_{j=1}^{m-3} \mathbb{I}\left( \Delta^2 d_{x,j} \, \Delta^2 d_{x,j+1} < 0 \right) \qquad (8)$$

$$\Delta^2 d_{x,j} = \sum_{k=0}^{2} \binom{2}{k} (-1)^{2-k} d_{x,j+k} \qquad (9)$$

In Equations (7)-(9), $d_{x,j}$ represents the x-th descriptor in the set of q, from the j-th vector d out of the m computed for each state i. The indicator function $\mathbb{I}(\cdot)$ is equal to 1 if its argument is true, 0 otherwise. $\Delta^2$ is the forward second order finite difference function. The overall noisiness of each descriptor $N_x$ is computed by taking the average over the set of unique synthesis states I we analyze.
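The two noisiness estimates can be sketched in plain Python. The function names are illustrative and not taken from the TSAM. `rmd` assumes the descriptor values do not sum to zero, and `envelope_noisiness` assumes at least three values per state; it counts sign changes between consecutive second differences and normalizes by m − 2.

```python
def rmd(values):
    """Relative mean absolute difference of one descriptor within a state."""
    m = len(values)
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / ((m - 1) * sum(values))


def second_differences(values):
    """Forward second-order finite differences: d[j+2] - 2*d[j+1] + d[j]."""
    return [values[j + 2] - 2 * values[j + 1] + values[j]
            for j in range(len(values) - 2)]


def envelope_noisiness(values):
    """Rate of sign changes in the series of second differences."""
    diffs = second_differences(values)
    flips = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    return flips / (len(values) - 2)


steady = [2.0, 2.0, 2.0]            # no variation within the state
smooth = [0, 1, 4, 9, 16]           # constant curvature, no sign change
zigzag = [0, 1, 0, 1, 0, 1]         # trend inverts at every step

print(rmd(steady), rmd([1.0, 2.0, 3.0]))
print(envelope_noisiness(smooth), envelope_noisiness(zigzag))
```

The overall noisiness of a descriptor would then be the average of these per-state values over all analyzed states.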

Variance, independence, and correlation are computed across I, after the m descriptors are merged as in (5)-(6). The same method is used for both ‘sustain’ and ‘envelope’ modes. The variance $V_x$ is computed as the RMD over the n synthesis states i. We use the same expression as in (7), replacing m with n, but in this case $d_{x,j}$ is the x-th descriptor in the set of q, from the j-th vector d out of the n we compute across I.

We assume that descriptors are independent if poorly correlated, therefore we compute $I_x$ taking the complement of the averaged absolute value of the correlation coefficient between the descriptor x and the other q-1 descriptors over I, as in Equation (10). Both positive and negative correlations indicate dependence, therefore we take the absolute value of the correlation coefficient $\operatorname{corr}(\cdot)$. We subtract 1 from the summation to remove the correlation coefficient of the descriptor with itself, when j=x. Finally, the correlation $C_x$ between descriptors and parameters is computed taking the average correlation coefficient between the x-th descriptor and the k variable synthesis parameters, as in Equation (11).

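$$I_x = 1 - \frac{1}{q-1} \left( \sum_{j=1}^{q} \left| \operatorname{corr}(\mathbf{d}_{x,\mathbf{I}}, \mathbf{d}_{j,\mathbf{I}}) \right| - 1 \right) \qquad (10)$$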

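$$C_x = \frac{1}{k} \sum_{j=1}^{k} \left| \operatorname{corr}(\mathbf{d}_{x,\mathbf{I}}, \mathbf{i}_{j,\mathbf{I}}) \right| \qquad (11)$$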

In (10) and (11), $\mathbf{d}_{x,\mathbf{I}}$ represents the vector containing the n values of the x-th descriptor computed over the synthesis state space I, while $\mathbf{i}_{j,\mathbf{I}}$ represents the vector containing the n values of the j-th synthesis parameter over I. Note that according to (5) and (6) each descriptor may contribute with more than one component in each vector d. In particular, for ‘sustain’ mode we have two components per descriptor if the range is included in the analysis, whereas for the ‘envelope’ mode we have m components per descriptor. Therefore we compute multiple $V_x$, $I_x$ and $C_x$ for each x-th descriptor, and use their average in the quality metric we introduce next.
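Independence and correlation can be sketched with a hand-written Pearson coefficient, assuming each descriptor and each parameter is stored as a list of n values over the state space and that none of them is constant (a constant series has no defined correlation). All names below are illustrative rather than taken from the TSAM.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def independence(x, descriptors):
    """Complement of the mean |corr| between descriptor x and the others."""
    q = len(descriptors)
    total = sum(abs(pearson(descriptors[x], d)) for d in descriptors)
    return 1 - (total - 1) / (q - 1)  # the -1 removes corr of x with itself


def parameter_correlation(x, descriptors, parameters):
    """Mean |corr| between descriptor x and each synthesis parameter."""
    k = len(parameters)
    return sum(abs(pearson(descriptors[x], p)) for p in parameters) / k


# Hypothetical data over n = 4 states with k = 2 parameters.
parameters = [[0, 0, 1, 1], [0, 1, 0, 1]]
descriptors = [
    [0, 0, 2, 2],  # follows the first parameter
    [1, 3, 1, 3],  # follows the second parameter
    [0, 0, 4, 4],  # redundant with the first descriptor
]

print(independence(0, descriptors), independence(1, descriptors))
print(parameter_correlation(0, descriptors, parameters))
```

In this toy data the first descriptor is penalized on independence because the third one varies with the same pattern, while the second descriptor is fully independent of both.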

The quality metric $S_x$ of each descriptor is computed from the individual noisiness, variance, independence, and correlation as in Equation (12). The noisiness, being an undesirable feature, lowers the value of $S_x$. The four components are combined using individual weights w.

$$S_x = w_V V_x + w_I I_x + w_C C_x - w_N N_x \qquad (12)$$

The selection of the w values depends on the aim and context of the timbre analysis, and also on individual preferences. For instance, when analyzing a synthesizer configuration with a texture-like timbre, we expect considerable sonic variation within each synthesis state i, therefore the noisiness has no significance and $w_N$ should be close to zero. If the purpose of the analysis is the sole study of the synthesizer timbre through the descriptors, their independence has little relevance. Instead, when descriptors are used for mapping purposes, as in Section 3, the independence has a higher significance. In the TSAM, the default values of the weights are 0.33 for variance, independence and correlation, and 0.66 for noisiness. Users can change these in the unitary range. The four components of the quality metric have different ranges. $I_x$ and $C_x$ span the range [0,1], while $N_x$ and $V_x$ can be zero but do not have a theoretical maximum. In the TSAM we include the option to normalize these to the unitary range, easing the balancing through individual weights. However, when comparing the quality metrics $S_x$ across different synthesizers, or between different state spaces I of the same synthesizer, normalization should not be used. In the TSAM we also rank the k synthesis parameters by their average correlation with the q descriptors, computed as in (11) but replacing k with q and taking x as the summation index. Furthermore, for each parameter, the TSAM displays the two descriptors with the highest and lowest associated correlation, and vice versa.
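A sketch of how the four components might be combined and ranked is given below, using the default weights stated above (0.33, 0.33, 0.33, and 0.66). The function names, the argument order, and the min-max form of the optional normalization are assumptions made for illustration; the text only states that components can be normalized to the unitary range.

```python
def normalize(values):
    """Rescale a list to the unitary range [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quality_scores(variance, independence, correlation, noisiness,
                   weights=(0.33, 0.33, 0.33, 0.66), unit_range=False):
    """Weighted sum of variance, independence, correlation minus noisiness."""
    if unit_range:
        variance, independence, correlation, noisiness = (
            normalize(c) for c in (variance, independence, correlation, noisiness)
        )
    w_v, w_i, w_c, w_n = weights
    return [w_v * v + w_i * i + w_c * c - w_n * n
            for v, i, c, n in zip(variance, independence, correlation, noisiness)]


def rank(scores):
    """Descriptor indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda x: scores[x], reverse=True)


# Hypothetical component values for three descriptors.
scores = quality_scores(
    variance=[1.0, 0.2, 1.0],
    independence=[0.5, 1.0, 0.5],
    correlation=[0.5, 0.5, 0.0],
    noisiness=[0.0, 0.0, 1.0],
)
print(scores)
print(rank(scores))  # the noisy, uncorrelated third descriptor ranks last
```

Setting the last weight to zero reproduces the texture-like case discussed above, where noisiness is ignored.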

3. TIMBRE SPACE MODELING AND

MAPPING

Audio descriptors have been extensively used for visuali-

zation, measurement, classification, and recognition of

sounds. Works proposing the timbre as a control structure

for sound synthesis [12] or for interactive sonic systems

have recently proliferated [13]–[24]. These allow for ex-

plicit control of psychoacoustic characteristics of the

generated sound, hiding synthesis parameters from users,

simplifying the user interaction, facilitating the search for

specific timbres, and enhancing the expressivity of the

system. Similar benefits are provided by synthesis meth-

ods using a timbre representation derived by a prior anal-

ysis stage of the target sound [25], [26]. A model relating

parameters to sonic response of the sound synthesizer is

necessary to implement explicit timbre control. Our ge-

neric approach, introduced in [7] and extended here, de-

rives a model from the prior analysis stage, and therefore

it is independent of the specific synthesis method and

implementation.

The generative mapping is based on unsupervised ma-

chine learning techniques, and it provides a low dimen-

sional and perceptually related synthesis control. The

mapping maximizes the breadth of the explorable sonic

space covered by the synthesis space I, and minimizes

possible timbre losses due to the reduced dimensionality

of the control space (i.e. few-to-many mapping). The

timbre response analysis described in the previous section

returns a synthesis space I, with dimensionality k, and a

descriptor space D, with dimensionality q. Both spaces

present n entries i and d, which are pairwise associated,

representing a basic model relating parameters and tim-

bre. Hence we can explicitly express a timbre through the

q descriptors (e.g. mapped on a large bank of faders), find

the closest entry in D, and drive the synthesizer with

the associated parameter set i. Such control is affected by

several drawbacks: the high dimensionality of the timbre-

based control, with q generally much greater than k; the

lack of accuracy due to the large parameter step size we

use in the analysis stage (3); entries in the timbre space D

are not evenly distributed as in I, hence regions of D with

low density determine a poor system response.
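The basic lookup described in this paragraph amounts to a nearest-neighbor search over the pairwise-associated entries of D and I. The sketch below assumes both spaces are stored as parallel lists and uses the Euclidean distance; `nearest_state` is a hypothetical helper, not a TSAM function.

```python
from math import dist


def nearest_state(target, descriptor_space, state_space):
    """Return the state i paired with the entry of D closest to target."""
    index = min(range(len(descriptor_space)),
                key=lambda j: dist(target, descriptor_space[j]))
    return state_space[index]


# Hypothetical analysis result: n = 3 entries, q = 2 descriptors, k = 1 parameter.
D = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
I = [[0], [64], [127]]

print(nearest_state([0.9, 1.2], D, I))  # [64]
```

With a coarse analysis grid, any target between entries snaps to one of the stored states, which illustrates the accuracy and density drawbacks listed above.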

The real dimensionality of D is usually much less than

q. Generally the data of interest lies on an embedded non-

linear manifold within the q-dimensional space. There-

fore we reduce the dimensionality of D, using Isomap,

down to two or three dimensions, which are easy to map

to general-purpose controllers with low cognitive com-

plexity. In the TSAM users can explore the application of

34 different dimensionality reduction methods [27].

Before reducing the dimensionality of D, we use the quality metric $S_x$ to discard those descriptors with a low score. Particularly noisy or poorly correlated descriptors present a large variance that has a significant impact in the dimensionality reduction stage, but this would not be representative of the parameter-to-timbre relationship, corrupting the timbre space mapping. The selection of

descriptors based on the quality metric determines im-

provements in accuracy and usability against our previ-

ous approach. Alternatively, users can bypass the dimen-

sionality reduction stage, and explicitly specify the two or

three descriptors composing the low dimensional timbre

space we use for the mapping to synthesis parameters.

To address the issue of the possible unresponsiveness of the timbre space due to arbitrary distribution in D, we apply an iterative algorithm based on the Voronoi tessellation, derived from [28], that redistributes the n entries d into a uniformly distributed square or cube, while preserving the local neighborhood relationships (homomorphic transformation). The inverse of this transformation represents the required mapping to project a generic multidimensional control space C onto the case-specific timbre space. Hence we use an Artificial Neural Network (ANN) to learn a function $m(\cdot)$ approximating the inverse of the redistribution process. We use $m(\cdot)$ to project the generic multidimensional control vector c onto the dimensionally reduced timbre space D*. The ANN includes a single hidden layer and therefore can be trained efficiently using a non-iterative algorithm [29]. In Figure 1 we show an example of a highly clustered timbre space reduced to three dimensions, and its transformation to a uniform cube. The side arrows identify the two stages of the mapping computation. In the TSAM we also provide an alternative mapping, skipping the ANN and computing the synthesis parameters directly from the uniformly distributed timbre space.

In the final stage of the mapping we compute the parameters to interact with the sound synthesizer. We use d* to represent a descriptor vector in the dimensionality-reduced timbre space D*. Driving the synthesis with the parameters i associated with the d* closest to $m(\mathbf{c})$ may lead to discontinuities, which in turn may generate glitches in the sonic output. These are due to the coarse parameter step size used in the analysis stage, and to the relationship between parameters and sound not being one-to-one. Two synthesis states i, far apart in the synthesis state space I, may be associated with identical or similar descriptor vectors d, hence close in D. The latter is an implicit drawback of any method for controlling sound synthesis from any representation of the generated signal.

Figure 1. Example of a timbre space reduced to three

dimensions, and related transformation to a uniform cube.

We address these issues computing the synthesis parameters by spatial interpolation, including only entries of D* from the neighborhood of the current state i. The set of parameters driving the synthesizer $\mathbf{i}_{\mathrm{synt}}$ is computed by Inverse Distance Weighting (IDW) as in Equations (12) and (13), where $\lVert\cdot\rVert$ represents the Euclidean distance.

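$$\mathbf{i}_{\mathrm{synt}} = \frac{\sum_{j=1}^{N} \mathbf{q}_j(m(\mathbf{c})) \cdot \mathbf{i}_j}{\sum_{j=1}^{N} \mathbf{q}_j(m(\mathbf{c}))} \qquad (12)$$

$$\mathbf{q}_j(m(\mathbf{c})) = \frac{1}{\lVert m(\mathbf{c}) - \mathbf{d}_j^{*} \rVert^{p}} \qquad (13)$$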

In (12) and (13) N represents the total number of points considered in the interpolation, and the $\mathbf{i}_j$ in (12) are those pairwise associated with the $\mathbf{d}_j^{*}$ in (13). In the TSAM, instead of using the N closest points $\mathbf{d}_j^{*}$ in D*, we select those $\mathbf{d}_j^{*}$ that limit the maximum variation of $\mathbf{i}_{\mathrm{synt}}$ between two consecutive iterations, that is, the set of $\mathbf{d}_j^{*}$ associated with the $\mathbf{i}_j$ close to the current $\mathbf{i}_{\mathrm{synt}}$ (within a user-defined distance). In Figure 2 we show an example of this interpolation point selection, where the green entries are the $\mathbf{d}_j^{*}$ related to $\mathbf{i}_j$ close to the current $\mathbf{i}_{\mathrm{synt}}$, which is in turn associated with the yellow one in the figure. The set of $\mathbf{d}_j^{*}$ used for IDW interpolation may include entries distant from $m(\mathbf{c})$, but these will contribute little to (12). In the IDW, p represents the power parameter, which determines the influence of each point based on the distance. This value should be larger than the dimensionality of the reduced timbre space D*, and as p increases, closer points have larger weight. In the TSAM, the maximum instantaneous distance of $\mathbf{i}_{\mathrm{synt}}$ and the interpolation power parameter p are among the options exposed to users to tune the timbre mapping response in real time. The TSAM provides interactive timbre space visualizations, such as those in Figures 1 and 2.

Figure 2. Detail of a timbre space reduced to three di-

mensions. The green entries are those used in the interpo-

lation to compute the synthesis parameter, because close

to the yellow current entry in the synthesis state space.
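The interpolation step of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the TSAM implementation: the entries of D* and their associated states are stored as pairs, the function and argument names are hypothetical, and two edge cases not covered in the text (an exact hit on an entry, and an empty neighborhood) are handled in the simplest way.

```python
from math import dist


def idw_parameters(target, entries, current, max_jump, power):
    """Interpolate synthesis parameters by Inverse Distance Weighting.

    target   -- m(c), a point in the reduced timbre space D*
    entries  -- list of (d_star, i) pairs from the analysis stage
    current  -- parameter vector currently driving the synthesizer
    max_jump -- only entries whose i lies within this Euclidean distance
                of `current` take part in the interpolation
    power    -- the IDW power parameter p
    """
    near = [(d, i) for d, i in entries if dist(i, current) <= max_jump]
    if not near:
        return list(current)  # assumption: hold the current state
    weights = []
    for d, i in near:
        distance = dist(target, d)
        if distance == 0.0:
            return list(i)  # exact hit: avoid division by zero
        weights.append(1.0 / distance ** power)
    total = sum(weights)
    return [sum(w * i[axis] for w, (_, i) in zip(weights, near)) / total
            for axis in range(len(current))]


# Hypothetical 2-D timbre space; the third entry is close in timbre
# but far away in the parameter space, so it is excluded.
entries = [
    ((0.0, 0.0), (0.0, 0.0)),
    ((1.0, 0.0), (10.0, 0.0)),
    ((0.5, 0.1), (100.0, 100.0)),
]
result = idw_parameters((0.5, 0.0), entries, current=(0.0, 0.0),
                        max_jump=20.0, power=3.0)
print(result)  # [5.0, 0.0]
```

Filtering by distance in the parameter space, rather than taking the N closest points in D*, is what bounds the jump between consecutive parameter sets and avoids the glitches described earlier.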

4. IMPLEMENTATION AND USAGE

The TSAM¹ is open-source software implemented in Max/MSP using the FTM extension² [30], supported by a background engine written and compiled in MATLAB.

The analysis of the synthesis timbre, the real-time timbre

space mapping and the visualizations are computed in

Max/MSP, whereas the background engine computes the

descriptor quality and the timbre space mapping (dimen-

sionality reduction, redistribution, ANN training), taking

as input the outcome of the analysis stage. The two com-

ponents of the system communicate via Open Sound

Control (OSC) protocol and large matrices are exchanged

using files. The TSAM can host software synthesizers developed using Steinberg’s Virtual Studio Technology (VST). It acts as a wrapper for VST synths, providing a

fully integrated environment. The TSAM allows full con-

trol of all parameters for analysis and mapping purposes.

It captures the synthesized signal for descriptor computation and playback, and manages the global state of the synthesizer when saving and restoring presets. Figure 3 shows a screenshot of the main TSAM GUI. This exposes a large number of options for further exploration of the mapping method we propose, and also for customizing analysis, mapping computation, real-time control, and visualization. Default settings are provided for basic use.

1 http://stefanofasciani.com/tsam.html

2 http://ftm.ircam.fr/

Users can load a VST synth and select up to 10 variable

parameters, their range, analysis step size, and the num-

ber of vectors m per state i. Advanced analysis options

include digital signal processing settings and analysis

timing with respect to the synthesis triggering (note-on

and note-off messages). The TSAM estimates and shows

the total analysis time, and users may opt to increase the parameter step sizes, in (3), when this is excessive.

Thereafter the analysis is carried out automatically. In

Section 2 we discussed two analysis modes, ‘sustain’ and

‘envelope’ respectively. These, besides the automatic

mode, can also be carried out manually. Users arbitrarily

tune the synthesizer to a specific state i, and request

the descriptor analysis of the related sonic response (both

modes are supported). Furthermore, we included the interactive ‘sustain’ analysis mode [7], where descriptor vectors d are computed while users vary the MIDI-mapped synthesis parameters in real time, dynamically generating a stream of i. The latter analysis mode does not guarantee an identical number of descriptor vectors d per state i, hence the noisiness in the quality metric may be inconsistent.

When the analysis stage is completed, users can request

the computation of the descriptor quality metric, which is

visualized in the TSAM as shown in Figure 4. In the de-

scriptors page, users can also specify the weights of

Equation (12), enable the normalization of its compo-

nents, find and rank the descriptors by highest score, ob-

serve the synthesis parameter ranking, and find the high-

est and lowest correlation between each parameter and

descriptor. Furthermore, users can specify which subset

of the 108 descriptors will be used for mapping purposes.

Options for the timbre space mapping computation in-

clude the dimensionality of the map, selection of the di-

mensionality reduction technique and the ANN activation

function. The mapping can be tuned at runtime using the

settings discussed in Section 3. The timbre analysis, qual-

ity metric, and mapping are saved into files that can be

individually recalled through the TSAM presets.

5. DISCUSSION AND FUTURE WORK

We presented a generic tool that integrates functionalities

to study and map the timbre of sound synthesizers. Pre-

liminary studies demonstrated that the adoption of large

sets of descriptors, and their selection based on the novel

quality metric, improves the accuracy of the timbre-based

interaction. The TSAM can be used for the study of the

sonic response of synthesizers, for an explicit control of

timbral character, or for a reduction of the synthesis con-

trol space, exposing only a few perceptually relevant con-

trol dimensions. Previous user studies on a system with a

similar mapping approach demonstrated that synthesis

parameters become transparent to users [31], who are exclusively focused on the timbral interaction. Future work includes user studies with the TSAM to evaluate the

effectiveness of the timbre-based mapping, comparing it

against traditional and alternative approaches to sound

synthesis interaction, in performance and sound design

scenarios. Moreover, we will investigate the relevance of

different descriptor categories for a more perceptually

related sonic control.

Figure 3. TSAM main page, including options for analysis, mapping computation, real-time control, and visualization.

Figure 4. TSAM descriptor page, providing an insight into the timbre response and parameter relationship of the synth.

6. REFERENCES

[1] T. Wishart, On Sonic Art. Harwood Academic Pub-

lishers, 1996.

[2] S. McAdams and A. Bregman, “Hearing musical

streams,” Comput. Music J., vol. 3, no. 4, pp. 26–43,

60, 1979.

[3] J. C. Risset and D. Wessel, “Exploration of timbre

by analysis and synthesis,” Psychol. Music, pp. 113–

169, 1999.

[4] S. Fels, “Intimacy and embodiment: implications for

art and technology,” in Proc. of the 2000 ACM

workshops on Multimedia, 2000, pp. 13–16.

[5] E. R. Miranda and M. M. Wanderley, New digital

musical instruments: control and interaction beyond

the keyboard. A-R Editions, Inc., 2006.

[6] D. Arfib, J. M. Couturier, L. Kessous, and V. Ver-

faille, “Strategies of mapping between gesture data

and synthesis model parameters using perceptual

spaces,” Organ. Sound, vol. 7, no. 2, pp. 127–144,

Aug. 2002.

[7] S. Fasciani, “Interactive Computation of Timbre

Spaces for Sound Synthesis Control,” in Proc. of the

2nd Int. Symposium on Sound and Interactivity, Sin-

gapore, 2015.

[8] W. A. Sethares, Tuning, Timbre, Spectrum, Scale.

Springer Science & Business Media, 2005.

[9] S. Fasciani and L. Wyse, “Adapting general purpose

interfaces to synthesis engines using unsupervised

dimensionality reduction techniques and inverse

mapping from features to parameters,” in Proc. of

the 2012 Int. Computer Music Conf., Ljubljana, Slo-

venia, 2012.

[10] A. Zacharakis, K. Pastiadis, and J. D. Reiss, “An

Interlanguage Unification of Musical Timbre,” Mu-

sic Percept. Interdiscip. J., vol. 32, no. 4, pp. 394–

412, Apr. 2015.

[11] G. Peeters, “A Large Set of Audio Features for

Sound Description (Similarity and Classification) in

the Cuidado Project,” IRCAM, 2004.

[12] D. Wessel, “Timbre space as a musical control struc-

ture,” Comput. Music J., vol. 3, no. 2, pp. 45–52,

1979.

[13] A. Lazier and P. R. Cook, “Mosievius: feature driv-

en interactive audio mosaicing,” in Proc. of the 7th

Int. Conf. on Digital Audio Effects, Napoli, Italy,

2003.

[14] M. Puckette, “Low-dimensional parameter mapping

using spectral envelopes,” in Proc. of the 2004 Int.

Computer Music Conf., Miami, US, 2004.

[15] C. Nicol, S. A. Brewster, and P. D. Gray, “Designing

Sound: Towards a System for Designing Audio Interfaces

using Timbre Spaces,” in Proc. of the 10th

Int. Conf. on Auditory Display, Sydney, Australia,

2004.

[16] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton,

“Real-time corpus-based concatenative synthesis

with CataRT,” in Proc. of the 9th Int. Conf. on

Digital Audio Effects, Montreal, Canada, 2006, pp.

279–282.

[17] M. Hoffman and P. R. Cook, “Feature-based synthe-

sis: Mapping acoustic and perceptual features onto

synthesis parameters,” in Proc. of the 2006 Int.

Computer Music Conf., New Orleans, US, 2006.

[18] N. Schnell, M. A. S. Cifuentes, and J. P. Lambert,

“First steps in relaxed real-time typo-morphological

audio analysis/synthesis,” in Proc. of the 7th

Sound and Music Computing Int. Conf., Barcelona,

Spain, 2010.

[19] T. Grill, “Constructing high-level perceptual audio

descriptors for textural sounds,” in Proc. of the 9th

Sound and Music Computing Int. Conf., Copenha-

gen, Denmark, 2012.

[20] A. Seago, “A New Interaction Strategy for Musical

Timbre Design,” in Music and Human-Computer In-

teraction, S. Holland, K. Wilkie, P. Mulholland, and

A. Seago, Eds. Springer, 2013, pp. 153–169.

[21] A. Pošćić and G. Kreković, “Controlling a sound

synthesizer using timbral attributes,” in Proc. of the

10th Sound and Music Computing Int. Conf., Stock-

holm, Sweden, 2013.

[22] N. Klügel, T. Becker, and G. Groh, “Designing

Sound Collaboratively - Perceptually Motivated Audio

Synthesis,” in Proc. of the 14th Int. Conf. on

New Interfaces for Musical Expression, London,

United Kingdom, 2014, pp. 327–330.

[23] S. Ferguson, “Using Audio Feature Extraction for

Interactive Feature-Based Sonification of Sound,” in

Proc. of the 21st Int. Conf. on Auditory Display

(ICAD 2015), Graz, Austria, 2015.

[24] S. Stasis, R. Stables, and J. Hockman, “A Model For

Adaptive Reduced-Dimensionality Equalisation,” in

Proc. of the 18th Int. Conf. on Digital Audio Effects

(DAFx-15), Trondheim, Norway, 2015.

[25] X. Serra and J. Smith, “Spectral Modeling Synthesis:

A Sound Analysis/Synthesis System Based on a De-

terministic Plus Stochastic Decomposition,” Com-

put. Music J., vol. 14, no. 4, pp. 12–24, 1990.

[26] T. Jehan and B. Schoner, “An audio-driven percep-

tually meaningful timbre synthesizer,” in Proc. of

the 2001 Int. Computer Music Conf., Havana, Cuba,

2001.

[27] L. J. P. Van Der Maaten, E. O. Postma, and H. J.

Van Den Herik, “Dimensionality reduction: a com-

parative review,” Tilburg University Technical Re-

port, 2009.

[28] H. Nguyen, J. Burkardt, M. Gunzburger, L. Ju, and

Y. Saka, “Constrained CVT meshes and a compari-

son of triangular mesh generators,” Comput. Geom.,

vol. 42, no. 1, pp. 1–19, Jan. 2009.

[29] G. B. Huang, Q. Y. Zhu, and C. K. Siew, “Extreme

learning machine: Theory and applications,” Neuro-

computing, vol. 70, no. 1–3, pp. 489–501, Dec.

2006.

[30] N. Schnell, R. Borghesi, D. Schwarz, F. Bevilacqua,

and R. Muller, “FTM - Complex Data Structure for

Max,” in Proc. of the 2005 Int. Computer Music

Conf., Barcelona, Spain, 2005.

[31] S. Fasciani, “Voice-controlled interface for digital

musical instruments,” Ph.D. Thesis, National Uni-

versity of Singapore, Singapore, 2014.