NON-LINEAR SCALING TECHNIQUES FOR UNCOVERING
THE PERCEPTUAL DIMENSIONS OF TIMBRE
John Ashley Burgoyne and Stephen McAdams
Centre for Interdisciplinary Research in Music and Media Technology
Schulich School of Music of McGill University
555 Sherbrooke Street West
Montréal, Québec, Canada H3A 1E3
{ashley,smc}@music.mcgill.ca
ABSTRACT
Seeking to identify the constituent parts of the multidi-
mensional auditory attribute that musicians know as tim-
bre, music psychologists have made extensive use of mul-
tidimensional scaling (MDS), a statistical technique for
visualising the geometric spaces implied by perceived dis-
similarity. MDS is also well known in the machine learn-
ing community, where it is used as a basic technique for
dimensionality reduction. We adapt a popular non-linear
variant of MDS in machine learning, Isomap, for use in
analysing psychological data and re-analyse an earlier ex-
periment on human perception of timbre. Isomap is de-
signed to be better than linear MDS at maintaining the
local relationships of each data point with its neighbours,
and our results show that it can produce a more musically
intuitive timbre space without sacrificing correlation with
those aspects of timbre perception that have already been
discovered. Future work should explore the new timbral
dimensions uncovered by our algorithm.
1. INTRODUCTION
As any computer musician knows, timbre is one of the
most important compositional parameters, and yet it re-
mains one of the most under-theorised. Part of the rea-
son for the relative lack of theory may be due to the fact
that, unlike pitch, timbre is a multi-dimensional auditory
attribute, and it is difficult to draw general conclusions
about timbre as a whole without first identifying its con-
stituent parts. Nonetheless, there have been a number of
attempts to uncover the underlying dimensionality of tim-
bre space over the past few decades, most based on per-
ceptual experiments with synthesised and recorded tones.
Early experiments with synthetic tones identified spectral
centroid and attack time as primary components of tim-
bre, in addition, at times, to a third dimension that was
more difficult to interpret and dependent on the stimulus
set [4, 5]. Later studies with more sophisticated models
came to similar conclusions, and suggested that the third
component might be a measure of irregularity in the spec-
tral envelope [6, 7]. A recent confirmatory study has veri-
fied these interpretations [2].
All of these studies are based on a statistical technique
known as multidimensional scaling (MDS) [9]. The ba-
sic idea of MDS is to take the set of proximities between
all members of some set of data points, e.g., sample tim-
bres, and to model them as distances in a Euclidean space
of as few dimensions as possible. In the context of timbre,
these proximities are usually taken from psychological ex-
periments in which human subjects have rated their per-
ception of the (dis)similarity between timbre pairs. The
trouble with MDS in this context is that its classical form
was designed to interpret a single set of dissimilarities
among items, not the average over all subjects of an ex-
periment. The first robust solution to this problem was the
INDSCAL algorithm [3], which models a special weight
on each dimension for each subject in the experiment in
order to better fit model distances to the set of empiri-
cal dissimilarities. The more sophisticated CLASCAL al-
gorithm reduces the number of parameters in INDSCAL
by modelling weights not for individual subjects but for a
smaller number of aggregate subject groups, called latent
classes [11]. CLASCAL and its variants are the standard
techniques for analysing timbre spaces today.
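The basic idea of MDS described above, modelling proximities as distances in a low-dimensional Euclidean space, can be sketched in its classical (Torgerson) form [9]. This is only an illustration of single-matrix MDS, not of INDSCAL or CLASCAL; the function name `classical_mds` and the toy stimuli are assumptions made for the example.

```python
import numpy as np

def classical_mds(dissim, n_dims):
    """Classical (Torgerson) MDS for one symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-centre the squared dissimilarities to obtain an inner-product matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # The leading eigenvectors, scaled by the root eigenvalues, give coordinates.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data (hypothetical): four stimuli at the corners of a unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dissim = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dissim, n_dims=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered configuration may be rotated or reflected relative to the original points, but its pairwise distances match the input dissimilarities, which is all that MDS aims to preserve.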
There is another problem with these techniques, how-
ever: being linear, they consider all distances estimated by
the human subjects to be equally reliable and of equal rel-
ative scale. Although some relatively straightforward ex-
tensions to MDS can treat the latter problem, e.g., CON-
SCAL [12], the former requires more aggressive modifica-
tions. One such modification, known as Isomap, replaces
large distances in the original distance matrices with so-
called geodesic distances along a hypothetical manifold
[8]. Previous papers at this conference have demonstrated
that Isomap and its relatives can uncover meaningful mu-
sical relationships that traditional linear MDS will always
miss [1].
This paper combines the CLASCAL and Isomap mod-
els to re-analyse the data from a major study of timbre [7].
Section 2 provides a more detailed explanation of these
algorithms and best practices for interpreting their results.
Section 3 presents the results of our new scaling and com-
pares them to the original study. Section 4 concludes with
suggestions for future applications of non-linear scaling to
the study of musical timbre.
2. CLASCAL AND ISOMAP
2.1. CLASCAL
Traditional MDS was designed to handle a single set of
pairwise proximities only. A number of models have been
presented to adapt MDS for multiple-subject experiments,
of which the most important for studying timbre has been
CLASCAL [11]. The CLASCAL model seeks to min-
imise the approximation error in the following equation:
$$d_{ijk} \approx \left[ \sum_{r=1}^{R} w_{C(i),r} \, (x_{jr} - x_{kr})^{2} \right]^{1/2} \qquad (1)$$
where $d_{ijk}$ is the dissimilarity rating that subject $i$ assigned to stimulus pair $(j,k)$, $R$ is the number of dimensions in the output set, $w_{C(i),r}$ is a special weight for the so-called latent class $C(i)$ to which CLASCAL has assigned subject $i$, and $x_{jr}$ and $x_{kr}$ are the $r$-th components of the output vectors for stimuli $j$ and $k$. Latent classes are meant to represent groups of subjects who pursue similar rating strategies. The number of latent classes used is a compromise between over-parametrisation, e.g., the INDSCAL model, which assigns each subject to its own class, and over-generalisation, e.g., ignoring differences between subjects by taking the average over all dissimilarity matrices. A Monte-Carlo likelihood-ratio technique is used to determine the optimal number of classes.
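To make the role of the class weights in Equation 1 concrete, the following sketch evaluates the model distance for a single stimulus pair under two different weight vectors. The function name `clascal_distance`, the coordinates, and the weights are hypothetical and chosen only so that the arithmetic is easy to follow; fitting the weights and coordinates to data is not shown.

```python
import numpy as np

def clascal_distance(x_j, x_k, class_weights):
    """Model distance of Equation 1 for one latent class and one stimulus pair."""
    diff = np.asarray(x_j, dtype=float) - np.asarray(x_k, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))

# Hypothetical two-dimensional output vectors for stimuli j and k.
x_j = [0.0, 0.0]
x_k = [3.0, 4.0]

# A class weighting both dimensions equally sees the plain Euclidean distance.
unweighted = clascal_distance(x_j, x_k, [1.0, 1.0])

# A class that ignores the second dimension sees only the first-dimension gap.
first_dim_only = clascal_distance(x_j, x_k, [1.0, 0.0])
```

The same pair of stimuli is thus 5 units apart for the first hypothetical class and 3 units apart for the second, which is how the model lets groups of subjects share one spatial configuration while stretching its dimensions differently.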
Another problem with traditional MDS is that it as-
sumes all of the variance in a data set can be explained by
dimensions common to all stimuli. This assumption does
not hold for timbres: many include instrument-specific
components such as the sound of the returning hopper in a
harpsichord. A more sophisticated version of CLASCAL
separates these components, known as specificities, using
the following model:
$$d_{ijk} \approx \left[ \sum_{r=1}^{R} w_{C(i),r} \, (x_{jr} - x_{kr})^{2} + v_{C(i)} \, (s_{j} + s_{k}) \right]^{1/2} \qquad (2)$$
where $s_j$ and $s_k$ are the specificities for stimuli $j$ and $k$ and $v_{C(i)}$ represents the weight subjects in class $C(i)$ give to specificities when distinguishing timbres [7, 10].
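The extended model of Equation 2 can be evaluated for a whole stimulus set at once. The sketch below is a minimal illustration under assumed inputs: the function name `extended_model_distances` and all numerical values are hypothetical, and setting the diagonal to zero is an assumption made here because self-dissimilarities are not rated.

```python
import numpy as np

def extended_model_distances(coords, weights, specificities, spec_weight):
    """Model distances of Equation 2 for one latent class, for all stimulus pairs."""
    x = np.asarray(coords, dtype=float)          # shape (n_stimuli, R)
    w = np.asarray(weights, dtype=float)         # shape (R,)
    s = np.asarray(specificities, dtype=float)   # shape (n_stimuli,)
    diff = x[:, None, :] - x[None, :, :]
    common = np.sum(w * diff ** 2, axis=-1)      # weighted common dimensions
    specific = spec_weight * (s[:, None] + s[None, :])
    dist = np.sqrt(common + specific)
    np.fill_diagonal(dist, 0.0)                  # assumption: no self-dissimilarity
    return dist

# Hypothetical configuration: three stimuli, two common dimensions.
coords = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
weights = [1.0, 1.0]
specificities = [5.0, 6.0, 0.0]

with_spec = extended_model_distances(coords, weights, specificities, spec_weight=1.0)
without_spec = extended_model_distances(coords, weights, specificities, spec_weight=0.0)
```

With a specificity weight of zero the model reduces to Equation 1; with a positive weight, stimuli that carry large specificities are pushed further from every other stimulus without changing their coordinates on the common dimensions.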
2.2. Isomap
Isomap arose as a solution to the problem of dimensional-
ity reduction for data sets like the famous ‘Swiss roll’ pic-
tured in Figure 1 [8]. Looking at the plot, it is obvious to
a human that the data are arranged on a two-dimensional
plane that has been coiled and presented in three dimen-
sions. This fact is not obvious to traditional MDS, which
strives to preserve every pairwise distance in the set, in-
cluding those between the ends of the roll and the inner or
outer loops. The ingenious solution in Isomap is to throw
away all pairwise distances in the set except those at the
local level, i.e., those in a small region immediately sur-
rounding each point in the data set. These regions can be
selected as a fixed number k of the nearest neighbours to each point in the data set or as those points that fall within a sphere of fixed radius ε around each point. The other distances are then recomputed using an all-pairs shortest-path algorithm, yielding an approximation of the so-called geodesic distances, i.e., distances measured along the lower-dimensional manifold. After these approximate distances are computed, traditional MDS is applied.

(a) Embedded in 3-D (b) Unrolled in 2-D
Figure 1: The ‘Swiss roll’ data set. On the left, the data are presented in their original form. On the right, the data are presented as they should be unrolled for human interpretation. Traditional MDS can never arrive at this solution, however, because it seeks to preserve the distances between the ends of the roll and the inner/outer loops.
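The neighbourhood and shortest-path steps of Isomap described in this section can be sketched as follows. This is a minimal illustration using the k-nearest-neighbour option and a Floyd-Warshall relaxation, not the implementation used for the results in this paper; the function name `geodesic_distances` and the bent chain of toy points are assumptions.

```python
import numpy as np

def geodesic_distances(dissim, k):
    """Approximate geodesic distances from the k-nearest-neighbour graph."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Start with no edges: every non-local distance is discarded.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        # Keep only the k nearest neighbours of point i (excluding i itself).
        neighbours = [j for j in np.argsort(d[i]) if j != i][:k]
        for j in neighbours:
            graph[i, j] = d[i, j]
            graph[j, i] = d[i, j]  # keep the graph symmetric
    # Floyd-Warshall: allow paths through each intermediate point m in turn.
    for m in range(n):
        graph = np.minimum(graph, graph[:, [m]] + graph[[m], :])
    return graph

# Toy data (hypothetical): five points along an L-shaped path.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
euclid = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
geo = geodesic_distances(euclid, k=2)
```

For the two ends of the path, the straight-line distance is about 2.83, whereas the geodesic estimate follows the bend and gives 4. Feeding `geo` rather than `euclid` to classical MDS is what allows the path to be straightened out.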
At first glance, the Swiss roll appears to be a fundamentally different problem from that of estimating timbre spaces. There is little reason to believe that human subjects would wilfully twist their ratings of the similarities between timbre pairs into more dimensions than
are already present. The larger message of Isomap, how-
ever, is that unless a space is perfectly linear, large dis-
tances in a scaling model can mask important structures
in the data. It seems prudent to check for such structures
in psychological data, and because Isomap is based on
classical MDS, unlike a number of other non-linear scaling
techniques, it lends itself naturally to combination with
CLASCAL. Each subject’s dissimilarity matrix is processed
according to the Isomap algorithm up to the final MDS
step. After this pre-processing is complete, the new dis-
similarity matrices are fed to CLASCAL as usual.
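The per-subject pre-processing described above might look like the following sketch. It uses the fixed-radius neighbourhood option rather than k nearest neighbours, and the function names, the radius, and the two small rating matrices are all hypothetical. The CLASCAL fit that would follow is not shown.

```python
import numpy as np

def radius_geodesics(dissim, radius):
    """Keep dissimilarities up to `radius`; replace the rest by shortest paths."""
    d = np.asarray(dissim, dtype=float)
    graph = np.where(d <= radius, d, np.inf)
    np.fill_diagonal(graph, 0.0)
    for m in range(graph.shape[0]):  # Floyd-Warshall relaxation
        graph = np.minimum(graph, graph[:, [m]] + graph[[m], :])
    return graph

def preprocess_subjects(subject_matrices, radius):
    """Apply the geodesic step to each subject's matrix independently."""
    return np.stack([radius_geodesics(m, radius) for m in subject_matrices])

# Two hypothetical subjects rating three stimuli on an arbitrary scale.
ratings = np.array([
    [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [5.0, 2.0, 0.0]],
    [[0.0, 2.0, 3.0], [2.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
])
processed = preprocess_subjects(ratings, radius=2.5)
```

In both hypothetical matrices the large rating between the first and third stimuli exceeds the radius and is replaced by the path through the second stimulus (3 for the first subject, 4 for the second), while the small ratings are left untouched. Processing each subject separately keeps individual rating strategies intact for the latent-class analysis.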
3. EXPERIMENTS AND RESULTS
We did not perform a new perceptual study for this paper,
but rather re-examined data from McAdams et al.’s 1995
study of 88 subjects [7]. We chose to examine the judge-
ments of the professional musicians in the study only, 24
in all, in order to simplify our analysis. The timbres used
in the study overlapped considerably with those used in
[6], which were a set of recorded, FM-synthesised tim-
bres designed to mimic traditional musical instruments.
In addition to 12 timbres from this set, McAdams et al.,
following the legacy of [5], also synthesised 6 hybrid tim-
bres, e.g., the oboleste, a combination of the perceptual
features of oboe and celesta sounds. Each subject had an
opportunity to rate the dissimilarity between all 153 pairs
of these 18 timbres.
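The figure of 153 pairs follows from counting the unordered pairs of 18 distinct timbres, as this short check illustrates; the variable names are chosen for the example only.

```python
from itertools import combinations
from math import comb

# 12 timbres from the earlier set plus 6 hybrid timbres.
n_timbres = 12 + 6

# Every unordered pair of distinct timbres is rated once.
pairs = list(combinations(range(n_timbres), 2))
n_pairs = len(pairs)

# Closed form: n * (n - 1) / 2.
n_pairs_closed_form = comb(n_timbres, 2)
```

Both counts equal 153, so each subject's ratings fill the upper triangle of an 18 × 18 dissimilarity matrix.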
                    Rise    S.C.    Flux    Isomap (non-linear)
                                            Dim. 1  Dim. 2  Dim. 3
Rise                                        -0.87   -0.45   -0.72
S.C.                                         0.03   -0.79    0.07
Flux                                        -0.15    0.18    0.22
Linear  Dim. 1     -0.94   -0.00    0.02     0.92    0.42    0.82
        Dim. 2      0.44   -0.14    0.10    -0.35    0.18   -0.67
        Dim. 3      0.30    0.09    0.49    -0.53   -0.21    0.24
        Dim. 4      0.26    0.91   -0.15    -0.15   -0.88   -0.16
Table 1: Correlation coefficients of significant dimensions in the linear and Isomap models against each other and log rise (attack) time, spectral centroid (S.C.), and spectral flux. Values in bold are significant at p=0.005; values in italic are significant at p=0.1.
3.1. Dimensionality and class membership
As the authors of CLASCAL recommend, we used the
standard Bayesian information criterion (BIC) to estab-
lish the dimensionality of the MDS spaces and the special
Monte-Carlo technique mentioned above to determine the
number of latent subject classes. We found that the tra-
ditional linear model produces a four-dimensional space
when including specificities, which, surprisingly, is of higher
dimensionality than the space uncovered in [7] for a larger
set of subjects. The reason for this difference is likely that
only three latent classes are necessary to get the best fit for
our data as opposed to the five necessary for the larger set.
The Isomap model was able to reduce the space back to
three dimensions with the same number of latent classes,
although the class membership differs.
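Model selection with the BIC amounts to picking the candidate with the lowest value of k ln n − 2 ln L, where k is the number of free parameters, n the number of observations, and L the maximised likelihood. The sketch below shows only this comparison; the log-likelihoods, parameter counts, and number of observations are purely hypothetical and are not taken from the analysis in this paper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Purely hypothetical fits: dimensionality -> (log-likelihood, free parameters).
candidates = {
    2: (-1300.0, 40),
    3: (-1200.0, 58),
    4: (-1195.0, 76),
}
n_obs = 1000  # hypothetical number of dissimilarity ratings

scores = {dims: bic(ll, k, n_obs) for dims, (ll, k) in candidates.items()}
best_dims = min(scores, key=scores.get)
```

In this made-up example the fourth dimension improves the likelihood too little to pay for its extra parameters, so the three-dimensional candidate is selected.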
The lower right-hand corner of Table 1 presents the
correlation coefficients between the dimensions of the lin-
ear and Isomap models. The leading dimensions are very
strongly correlated (r=0.92). The second dimension of
the linear space, however, correlates best with the trail-
ing dimension of the non-linear space, and the trailing di-
mension of the linear space correlates strongly with the
second dimension of the non-linear space. Because the
dimensions are ordered according to their perceptual im-
portance, this last result is somewhat surprising: although
[7] employs the same linear model as ours, our non-linear
model yields a better match to their results. In a further
wrinkle, the leading dimension of the linear model corre-
lates better with the trailing dimension of the non-linear
model than the second does.
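Comparisons of this kind rest on the Pearson correlation between every dimension of one configuration and every dimension of the other. The following sketch shows the computation on hypothetical coordinates in which the two configurations carry the same information in a different dimension order; the function name `cross_correlations` and the data are assumptions for illustration.

```python
import numpy as np

def cross_correlations(config_a, config_b):
    """Pearson correlations between the columns of two stimulus configurations."""
    a = np.asarray(config_a, dtype=float)
    b = np.asarray(config_b, dtype=float)
    # Columns are treated as variables; rows are stimuli shared by both spaces.
    full = np.corrcoef(a, b, rowvar=False)
    n_a = a.shape[1]
    return full[:n_a, n_a:]  # rows: dimensions of a, columns: dimensions of b

# Hypothetical coordinates for four stimuli in two spaces.
space_a = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 5.0]])
space_b = np.column_stack([-2.0 * space_a[:, 1], space_a[:, 0] + 10.0])

corr = cross_correlations(space_a, space_b)
```

Here the first dimension of `space_a` correlates perfectly with the second dimension of `space_b`, and the second dimension of `space_a` correlates perfectly but negatively with the first of `space_b`, so a strong off-diagonal entry signals a reordering of dimensions rather than a disagreement between the spaces.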
The table also presents correlations with McAdams et
al.’s acoustic correlates of the dimensions of timbre space:
(log) rise, or attack, time, the spectral centroid, and spec-
tral flux, i.e., a measure of change in the spectral shape
throughout the duration of the stimulus. The first two
dimensions of the non-linear space correlate with attack
time and spectral centroid; in keeping with the relation-
ship between the dimensions of the linear and non-linear
models, it is the leading and trailing dimensions of the
linear model that exhibit similar correlations. Consistent
with the findings of [7], spectral flux correlates only weakly
with the model dimensions. The problematic trailing dimension of the non-linear space again behaves unexpectedly, correlating with log rise time at a level of significance (p=0.005) comparable to that of the leading dimension.

Instrument     Abbreviation       Lin.   N.-L.
French horn    hrn                0.83   2.75
Trumpet        tpt                0.47   1.56
Trombone       trn                2.01   1.80
Harp           hrp                1.29   0.87
Trumpar        tpr = tpt + gtr    1.93   1.73
Oboleste       ols = obo + cel    1.65   2.46
Vibraphone     vbs                1.74   3.43
Striano        sno = str + pno    1.99   1.78
Harpsichord    hcd                2.56   3.55
English horn   can                1.23   2.85
Bassoon        bsn                1.50   0.77
Clarinet       cnt                2.10   4.22
Vibrone        vbn = vbs + trn    2.12   2.36
Obochord       obc = obo + hcd    0.00   3.53
Guitar         gtr                1.53   2.47
Strings        stg                1.35   1.76
Piano          pno                2.20   2.71
Guitarnet      gtn                1.26   2.98
Table 2: Square roots of the specificity values for the linear (Lin.) and non-linear (N.-L.) models. Names of hybrid timbres are printed in italic.
These coefficients confirm the commonly reported result that attack time is a dominating component of the perception of timbre and that spectral centroid is also a significant component. Like previous studies, however, our results leave room for at least one further component that could explain the trailing dimension of our non-linear space or the second and third dimensions of our linear one.
Spectral flux, the explanatory power of which has already
been called into question [2], fares notably poorly.
3.2. Outliers and specificities
A somewhat surprising result emerged when examining
our data for outliers. Although no outliers were detected
in the untransformed data, an analysis of the latent class
assignments revealed one outlier after the transformation.
Fundamentally, the Isomap transformation emphasises the
effect of fine-grained distinctions between those timbres a
subject perceives as fairly similar and discounts the effect
of coarse-grained distinctions between timbres a subject
perceives as largely dissimilar. Thus, this outlying subject
uses comparable strategies to other subjects’ when mak-
ing coarse distinctions but a unique strategy to make fine
distinctions. It bears further study to determine exactly
how this strategy differs from the others.
Although the outlier situation and class membership
differ between the two models overall, they appear to have
one latent class in common, i.e., with the same membership. Looking at the weights $w_{C(i),r}$ and $v_{C(i)}$, one can see
that the defining characteristic of this common group is
an emphasis on specificities when making timbre similar-
ity judgements. Given this result, one would expect that
the specificity values would be similar between the two
models, but this is not the case. Table 2 presents these values for both models. One can see that specificity values are higher for the non-linear model in general and that the correlation is poor (r=0.23).

(a) Linear MDS (b) Isomap [labelled axes in both panels: attack time, spectral centroid]
Figure 2: Two timbre spaces based on [7], one generated with linear MDS and the other generated with Isomap on averaged subject spaces. The unlabelled axes correspond to dimensions 2 and 3 in the respective CLASCAL spaces.
3.3. Timbre spaces
The principal advantage of the non-linear model is that it
can preserve spatial structure with fewer dimensions. Fig-
ure 2 presents the complete non-linear space and dimen-
sions 1, 2, and 4 of the linear space (so as to preserve the
dimensions that correlate with the non-linear model). The
third axes are left blank because it is difficult to interpret
the meaning of these dimensions. As one would expect
from the strong correlations in Table 1, the spatial groups
of the models are similar, although three-dimensional ro-
tation illustrates that the points in the non-linear space are
clustered more tightly.
4. SUMMARY AND FUTURE WORK
When applied to experiments on human timbre percep-
tion, Isomap appears to retain the most desirable aspects
of the global structure of linear models while tightening
the local structure of the resulting timbre space. Our re-
sults raise interesting questions about differing strategies
subjects use when distinguishing highly similar vs. highly
dissimilar timbres and suggest that at least one dimension of human timbre perception remains for the community to interpret.
5. REFERENCES
[1] J. A. Burgoyne and L. K. Saul, “Visualization of
low-dimensional structure in tonal pitch space,” in
Proc. Int. Comp. Mus. Conf., 2005, pp. 243–46.
[2] A. Caclin, S. McAdams, B. K. Smith, and S. Winsberg,
“Acoustic correlates of timbre space dimensions: A confir-
matory study using synthetic tones,” J. Acoust. Soc. Am.,
vol. 118, no. 1, pp. 471–82, 2005.
[3] J. D. Carroll and J.-J. Chang, “Analysis of individual dif-
ferences in multidimensional scaling via an n-way general-
ization of ‘Eckart-Young’ decomposition,” Psychometrika,
vol. 35, no. 3, pp. 283–319, 1970.
[4] J. M. Grey, “Multidimensional perceptual scaling of mu-
sical timbre,” J. Acoust. Soc. Am., vol. 61, pp. 1270–77,
1977.
[5] J. M. Grey and J. W. Gordon, “Perceptual effects of spec-
tral modifications on musical timbres,” J. Acoust. Soc. Am.,
vol. 63, no. 5, pp. 1493–1500, 1978.
[6] C. L. Krumhansl, “Why is musical timbre so hard to un-
derstand?” in Structure and Perception of Electroacoustic
Sound and Music, ser. Excerpta Medica, S. Nielzen and
O. Olsson, Eds. Amsterdam: Elsevier, 1989, no. 846.
[7] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and
J. Krimphoff, “Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent
subject classes,” Psych. Res., vol. 58, pp. 177–92, 1995.
[8] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A
global geometric framework for nonlinear dimensionality
reduction,” Science, vol. 290, pp. 2319–23, 22 December
2000.
[9] W. S. Torgerson, Theory and Methods of Scaling. New
York: Wiley, 1958.
[10] S. Winsberg and J. D. Carroll, “A quasi-nonmetric method
for multidimensional scaling via an extended Euclidean
model,” Psychometrika, vol. 54, no. 2, pp. 217–29, 1989.
[11] S. Winsberg and G. De Soete, “A latent class approach to
fitting the weighted Euclidean model, CLASCAL,” Psy-
chometrika, vol. 58, no. 2, pp. 315–30, 1993.
[12] ——, “Multidimensional scaling with constrained dimen-
sions: CONSCAL,” Br. J. Math. Stat. Psychol., vol. 50, pp.
55–72, 1997.