Analysis of Feature Dependencies in Sound Description

Article · May 2003
DOI: 10.1023/A:1022864925044 · Source: DBLP
Journal of Intelligent Information Systems, 20:3, 285–302, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Analysis of Feature Dependencies
in Sound Description
ALICJA A. WIECZORKOWSKA
Polish-Japanese Institute of Information Technology, ul. Koszykowa 2, 02-008 Warsaw, Poland
JAN M. ŻYTKOW
The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA
Received March 1, 2002; Revised and Accepted November 28, 2002
Abstract. Multimedia data, including sound databases, require signal processing and parameterization to enable
automatic searching for a specific content. Indexing of musical audio material with high-level timbre information
requires extraction of low-level sound parameters first. In this paper, we analyze regularities in musical sound
description, for the data representing musical instrument sounds by means of spectral and time-domain features.
We examined digital audio recordings of singular sounds for 11 instruments of definite pitch. Woodwinds, brass,
and strings used in contemporary orchestras were investigated, for various fundamental frequencies of sound and
articulation techniques. General-purpose data mining system Forty-Niner was applied to investigate dependencies
between the sound attributes, and the results of the experiments are presented and discussed. We also indicate a
broad range of possible industry applications, which may influence directions of further research in this domain.
We summarize our paper with conclusions on representation of musical instrument sound, and the emerging issue
of exploration of audio databases.
Keywords: music information retrieval, knowledge discovery in databases, sound recognition
1. Introduction
Huge amounts of digital audio data have become publicly available in recent years through the
World Wide Web, broadcast data streams, and databases on PCs. However, the manageability
of this growing volume is very low. Efficient searching for user-specified audio
content and easy access to it are crucial for the usefulness of applications like multimedia editing,
digital libraries, and so on. To enable this, labeling of audio content must be feasible, and
structures for audio classification must be provided. There are several commercial audio
database management systems, but the design and implementation of content-based audio
archives is just emerging (Subrahmanian, 1998). Content-based retrieval of information
from audio databases is still a relatively unexplored field.
Any automatic classification and labeling of sounds is practically impossible to perform
on the basis of raw sound data. As preprocessing, sounds should be parameterized. The
created set of parameters (descriptors) should match the recognition task, i.e., different
descriptors will be necessary to classify room acoustics, musical style, melody, harmonics,
musical instruments, and so on. The parameters allow us to classify sounds coming from
various recordings, prepared in various recording conditions and in various sound formats.
In this paper, we focus on classification of musical instrument sounds.
Automatic classification and labeling of sounds with information on the musical instrument
playing in a given excerpt can be performed at various levels. We can classify the instrument
itself, like bassoon, trombone, violin, etc. But we can also label the audio sample with
the instrument family, like woodwinds, brass, strings, percussion, etc. This assignment can
be hierarchical, according to musical instrument classification criteria, with the Hornbostel
and Sachs classification being the most common (Hornbostel and Sachs, 1914). Instrument
sounds can also be grouped according to the articulation technique, like pizzicato, muting,
or vibrato.
Audio indexing is needed for classification of musical material in automatic transcription,
Internet search, and so on. Sound files labeled with timbre information can be useful in many
applications. The appropriately labeled data would form a good basis for content-based
searching for user-defined audio information.
2. Background
Research on automatic instrument sound classification has been progressing recently, and
various sound parameterization techniques have been applied as preprocessing of sound
data. The work starts with digitally recorded audio samples that represent instantaneous
values of the analog signal, i.e., amplitude data. Parameterization (feature extraction), based
on analysis of a sequence of samples, produces a more meaningful sound representation,
usually indicating frequency components of the sound (De Poli et al., 1991). The extracted
features derive from various sound analysis methods, like the Fourier transform or wavelet
analysis. Spectral parameters are usually based on the fast Fourier transform (FFT), where the
length of the analyzing frame is equal to 2^n for some natural n. FFT-based parameters include
spectral moments (Fujinaga and MacMillan, 2000), statistical properties of the spectrum (Ando
and Yamaguchi, 1993), contents of selected groups of partials in the sound spectrum (Pollard
and Jansson, 1982; Kostek and Wieczorkowska, 1997), inharmonicity, spectral envelope
(Martin and Kim, 1998), and others. The next group of parameters describes the time-domain
attributes of sound, i.e., features of the recorded waveform. Time-domain features include
low-level parameters, like the density of zero-crossings in time, or parameters of the envelope
of the whole sound. These time-related parameters include onset duration (Martin and
Kim, 1998; Eronen and Klapuri, 2000), amplitude envelope (Kaminskyj, 2000), and so
on. Parameterization can also be based on wavelet analysis coefficients (Wieczorkowska,
1999a; Kostek and Czyzewski, 2001), and on cepstral coefficients (Brown, 1999; Eronen and
Klapuri, 2000; Brown et al., 2001). For instance, Brown (1999) applied cepstral coefficients
based on the constant Q transform. Yet another sound representation uses multidimensional
scaling analysis (MSA) trajectories (Kaminskyj, 2000). There are also other parameters
applied for musical sound description (Wieczorkowska, 1999a; Herrera et al., 2000).
The sound attributes that may become common, thus facilitating comparison of results, are
the descriptors given in the MPEG-7 standard (ISO/IEC, 2002; Manjunath et al., 2002). This
standard comprises various low-level sound descriptors that can be useful for classification
purposes. The audio description framework in MPEG-7 includes 17 temporal and spectral
descriptors, which can be grouped as follows (ISO/IEC, 2002):
– basic: instantaneous waveform and power values
– basic spectral: log-frequency power spectrum envelopes, spectral centroid, spectrum spread, and spectrum flatness
– signal parameters: fundamental frequency and harmonicity of signal
– timbral temporal: log attack time and temporal centroid
– timbral spectral: spectral centroid, harmonic spectral centroid, harmonic spectral deviation, harmonic spectral spread, and harmonic spectral variation
– spectral basis representations: spectrum basis and spectrum projection
These features basically describe a segment with a summary value that applies to the entire
segment or with a series of sampled values; the values of descriptors from the timbral temporal
group apply only to segments as a whole. All these descriptors characterize the waveform and
spectrum of sound, as well as the evolution of sound in time, so they constitute a good basis
for efficient sound representation for sound classification purposes. Additionally, MPEG-7
anticipates indexing of sound with higher-level timbre information as well.
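For readers implementing such descriptors, the flavor of two of the groups above (basic spectral and timbral temporal) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the normative MPEG-7 definitions; the function names, thresholds, and the per-frame envelope input are our own assumptions.

```python
import numpy as np

def spectral_centroid(frame, fs):
    """Spectral centroid of one analysis frame (Hz): amplitude-weighted
    mean of the FFT bin frequencies (basic spectral group). Sketch only,
    not the normative MPEG-7 definition."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = spectrum.sum()
    return float((freqs * spectrum).sum() / total) if total else 0.0

def log_attack_time(envelope, fs, lo=0.02, hi=0.8):
    """Log attack time (timbral temporal group): log10 of the time the
    amplitude envelope takes to rise from lo to hi of its maximum.
    The lo/hi fractions are illustrative, not the standard's values."""
    env = np.asarray(envelope, dtype=float)
    peak = env.max()
    t_lo = int(np.argmax(env >= lo * peak))   # first frame above lo*peak
    t_hi = int(np.argmax(env >= hi * peak))   # first frame above hi*peak
    return float(np.log10(max(t_hi - t_lo, 1) / fs))
```

Both functions operate on a single frame or envelope; a full descriptor extractor would apply them over a sliding window.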
The set of sound descriptors is an input for classification algorithms, also investigated
in the research in this domain. Classification algorithms applied in the papers mentioned in
the previous paragraph include k-nearest neighbor (Martin and Kim, 1998; Fujinaga and
MacMillan, 2000; Kaminskyj, 2000), statistical methods (Martin and Kim, 1998), decision
trees, rough set based algorithms, neural networks (Wieczorkowska, 1999a), and others
(Herrera et al., 2000). The quality (accuracy) of classification differs depending on the
data set (audio samples), the parameterization applied, the classification method, and their settings.
The results published so far are hardly comparable, since the data sets and testing procedures
differ from experiment to experiment. For small data sets, representing 4 classes
(instruments) only, the recognition accuracy ranges from 79%–84% (Brown et al., 2001) to
90%–99% (Kostek and Czyzewski, 2001). Larger data sets produce lower results. For instance,
Wieczorkowska (1999a, 1999b) experimented with data representing 18 classes
(11 orchestral instruments, various articulation), obtaining about 78% correctness for testing
procedures including 70%/30% and 90%/10% splits. Martin and Kim (1998) obtained
approximately 72% recognition rate for 1023 isolated tones over the full pitch ranges of
14 orchestral instruments, testing the results with multiple 70%/30% splits. Kaminskyj
(2000) obtained 82% accuracy for 19 instruments and leave-one-out testing. Fujinaga and
MacMillan (2000) reported a 50% recognition rate for the 39-timbre group, representing 23
orchestral instruments, and a leave-one-out validation procedure. Eronen and Klapuri (2000)
recognized musical instruments with an 80% rate for 1498 samples representing 30 orchestral
instruments, played with various articulation techniques, using validation with 70%/30%
splits of training and test data.
As we mentioned in the previous section, musical instrument sounds can be classified
on various levels. Basically, sounds are classified at the instrument level, but the method of
sound production (articulation) or families of instruments are also considered (Martin and
Kim, 1998; Wieczorkowska, 1999a; Herrera et al., 2000; Eronen and Klapuri, 2000). The
recognition accuracy for families of instruments or articulation is higher than for single
instruments. Also, hierarchical classification, with pizzicato/sustained and family recognition
performed first, increases recognition at the instrument level. Distinction between
pizzicato and sustained sounds is very easy, close to faultless. Martin and Kim (1998) distinguished
between pizzicato and continuant sounds with almost 99% accuracy, and Eronen
and Klapuri (2000) also obtained 99%. This is because of the very distinctive time-domain
envelope, practically with transients only, with the ending one occupying the major part of the
pizzicato sound. Therefore, descriptors of the time-domain envelope should be placed in a feature
vector representing musical instrument sound. As far as the recognition of instrument
family is concerned, the accuracy usually ranges around 90%. For instance, Martin and Kim
(1998) identified 3 instrument families: string, woodwind and brass, with 86.9% accuracy,
and correct recognition for pizzicato vs. continuant tones reached 98.8%. Wieczorkowska
(1999a) obtained results exceeding 89% for the same 3 instrument families. Eronen and
Klapuri (2000) recognized these instrument families with 94% correctness. Of course, correct
identification of examples representing 3 classes only is much easier than identification
within about a dozen classes; therefore, a higher recognition rate for instrument families than
for particular instruments is obvious.
In this paper, we classify musical instrument sounds at various levels, additionally looking
into the inner structure of the applied parameterization using the Forty-Niner system (Żytkow and
Zembowicz, 1993). The parameterization is based on spectral and temporal analysis. We test
dependencies within conditional attributes, and also between decision and conditional ones.
The obtained results are useful in research on sound representation and automatic classification
of musical sound.
3. Searching for regularities in the data using the 49er system
Parametric representation of audio data can be based on any features outlined in Section 2,
and derived by various analysis methods. Additionally, many features can be calculated at
different time moments, giving us time-variant information about the data. Therefore, the
question is whether there are any internal (including temporal) dependencies within the
chosen parameterization of musical sounds. In particular, we are interested in searching for
dependencies between decision and conditional attributes in the database obtained through
the parameterization of musical instrument sounds, since such dependencies should facilitate
correct classification of the data. In search for possible regularities in that database, we
apply the Forty-Niner system (Żytkow and Zembowicz, 1993).
Forty-Niner (49er) is a general-purpose database mining system. 49er conducts a large-scale
search for regularities in subsets of data. Examples of regularities (or patterns) include
contingency tables, equations, and logical equivalences (Zembowicz and Żytkow, 1996).
If the data indicate a functional relationship, then a more costly search for equations is
conducted. The search can be applied to any relational table (data matrix). Initially, 49er
examines the contingency table for each pair of attributes, and then a search in the space
of equations is invoked for the regularities discovered. A contingency table shows the actual
distribution of values for a pair of attributes. Each entry in the table is equal to the number
of records that have the corresponding combination of values of both attributes. An exemplary
contingency table for strongly correlated attributes x and y is shown in Table 1.
Table 1. Contingency table for highly correlated attributes x and y of domains {1, 2, 3} and {a, b, c, d} (the body of the table is not recoverable from this copy).
Regularities are approved based on their significance, measured by the probability that
they are random fluctuations. This probability is determined based on the χ2 test, which
measures the distance between the tables of actual and expected counts in the following way:

χ2 = Σ_{i,j} (A_ij − E_ij)^2 / E_ij,   where E_ij = (n_xi · n_yj) / N

is the expected number of records with x = x_i and y = y_j; here x, y are the
parameters, x_i, y_j their values, n_xi and n_yj the marginal counts of records with
x = x_i and y = y_j respectively, N the total number of records, and A_ij the actual frequency.
Since χ2 depends on the size of the data set, Cramér's V coefficient is calculated to
abstract from the number of data:

V = sqrt( χ2 / (N · min(M_row − 1, M_col − 1)) ),

where M_row × M_col is the size of the contingency table.
For ideal correlation, Cramér's V = 1, and on the other extreme V = 0. In this paper,
we investigated correlations between various decision and conditional attributes, using
Cramér's V as a measure of their reciprocal dependency. Also, we present contingency
tables for the examined attributes, in order to illustrate details of these dependencies.
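The two formulas above translate directly into code. A minimal sketch (the function name is ours):

```python
import numpy as np

def cramers_v(table):
    """Chi-square statistic and Cramer's V for a contingency table,
    following the formulas above: E_ij = n_xi * n_yj / N,
    chi2 = sum over i,j of (A_ij - E_ij)^2 / E_ij,
    V = sqrt(chi2 / (N * min(M_row - 1, M_col - 1)))."""
    A = np.asarray(table, dtype=float)
    N = A.sum()
    E = np.outer(A.sum(axis=1), A.sum(axis=0)) / N   # expected counts
    mask = E > 0                                      # skip empty margins
    chi2 = float(((A - E)[mask] ** 2 / E[mask]).sum())
    v = float(np.sqrt(chi2 / (N * (min(A.shape) - 1))))
    return chi2, v
```

For a perfectly diagonal table V equals 1, and for a table of independent attributes V equals 0, matching the two extremes described above.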
4. Description of records in the database
The experiments described in this paper are based on musical instrument sound parameterization,
based on spectral and temporal analysis. The sounds used for database construction
come from a collection of CDs, prepared at McGill University (MUMS) for sampling
(Opolko and Wapnick, 1987). MUMS features stereophonic samples of musical instrument
sounds recorded chromatically within the standard playing range of instruments, with many
timbral variations. 679 audio samples of 16-bit stereo recording with 44.1 kHz sampling
frequency were made on the basis of these CDs (Wieczorkowska, 1999a). In the described
experiments, the following instruments have been chosen:
– bowed strings: violin, viola, cello and double bass, played vibrato and pizzicato;
– woodwinds: flute, oboe and B-flat clarinet;
– brass: trumpet, tenor trombone and French horn, played with and without muting, and tuba without muting.
Each record in the database corresponds to a singular sound of one instrument. The records
include 69 attributes; 62 of them are numerical. These 62 conditional attributes (parameters)
describe general properties of the sound, as well as specific properties of various parts of
the sound. The parameters describe both the time domain and the spectrum of sounds. The spectrum
was calculated by means of the Fourier transform, for an analyzing frame of at least 23 ms length,
containing an integer multiple of sound periods (at least 2), and a rectangular window function.
The most descriptive parts of the sound are the initial phase (the attack), when the sound
changes very rapidly, and the quasi-steady phase, when the sound is the most stable. The
endpoints of the quasi-steady state are extracted as the time moments when at least 75%
of the maximal sound amplitude is reached, and there are no amplitude variations within the
quasi-steady state exceeding 10% of the maximal amplitude.
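The endpoint rule just described can be sketched as follows, assuming a hypothetical per-frame amplitude envelope as input. Runs that fail the 10% ripple condition are simply discarded here, which is a simplification; the paper does not specify how such runs were handled.

```python
import numpy as np

def quasi_steady_state(envelope, rise=0.75, ripple=0.10):
    """Sketch of the endpoint rule above: the quasi-steady state is the
    longest stretch where the envelope stays at >= `rise` of its maximum
    and varies by no more than `ripple` of the maximum. Returns
    (first frame, one past last frame), or (0, 0) if no stretch qualifies."""
    env = np.asarray(envelope, dtype=float)
    peak = env.max()
    above = env >= rise * peak
    best = (0, 0)
    start = None
    for i, flag in enumerate(np.append(above, False)):  # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            seg = env[start:i]
            if seg.max() - seg.min() <= ripple * peak and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best
```

On a short synthetic envelope with a fast attack, a stable plateau, and a decay, the function returns the plateau boundaries.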
Since the attack is very important for correct classification by human experts, the calculated
parameters describe the beginning, the middle and the end of the attack. In the quasi-steady
state the sound also may undergo some fluctuation, increasing the artistic expression,
especially when it is played vibrato. Therefore, for the quasi-steady state the parameters
were calculated for the maximal and minimal amplitude of this stage of the sound. The
parameters calculated for the attack and for the quasi-steady state at the selected points are:
– nfv: number of the partial, nfv ∈ {1, ..., 5}, of the greatest frequency variation,
– fv: weighted mean frequency variation for the 5 lowest partials, where A_k is the amplitude and f_k the frequency of the k-th partial,
– Tr1: modified first Tristimulus parameter (Pollard and Jansson, 1982),
– A12: difference of amplitudes between the first partial (the fundamental) and the second, A12 = 20 log(A1/A2),
– H3,4, H5,6,7, H8,9,10, Hrest: contents of the selected groups of harmonics in the spectrum, e.g. H8,9,10 = Σ_{n=8,9,10} A_n^2 and Hrest = Σ_{n=11}^{N} A_n^2,
– Od: contents of odd harmonics in the spectrum, excluding the fundamental, Od = Σ_{k=2}^{N/2+1} A_{2k−1}^2,
– Ev: contents of even harmonics in the spectrum, Ev = Σ_{k=1}^{N/2} A_{2k}^2,
– Br: brightness of sound, Br = Σ_{n=1}^{N} n·A_n / Σ_{n=1}^{N} A_n.
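The group-content and brightness parameters above can be computed directly from a vector of partial amplitudes. A sketch follows; the helper name is ours, and since any normalization used in the original parameterization is not fully recoverable from the text, raw sums of squared amplitudes are used for the group contents.

```python
import numpy as np

def spectral_params(A):
    """Sketch of the group-content and brightness parameters defined
    above, from a vector A of partial amplitudes, A[0] = A_1, ..., A[N-1] = A_N.
    Group contents are raw sums of squared amplitudes (normalization,
    if any, is an open assumption)."""
    A = np.asarray(A, dtype=float)
    N = len(A)
    e = A ** 2
    return {
        "H3,4":    float(e[2:4].sum()),     # partials 3-4
        "H5,6,7":  float(e[4:7].sum()),     # partials 5-7
        "H8,9,10": float(e[7:10].sum()),    # partials 8-10
        "Hrest":   float(e[10:].sum()),     # partials 11..N
        "Od": float(e[2::2].sum()),         # odd partials 3, 5, ... (fundamental excluded)
        "Ev": float(e[1::2].sum()),         # even partials 2, 4, ...
        "Br": float((np.arange(1, N + 1) * A).sum() / A.sum()),
        "A12": float(20 * np.log10(A[0] / A[1])),
    }
```

For a spectrum with amplitudes A_n = 1/n, for example, A12 comes out as 20·log10(2) ≈ 6.02 dB.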
General properties of the whole sound are described by the following parameters:
– Vb: depth of vibrato, Vb = |f1max − f1min|, where f1max, f1min are the fundamental frequency in the quasi-steady state for maximal and minimal amplitude respectively,
– f1: fundamental frequency in the middle of the sound [Hz],
– dfr: approximate fractal dimension of the graph of the spectrum amplitude envelope in decibel scale (Lubniewski and Stepnowski, 1998),

dfr = −log N(s) / log s,

where s is the mesh length of the grid that covers the plane where the graph is drawn (here s = 10^−10) and N(s) is the number of nonempty meshes, for sampling frequency fs = 44.1 kHz and an analysis frame of 5520 samples; such a long frame was used in order to take at least 2 periods of any analyzed sound, since 5520 samples correspond to 2 periods of the lowest audible (16 Hz) sound for this fs,
– f1/2: contents of subharmonics in the spectrum (overblow),
– Qt: duration of the quasi-steady state in proportion to the total sound time, Qt ∈ [0, 1],
– Et: duration of the ending transient of the sound in proportion to the total sound time, Et ∈ [0, 1],
– Rl: velocity of fading of the ending transient [dB/s], computed from 10·log of the ratio of signal energies at the start and the end of the ending transient, divided by the transient duration, where Ts is the sampling period, R is the time moment of the end of the sound, S is the time moment when the ending transient begins, l is the number of samples in the sound period, and A(t) is the amplitude for the time instant t.
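The box-counting idea behind dfr can be illustrated on any sampled curve. This is a sketch under our own assumptions (the curve is rescaled to the unit square, and the exact grid setup of the original is not recoverable from the text):

```python
import numpy as np

def box_count_dimension(y, s):
    """Approximate fractal (box-counting) dimension of the graph of a
    sampled curve, in the spirit of dfr = -log N(s) / log s above.
    The graph is scaled into the unit square, covered by a grid of mesh
    length s, and N(s) counts the nonempty cells. Illustrative sketch."""
    y = np.asarray(y, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    rng = y.max() - y.min()
    yn = (y - y.min()) / (rng if rng else 1.0)       # scale graph to unit square
    cells = {(int(xi / s), int(yi / s)) for xi, yi in zip(x, yn)}
    return -np.log(len(cells)) / np.log(s)
```

As a sanity check, a densely sampled straight line yields a dimension close to 1 for a suitable mesh.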
Attributes 63–69 are decision attributes, representing various ways of classification of
sounds in the database. They mark from 2 up to 18 classes, grouping all objects from the database:
– attribute 63: 18 classes; each one contains records representing sounds of the same instrument, played with the same technique (articulation): flute, oboe, clarinet, trumpet, trumpet muted, trombone, trombone muted, French horn, French horn muted, tuba, violin vibrato, violin pizzicato, viola vibrato, viola pizzicato, cello vibrato, cello pizzicato, double bass vibrato, and double bass pizzicato,
– attribute 64: 2 classes, groups of instruments: strings and winds,
– attribute 65: 3 classes, groups of instruments: strings, woodwinds, and brass,
– attribute 66: 2 classes, representing articulation techniques: vibrato and non-vibrato (including pizzicato),
– attribute 67: 5 classes, representing groups of instruments played with the same articulation: strings vibrato, strings pizzicato, winds vibrato not muted, winds non-vibrato not muted, and winds muted,
– attribute 68: 5 classes, representing groups of instruments played with the same articulation, with woodwinds distinguished: strings vibrato, strings pizzicato, woodwinds, brass not muted, and brass muted,
– attribute 69: 11 classes, instruments: flute, oboe, clarinet, trumpet, trombone, French horn, tuba, violin, viola, cello, and double bass.
These groups are based on the general classification of musical instruments (Hornbostel and
Sachs, 1914; Fletcher and Rossing, 1991), as well as on methods of sound production, i.e.,
articulation. Such grouping also reflects research in this domain, since musical instrument
sounds are classified at the instrument, family, or articulation level.
5. Experiments
The main problem in automatic classification of audio data is the appropriate sound parameterization,
since even the most advanced classifier will not yield good results for
poorly parameterized data. As we mentioned in Section 2, there exist many parameterization
methods that can be applied. We performed a series of experiments for the spectral and
timbral parameters described in Section 4, in order to check internal dependencies within
the conditional attributes, as well as estimate dependencies of the decision attributes on
particular conditional attributes. Such experiments allow us to assess our parameterization.
Also, we can look directly for dependencies within the data by searching for regularities in
the form of equations (Langley et al., 1986). Whatever means of parameterization assessment
we choose, the obtained results widen our knowledge of the data.
The parametric description of our data, represented by 62 conditional attributes, can be
placed in a table, where 5 columns for attributes 1–55 represent 5 moments of the sound: the
beginning, the middle and the end of the attack, and the moments of maximal and minimal
amplitude during the quasi-steady state of the sound. These attributes are presented in
Table 2 (Wieczorkowska, 1999b). Rows of the first 5 columns represent the same features
of sound, but measured at various time instants.
The experiments performed with use of the 49er system allowed us to check how classification
attributes depend on the conditional ones. As we expected, there are dependencies between
the time-domain conditional attributes no. 60 (duration of the quasi-steady state) and 61 (duration of
the ending transient), and the decision attributes. The main purpose of including these parameters
in the feature vector was to classify pizzicato sounds, which have a short quasi-steady state
and a long ending transient. For quantization of the attributes' domains into 6 intervals of equal
width, dependencies with Cramér's V > 0.5 were found in the following cases:
– for attributes no. 61 and 63, Cramér's V = 0.527,
– for attributes no. 60 and 63, Cramér's V = 0.552,
– for attributes no. 60 and 64, Cramér's V = 0.557,
– for attributes no. 60 and 67, Cramér's V = 0.504,
– for attributes no. 60 and 68, Cramér's V = 0.511.
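The equal-width quantization and contingency-table construction used throughout these experiments can be sketched as follows (function names are ours):

```python
import numpy as np

def quantize_equal_width(values, k=6):
    """Quantize a real-valued attribute into k intervals of equal width,
    as in the experiments above. Returns interval indices 0..k-1 and the
    left endpoints of the intervals."""
    v = np.asarray(values, dtype=float)
    edges = np.linspace(v.min(), v.max(), k + 1)
    idx = np.clip(np.digitize(v, edges[1:-1]), 0, k - 1)
    return idx, edges[:-1]

def contingency(a, b):
    """Contingency table of two discretized attributes: entry [i, j]
    counts records with a == i and b == j."""
    table = np.zeros((int(a.max()) + 1, int(b.max()) + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    return table
```

The resulting table can then be screened with Cramér's V, as in Section 3, to select pairs worth a closer look.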
The contingency table for the first pair of attributes, i.e., no. 61 and 63, is shown in Table 3.
As we can see, despite only moderate correlation between these attributes, attribute no. 61
alone allows almost correct recognition of pizzicato. For a61 ≥ 0.3333 the sound can be
Table 2. Conditional attributes in the database of musical instrument sounds.

                   Attack                              Steady state          General
  Beginning     Middle        End           Maximum       Minimum
  1. nfv        12. nfv       23. nfv       34. nfv       45. nfv       56. Vb
  2. fv         13. fv        24. fv        35. fv        46. fv        57. f1
  3. Tr1        14. Tr1       25. Tr1       36. Tr1       47. Tr1       58. dfr
  4. A12        15. A12       26. A12       37. A12       48. A12       59. f1/2
  5. H3,4       16. H3,4      27. H3,4      38. H3,4      49. H3,4      60. Qt
  6. H5,6,7     17. H5,6,7    28. H5,6,7    39. H5,6,7    50. H5,6,7    61. Et
  7. H8,9,10    18. H8,9,10   29. H8,9,10   40. H8,9,10   51. H8,9,10   62. Rl
  8. Hrest      19. Hrest     30. Hrest     41. Hrest     52. Hrest
  9. Od         20. Od        31. Od        42. Od        53. Od
  10. Ev        21. Ev        32. Ev        43. Ev        54. Ev
  11. Br        22. Br        33. Br        44. Br        55. Br
Table 3. Contingency table for attributes no. 63 (a63) and 61 (a61). The domain of attribute no. 61 was quantized into 6 intervals of equal width (left endpoints of the intervals shown in the table). Cramér's V for this table is V = 0.527.
Violin vibrato 41 4 0 0 0 0
Violin pizzicato 0 0 2 25 12 1
Viola vibrato 30 11 1 0 0 0
Viola pizzicato 0 0 6 24 4 0
Tuba non-vibrato 28 4 0 0 0 0
Trombone non-vibrato muted 33 0 0 0 0 0
Trombone non-vibrato 35 1 0 0 0 0
Oboe vibrato 10 22 0 0 0 0
French horn non-vibrato muted 1 33 3 0 0 0
French horn non-vibrato 14 23 0 0 0 0
Flute vibrato 37 0 0 0 0 0
Trumpet non-vibrato muted 8 11 9 3 0 0
Trumpet non-vibrato 18 12 4 0 0 0
Double bass vibrato 34 10 0 0 0 0
Double bass pizzicato 1 3 11 24 3 0
Clarinet 37 0 0 0 0 0
Cello vibrato 25 18 3 1 0 0
Cello pizzicato 0 0 5 24 10 0
0.0 0.1666 0.3333 0.5 0.6666 0.8333 a61
Table 4. Contingency table for attributes no. 64 (a64) and 58 (a58) for the investigated data. The bottom row represents the cutting points (left endpoints) of the attribute's domain in the quantization process. Cramér's V for this table is V=0.39910099208415772.
Winds 1 0 4 0201 2162646171431915
Strings 0 1 11 3177184259391301320
1.404 1.408 1.41 1.412 1.413 1.414 1.415 1.416 1.417 1.418 1.419 1.42 1.421 1.422 1.423 a58
Table 5. Contingency table for attributes no. 65 (a65) and 58 (a58) for the investigated data. The bottom row represents the cutting points (left endpoints) of the attribute's domain in the quantization process. Cramér's V for this table is V=0.4479197065980941.
Brass 1 0 0 0201 2162646714318140
Woodwinds 0 0 4 0000 00001000101
Strings 0 1 11 31771842593913013200
1.404 1.408 1.41 1.412 1.413 1.414 1.415 1.416 1.417 1.418 1.419 1.42 1.421 1.422 1.423 1.43 a58
classified as pizzicato with only 24 out of 679 cases misclassified (4 double bass pizzicato
sounds omitted and 20 other sounds mistaken for pizzicato), so recognition of pizzicato
reaches 96.47% for a single conditional attribute.
Also, attribute no. 58, which represents the approximate fractal dimension of the graph of the
spectrum amplitude envelope in decibel scale, alone allows identification of the class in some cases.
The contingency table for this attribute and the decision attribute no. 64 (winds and strings)
is shown in Table 4. Cramér's V in this case is about 0.4, so these attributes are moderately
correlated. However, higher values of attribute no. 58 indicate winds (above 1.423 with
full accuracy), and values between 1.41 and 1.416 indicate strings with high accuracy. The
dependencies for the subdivision of winds into woodwinds and brass are shown in Table 5. As
we can see, within wind instruments, the brass subgroup is somewhat easier to classify than
woodwinds on the basis of attribute 58.
Apart from observing how classification attributes depend on conditional ones, we also
traced internal dependencies between these conditional attributes. The dependencies may
appear between attributes taken from the same column of Table 2, i.e., for spectral attributes
calculated at the same time instant, or within any row of the table, i.e., between the same
attribute calculated for various time instants. We may expect that dependencies within one
column in Table 2, if any, may also appear in adjacent columns, especially those representing the
same stage of the sound. Our expectations have been fulfilled to some extent in experiments.
For instance, attributes no. 33 and 55, and also 44 and 55, representing brightness, show
strong functional dependencies (high Cramér's V) for the division of the attribute domain into
6 intervals of equal width. These dependencies are shown in Table 6 and Table 7.
Table 6. Contingency table for conditional attributes no. 55 (a55) and 33 (a33). The domain of each attribute was quantized into 6 intervals of equal width (left endpoints of the intervals shown in the table). Cramér's V for this table is V=0.71332504107522376.
16.5 0 0 0 0 0 5
13.4 0 0 1 4 3 0
10.3 0 0 7 11 1 0
7.2 0 12 33 0 0 0
4.0999 27 140 14 1 0 0
1.0 330 82 8 0 0 0
1.05 4.1616 7.2733 10.385 13.496 16.608 a33
Table 7. Contingency table for conditional attributes no. 55 (a55) and 44 (a44). The domain of each attribute was quantized into 6 intervals of equal width (left endpoints of the intervals shown in the table). Cramér's V for this table is V=0.71883893741475813.
16.5 0 0 0 0 0 5
13.4 0 0 0 3 3 2
10.3 0 0 4 14 0 1
7.2 0 19 25 1 0 0
4.0999 54 126 1 0 1 0
1.0 404 16 0 0 0 0
1.0 4.7166 8.4333 12.15 15.866 19.583 a44
Since attributes a33, a44, and a55 are quite strongly dependent, we decided to use Forty-Niner
to search for equations that describe these dependencies. Since the results depend on
the quantization applied to real-valued attributes, we performed the search for the above-mentioned
quantization into 6 intervals and for a more dense quantization, into 20 intervals. 49er found
the following approximate equations:
– for quantization into 6 intervals:
a55 = log(0.15773866890352523 · exp a33 + 0.86497209655608553),
a55 = log(0.34725611904424447 · a33 + 1.0135896217313733),
a55 = log(1.0790066406606611 · a44 + 1.0236649568472604),
a55 = 0.91970676829729114 · a44 + 0.12014294505684338,
a55 = log(0.19493285552362866 · exp a44 + 0.88363857593852102),
– for quantization into 20 intervals:
a55 = 0.94737351948337389 · a33 + 0.94737351948337389,
a55 = 0.9383914871946869 · a44 + 0.37121004565206117.
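Equations of the linear and log-linear forms above can be fitted by ordinary least squares once the variables are suitably transformed. This is a sketch of the idea, not 49er's actual search procedure, and the function names are ours:

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = a*x + b, the simplest of the equation
    forms listed above."""
    a, b = np.polyfit(x, y, 1)
    return float(a), float(b)

def fit_log_linear(x, y):
    """Fit y = log(a*x + b) by linearizing: exp(y) = a*x + b, then an
    ordinary least-squares fit on the transformed variable."""
    a, b = np.polyfit(x, np.exp(y), 1)
    return float(a), float(b)
```

The exp-inside-log forms (a55 = log(c·exp a33 + d)) can be handled the same way, transforming both sides before the linear fit.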
We can see that the brightness of sound (which is below 5 in more than half of the cases and
generally below 7) is quite predictable once the attack of the sound has finished, although
the form of the dependencies depends on the quantization performed. We can also see that
the range of brightness values is lower for maximal than for minimal amplitude during the
steady state.
Generally, most of the attributes from the same row for the 3rd, 4th and 5th columns show
dependencies, apart from the first 2 rows, i.e., the number of the partial with the greatest frequency
deviation (for the 5 low partials) and the mean frequency deviation for the 5 low partials. For instance,
H3,4 (content of the 3rd and 4th partials), i.e., attributes no. 27, 38 and 49, show dependencies with
Cramér's V > 0.5. In general, the dependencies within conditional attributes are stronger
than those between decision and conditional ones. This confirms the observation that sounds
basically stabilize at the end of the attack, but changes can still happen in the inharmonicity of
low partials.
Apart from dependencies between columns, there are also dependencies within attribute values of the same column of Table 2. Attributes Tr1, A12, and Ev, i.e., the energy of the fundamental in proportion to the whole spectrum, the amplitude difference between the 1st and 2nd partials, and the contents of even partials in the spectrum, respectively, show dependencies of both types among each other, if calculated starting from the end of the attack. This is illustrated in Tables 8 and 9. Probably the predominant fundamental and weak higher partials are the cause of the detected dependencies, and we expect that representing the attributes on a logarithmic scale may lessen these dependencies. Another factor influencing dependencies is quantization. Cramér's V decreases when quantization divides the domain into more intervals, and increases when the number of intervals gets smaller (see Table 10).
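The Cramér's V values quoted for the contingency tables can be recomputed directly from the table counts. A minimal sketch, using the 2×2 counts reported in Table 10:

```python
import numpy as np

def cramers_v(table):
    """Cramér's V for a contingency table, via the chi-square statistic."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)   # row marginals
    col = table.sum(axis=0, keepdims=True)   # column marginals
    expected = row @ col / n                 # expected counts under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1                 # min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Counts for a48 vs a37 quantized into 2 intervals (Table 10)
v = cramers_v([[28, 234], [344, 73]])
print(v)  # ≈ 0.7023, the value reported for Table 10
```

The same function applied to the 6×6 tables reproduces the values quoted in Tables 8, 9, and 11, which illustrates how the quoted coefficients depend only on the quantized counts.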
Strong dependencies appear between attributes no. 30, 33, 41, 44, 52, and 55, i.e., for the energy of the higher partials in the spectrum and for the brightness of the sound, starting from the end of the attack. Such dependencies are quite obvious, since for low brightness the contents of high partials must be low. These dependencies appear not only within a column or row, which is shown in Table 11. Therefore we can see that the brightness of sound and the contents of higher partials behave similarly after the sound is stabilized.

Table 8. Contingency table for conditional attributes no. 48 (a48) and 47 (a47). The domain of each attribute was quantized into 6 intervals of equal width (left endpoints of the intervals shown in the table). Cramér's V for this table is V = 0.44151702557398309.

  a48 \ a47:    0.0  0.1666  0.3333     0.5  0.6666  0.8333
    33.583        0       0       0       3       4       9
    20.166        0       3       3       2       4      46
     6.75        18       5      15      14      33     103
    −6.666       85      66      46      37      30       0
   −20.08       123       0       0       0       0       0
   −33.5         30       0       0       0       0       0

Table 9. Contingency table for conditional attributes no. 48 (a48) and 37 (a37). The domain of each attribute was quantized into 6 intervals of equal width (left endpoints of the intervals shown in the table). Cramér's V for this table is V = 0.57730878183310763.

  a48 \ a37:  −39.1  −25.11  −11.13    2.85  16.833  30.816
    33.583        0       0       0       1       4      11
    20.166        0       0       1      13      35       9
     6.75         0       1      26     117      42       2
    −6.666        1       6     194      57       6       0
   −20.08         5      47      61       9       1       0
   −33.5         19      10       1       0       0       0

Table 10. Contingency table for conditional attributes no. 48 (a48) and 37 (a37), with quantization of the domain of each attribute into 2 intervals. Cramér's V for this table is V = 0.70233561024363467.

  a48 \ a37:  ≤−11.13  >−11.13
   >−6.666         28      234
   ≤−6.666        344       73

Table 11. Contingency table for conditional attributes no. 52 (a52) and 44 (a44). The domain of each attribute was quantized into 6 intervals of equal width (left endpoints of the intervals containing data are shown in the table). Cramér's V for this table is V = 0.58320664069863415.

  a52 \ a44:    1.0  4.7166  8.4333   12.15  15.866  19.583
    0.6666        0       0       0       2       0       4
    0.5           0       0       0       8       0       0
    0.3333        0       0       3       3       3       1
    0.1666        0       0       7       5       0       1
    0.0         458     161      20       0       1       2
For the first two columns, there are no significant dependencies between attributes from the same rows; see for example Table 12. The second and third columns (apart from attributes no. 18 and 29, i.e., the energy of harmonics no. 8, 9, and 10 in the spectrum) also do not show significant dependencies, thus confirming that musical instrument sounds evolve dramatically in their initial phase.
Table 12. Contingency table for conditional attributes no. 23 (a23) and 1 (a1). Cramér's V for this table is

  a23 \ a1
    5        45   38   17   17    9
    4        33   24   20   15    9
    3        37   32   28   21   12
    2        28   40   21   20   11
    1        97   34   33   22   16
Table 13. Recognition accuracy for the investigated data, obtained using decision trees (C4.5 algorithm).
Attribute no.: 63 64 65 66 67 68 69
70/30 split (%) 63.40 89.92 85.26 87.72 80.10 81.08 65.11
90/10 split (%) 77.00 93.33 89.63 91.11 93.41 86.67 77.00
All the results commented on above confirm our presumption that during the attack the sound evolves with significant changes. After the end of the attack, the sound basically stabilizes, but there are still changes in its spectrum. This also reflects the attention that human experts pay to the attack of the sound: the onset is necessary for them to classify musical instruments.
The data investigated above have also been used for training and testing of classification algorithms. The results obtained in experiments with decision trees are presented in Table 13. These experiments were performed using the C4.5 algorithm, in this case with standard settings and the m parameter set to 1 (i.e., where a sensible test required 2 branches with at least 1 case each (Quinlan, 1993)). The validation technique influences the results, and smaller testing sets produce more optimistic results. Also, algorithm settings adjusted to the data for each decision attribute separately contribute to an improvement of the recognition accuracy.
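The influence of the validation split can be illustrated without the original sound data. The sketch below is a toy stand-in: it evaluates a one-level decision stump (a degenerate decision tree, not C4.5 itself) on a synthetic two-class dataset of the same size as our database, under both a 70/30 and a 90/10 holdout split.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the sound data: one feature, two classes
# (the real experiments used C4.5 on multi-attribute feature vectors).
n = 679
x = rng.normal(0.0, 1.0, n) + np.where(rng.random(n) < 0.5, -1.0, 1.0)
y = (x + rng.normal(0.0, 0.8, n) > 0).astype(int)

def holdout_accuracy(x, y, test_frac, seed):
    """Evaluate a one-level decision stump with a random holdout split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    # Pick the threshold on the training part that maximizes training accuracy
    thresholds = np.unique(x[train])
    best = max(thresholds, key=lambda t: np.mean((x[train] > t) == y[train]))
    return float(np.mean((x[test] > best) == y[test]))

# Repeat each split over several random seeds: the 10% test sets yield
# noisier accuracy estimates than the 30% test sets.
acc_30 = [holdout_accuracy(x, y, 0.30, s) for s in range(20)]
acc_10 = [holdout_accuracy(x, y, 0.10, s) for s in range(20)]
print(round(float(np.mean(acc_30)), 2), round(float(np.mean(acc_10)), 2))
```

Because a 10% test set contains only about 68 examples here, individual estimates scatter more widely, which is one mechanism behind the more optimistic 90/10 figures in Table 13.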
The investigations and observations discussed above were calculated for our data, describing musical instrument sounds. Such experiments can be applied to any data, especially when parameterization is not standardized and can be performed in an arbitrary way. Finding dependencies within the data gives us hints toward the most successful parameterization.
6. Considerations on knowledge discovery in sound data:
Possible industrial applications
This paper concentrates on discovering dependencies between attributes used for musical
instrument sound description. However, one can imagine further research and applications
of investigations on sound databases.
First of all, further research on automatic classification of musical sound can be expected, based on the MPEG-7 standard for multimedia content description (ISO/IEC, 2002). This standard supports some degree of interpretation of the information's meaning, and addresses a wide range of applications that involve search and browsing (Manjunath et al., 2002), so MPEG-7 is gaining increasing interest worldwide. Automatic labeling of multimedia data is vital for any content-based search of these data, so extraction of audio-visual information has become a domain of interest for many researchers. The experts involved in the elaboration of the MPEG-7 standard expect industry competition to lead to improvements in automatic extraction of sound features (at least low-level ones) and content-based searching.
Discovering dependencies between audio descriptors may lead to more efficient sound representation and searching through audio databases. This can be useful for many purposes, including precise labeling of audio recordings and searching for specific sounds in a recording. Also, sounds can be labeled with subjective descriptors of timbre (warm, sharp, nasal, etc.), the character of the piece (pop, romantic, hard rock, folk, country or region and time of origin, etc.), the quality of the recording (perfect, noises, cracks, etc.), and so on. Yet other labels can be used to represent information necessary to select illustrative sounds or sound effects for radio and television broadcasts, like news, programs for children, soap operas, science fiction, documentaries, etc. The specificity of the scene and action can also be taken into consideration to reflect the relation between image and sound, as in crowd scenes, landscapes, battles, etc. Of course, the creation of databases of audio data represented by appropriate, on-demand generated digital feature vectors will not be easy. In any case, the preliminary task is to identify possible applications of the data, and the jobs requiring human presence today, especially tedious jobs, and next to elaborate a sound description that can be useful in automating these tasks. Generally, groups of users that can be of interest here include sound engineers, producers, journalists, musicians, and so on. Sound databases can be created for specific groups of users. Of course, musical databases are not the only examples. We can easily imagine numerous technical applications, like diagnosis of mechanical devices (machine defects, etc.), since human experts also use acoustic data in tests of the performance of machines.
All these general ideas require the specification of details, like the length of the analyzing frame for sound description, and dealing with complicated and noisy recordings. Therefore extensive research on sound databases is necessary. The Forty-Niner system can be of great help here. For instance, dependencies between time points of the same sound can be found, as well as groups of objects, neglecting the decision attributes given by the user. The system can find regularities that may describe the data in a new, fresh way, discovering dependencies unknown so far, and lead to the discovery of the hidden, hierarchical structure of the data. For any regularity discovered in a sound database, there are potential uses of such regularities, like the reduction of the number of low-level descriptors.
There are no direct references between client-level problems and the problems mentioned in our attribute-related discussion. The client-related problems lie at the upper level. Also, parameterization (and also segmentation) should be performed for whole recordings, not only for the selected sounds, representing musical instruments in our case. The gap between high and low levels of description is common in any multimedia data, including image and video databases. However, the use of systems like 49er allows us to find dependencies in the data description, and thus helps create a model of relations within audio data at the client level as well. This connects the high and low levels of sound description, thus improving the usefulness of sound databases.
7. Conclusions
The considerations given in our paper focus on searching for the best description of musical instrument sounds for timbre classification (instrument classification) purposes. This research can also be useful when searching for similarities and a taxonomy for any sounds. In our research, we were interested in discovering regularities in the investigated database and expanding our knowledge about sound representation. Our aim was to find dependencies between attributes and interpret them. We investigated dependencies between various conditional attributes calculated at the same time instant, dependencies between the same feature vectors calculated for various stages of sound evolution, and dependencies between decision and conditional attributes. As a result, we found various regularities, like the dependency between the brightness of sound and the contents of higher harmonics in the spectrum, and dependencies between attributes describing the contents of selected groups of partials in the spectrum. We also observed a partial stabilization of sound properties after the end of the attack, although during the attack sound features evolve quite dramatically, and there are still changes during the quasi-steady state of the sound. Our investigation is an example of research on musical data description, and similar experiments on any arbitrarily chosen parametric representation of data should shed light on the data representation and the feasibility of classification.
We can expect that the development of the MPEG-7 standard for multimedia content description will fire extensive work on audio databases, although this domain is still somewhat neglected in comparison with image and video databases. We hope that finding dependencies in sound description and hidden hierarchies in audio data may help in the exploration of sound databases in the future.
Acknowledgments

The presented work was partially done while preparing the Ph.D. dissertation of A. Wieczorkowska under the direction of A. Czyżewski at the Sound and Vision Engineering Department, Faculty of Electronics, Telecommunications and Informatics at the Gdańsk University of Technology. This research was partially financed by the Research Center of PJIIT, supported by the Polish National Committee for Scientific Research (KBN).
References

Ando, S. and Yamaguchi, K. (1993). Statistical Study of Spectral Parameters in Musical Instrument Tones. J. Acoust. Soc. of America, 94(1), 37–45.
Brown, J.C. (1999). Computer Identification of Musical Instruments Using Pattern Recognition with Cepstral Coefficients as Features. J. Acoust. Soc. of America, 105, 1933–1941.
Brown, J.C., Houix, O., and McAdams, S. (2001). Feature Dependence in the Automatic Identification of Musical Woodwind Instruments. J. Acoust. Soc. of America, 109, 1064–1072.
De Poli, G., Piccialli, A., and Roads, C. (1991). Representations of Musical Signals. MIT Press.
Eronen, A. and Klapuri, A. (2000). Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2000 (pp. 753–756). Plymouth, MA.
Fletcher, N.H. and Rossing, T.D. (1991). The Physics of Musical Instruments. Springer-Verlag.
Fujinaga, I. and MacMillan, K. (2000). Realtime Recognition of Orchestral Instruments. In Proceedings of the International Computer Music Conference (pp. 141–143).
Herrera, P., Amatriain, X., Batlle, E., and Serra, X. (2000). Towards Instrument Segmentation for Music Content Description: A Critical Review of Instrument Classification Techniques. In Proc. International Symp. on Music Information Retrieval ISMIR 2000, Plymouth, MA.
Hornbostel, E.M. and Sachs, C. (1914). Systematik der Musikinstrumente. Ein Versuch. Zeitschrift für Ethnologie, 46(4/5), 553–90. Also available at
ISO/IEC JTC1/SC29/WG11. (2002). MPEG-7 Overview (Version 8). Available at http://mpeg.telecomitalialab.
Kaminskyj, I. (2000). Multi-Feature Musical Instrument Sound Classifier. MikroPolyphonie, The Online Contemporary Music Journal, 6.
Kostek, B. and Czyzewski, A. (2001). Representing Musical Instrument Sounds for Their Automatic Classification. J. Audio Eng. Soc., 49(9), 768–785.
Kostek, B. and Wieczorkowska, A. (1997). Parametric Representation of Musical Sounds. Archives of Acoustics, 22, 3–26.
Langley, P., Zytkow, J.M., Simon, H.A., and Bradshaw, G.L. (1986). The Search for Regularity: Four Aspects of Scientific Discovery. In R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Machine Learning, Vol. 2 (pp. 425–469), Palo Alto, CA: Morgan Kaufmann Publishers.
Lubniewski, Z. and Stepnowski, A. (1998). Sea Bottom Recognition Method Using Fractal Analysis and Scattering Impulse Response. Archives of Acoustics, 23, 499–511.
Manjunath, B.S., Salembier, P., and Sikora, T. (2002). Introduction to MPEG-7. Multimedia Content Description Interface. Chichester, UK: John Wiley and Sons.
Martin, K.D. and Kim, Y.E. (1998). 2pMU9. Musical Instrument Identification: A Pattern-Recognition Approach. Presented at the 136th Meeting of the Acoustical Society of America.
Opolko, F. and Wapnick, J. (1987). MUMS: McGill University Master Samples (in compact discs). Montreal, Canada: McGill University.
Pollard, H.F. and Jansson, E.V. (1982). A Tristimulus Method for the Specification of Musical Timbre. Acustica, 51, 162–171.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Subrahmanian, V.S. (1998). Principles of Multimedia Database Systems. San Francisco, CA: Morgan Kaufmann.
Wieczorkowska, A. (1999a). The Recognition Efficiency of Musical Instrument Sounds Depending on Parameterization and Type of a Classifier. Ph.D. Thesis, Technical University of Gdansk, Poland.
Wieczorkowska, A. (1999b). Rough Sets as a Tool for Audio Signal Classification. In Z.W. Ras and A. Skowron (Eds.), Foundations of Intelligent Systems (pp. 367–375), LNCS/LNAI 1609. Springer.
Zembowicz, R. and Żytkow, J.M. (1996). From Contingency Tables to Various Forms of Knowledge in Databases. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 329–349). AAAI Press.
Żytkow, J.M. and Zembowicz, R. (1993). Database Exploration in Search of Regularities. Journal of Intelligent Information Systems, 2, 39–81.