
Using Psycho-Acoustic Models and Self-Organizing Maps to Create a Hierarchical Structuring of Music by Musical Styles.



Using Psycho-Acoustic Models and to create a Hierarchical Structuring of Music
Using Psycho-Acoustic Models and Self-Organizing Maps
to Create a Hierarchical Structuring of
Music by Sound Similarity
Andreas Rauber
Dept. of Software Technology
Vienna Univ. of Technology
A-1040 Vienna, Austria
Elias Pampalk
Austrian Research Institute for
Artificial Intelligence
A-1010 Vienna, Austria
Dieter Merkl
Dept. of Software Technology
Vienna Univ. of Technology
A-1040 Vienna, Austria
With the advent of large musical archives the need to provide an
organization of these archives becomes evident. While artist-based
organizations or title indexes may help in locating a specific piece
of music, a more intuitive, genre-based organization is required to
allow users to browse an archive and explore its contents. Yet,
currently these organizations following musical styles have to be
designed manually.
In this paper we propose an approach to automatically create a
hierarchical organization of music archives based on their perceived
sound similarity. More specifically, characteristics of frequency
spectra are extracted and transformed according to psycho-acoustic
models. Subsequently, the Growing Hierarchical Self-Organizing
Map, a popular unsupervised neural network, is used to create a
hierarchical organization, offering both an interface for interactive
exploration as well as retrieval of music according to perceived
sound similarity.
With the availability of high-quality audio file formats at sufficient
compression rates, we find music increasingly being distributed
electronically via large music archives, offering music from the
public domain, selling titles, or streaming them on a pay-per-play
basis, or simply in the form of on-line retailers for conventional
distribution channels. A core requirement for these archives is the
possibility for the user to locate a title he or she is looking for, or to
find out which types of music are available in general.
Thus, those archives commonly offer several ways to find a desired
piece of music. A straightforward approach is to use text based
queries to search for the artist, the title or some phrase in the lyrics.
While this approach allows the localization of a desired piece of mu-
sic, it requires the user to know and actively input information about
the title he or she is looking for. An alternative approach, allow-
ing users to explore the music archive, searching for musical styles,
rather than for a specific title or group, is thus usually provided in the
form of genre hierarchies such as , , . Hence, a
customer looking for an opera recording might look into the
section, and will there find - depending on the further organization of
the music archive - a variety of interpretations, being similar in style,
and thus possibly suiting his or her likings. However, such organi-
zations rely on manual categorizations and usually consist of several
hundred categories which involve high maintenance costs, in particular
for dynamic music collections, where multiple contributors
have to file their contributions accordingly. The inherent difficulties
of such taxonomies have been analyzed, for example, in [22].

Permission to make digital or hard copies of all or part of this
work for personal or classroom use is granted without fee provided
that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on
the first page. 2002 IRCAM - Centre Pompidou

Another approach taken by on-line music stores is to analyze the
behavior of customers to give those showing similar interests rec-
ommendations on music which they might appreciate. For example,
a simple approach is to give a customer looking for pieces similar
to recommendations on music which is usually bought
by people who also purchased . However, extensive and
detailed customer profiles are rarely available.
The , i.e. the system, outlined
in [26], facilitates exploration of music archives without relying on
further information such as customer profiles or predefined cate-
gories. It does not require the availability of detailed, high-quality
meta-data on the various pieces of music, or musical scores. Rather,
we rely on the sound information, present in the form of any acous-
tical wave format, as it is available e.g. from CD tracks or MP3 files.
Based on the sound signal we extract low-level features based on fre-
quency spectra dynamics, and process them using psycho-acoustic
models of our auditory system. The resulting representation allows
us to calculate to a certain degree the perceived similarity between
two pieces of music. We use this form of data representation as
input to the Growing Hierarchical Self-Organizing Map (GHSOM) [6],
an extension to the popular self-organizing map [13].
This neural network provides cluster analysis by mapping similar
data items close to each other on a map display. Specifically, the
GHSOM is capable of detecting hierarchical relationships in the
data, and thus produces a hierarchy of maps representing various
styles of music, into which the pieces of music are organized.
The remainder of this paper is organized as follows. Section 2 briefly
reviews the related work. The feature extraction process is presented
in detail in Section 3, followed by a description of the principles and
training procedure of the Self-Organizing Map and the Growing
Hierarchical Self-Organizing Map in Section 4. We then describe
experimental results, using both a reduced collection of 77 pieces
of music, as well as a larger archive consisting of 359 pieces in
Section 5. Finally, in Section 6 some conclusions are drawn.
A vast amount of research has been conducted in the area of content-
based music and audio retrieval. For example, methods have been
developed to search for pieces of music with a particular melody.
The queries can be formulated by humming and are usually trans-
formed into a symbolic melody representation, which is matched
against a database of scores usually given in MIDI format. Re-
search in this direction is reported in, e.g. [1, 2, 10, 16, 28]. Other
than melodic information it is also possible to extract and search for
style information using the MIDI format. For example, in [4] solo
improvised trumpet performances are classified into one of the four
styles: , , , or .
The MIDI format offers a wealth of possibilities, however, only
a small fraction of all electronically available pieces of music are
available as MIDI. A more readily available format is the raw audio
signal to which all other audio formats can be decoded. One of the
first audio retrieval approaches dealing with music was presented
in [35], where attributes such as the pitch, loudness, brightness and
bandwidth of speech and individual musical notes were analyzed.
Several overviews of systems based on the raw audio data have been
presented, e.g. [9, 18]. However, most of these systems do not treat
content-based music retrieval in detail, but mainly focus on speech
or partly-speech audio data, with one of the few exceptions being
presented in [17], using hummed queries against an MP3 archive
for melody-based retrieval.
Furthermore, only few approaches in the area of content-based music
analysis have utilized the framework of psychoacoustics. Psychoacoustics
deals with the relationship between physical sounds and the
brain’s interpretation of them, cf. [37]. One of the first exceptions
was [8], where psychoacoustic models are used to describe the simi-
larity of instrumental sounds. The approach was demonstrated using
a collection of about 100 instruments, which were organized using
a self-organizing map in a similar way as presented in this paper.
For each instrument a 300 milliseconds sound was analyzed and
steady state sounds with a duration of 6 milliseconds were extracted.
These steady state sounds can be regarded as the smallest possible
building blocks of music. A model of the human perceptual behav-
ior of music using psychoacoustic findings was presented in [30]
together with methods to compute the similarity of two pieces of
music. A more practical approach to the topic was presented in [33]
where music given as raw audio is classified into genres based on
musical surface and rhythm features. The features are similar to
the rhythm patterns we extract, the main difference being that we
analyze them separately in 20 frequency bands.
Our work is based on first experiments reported in [26]. In particular
we have redesigned the feature extraction process using psychoa-
coustic models. Additionally, by using a hierarchical extension of
the neural network for data clustering we are able to detect the
hierarchical structure within our archive.
The architecture of the system may be divided into 3
stages as depicted in Figure 1. Digitized music in good sound qual-
ity (44kHz, stereo) with a duration of one minute is represented
by approximately 10MB of data in its raw format describing the
physical properties of the acoustical waves we hear. In a prepro-
cessing stage, the audio signal is transformed, down-sampled and
split into individual segments (steps P1 to P3). We then extract fea-
tures which are robust towards non-perceptive variations and on the
other hand resemble characteristics which are critical to our hear-
ing sensation, i.e. rhythm patterns in various frequency bands. The
feature extraction stage can be divided into two subsections, consisting
of the extraction of the specific loudness sensation expressed in
Sone (steps S1 to S6), as well as the conversion into time-invariant
frequency-specific rhythm patterns (steps R1 to R3). Finally, the data
may be optionally converted, before being organized into clusters in
steps A1 to A3 using the GHSOM. The feature extraction steps are
further detailed in the following subsections, and the clustering
procedure is described in Section 4. The visualization metaphor
is only touched upon briefly due to space considerations.
3.1 Preprocessing
(P1) The pieces of music may be given in any audio file format,
such as MP3 files. We first decode these to the raw Pulse Code
Modulation (PCM) audio format.
(P2) The raw audio format of music in good quality requires huge
amounts of storage. As humans can easily identify the genre of a
piece of music even if its sound quality is rather poor, we can safely
reduce the quality of the audio signal. Thus, stereo sound
is first reduced to mono and the signal is then down-sampled from
44kHz to 11kHz, leaving a distorted, but still easily recognizable
sound signal comparable to phone line quality.
[Figure 1: Stages of the system.
Preprocessing: P1 Audio -> PCM; P2 Stereo -> Mono, 44kHz -> 11kHz; P3 music -> segments.
Specific loudness sensation: S1 Power Spectrum; S2 Critical Bands; S3 Spectral Masking; S4 Decibel (dB-SPL); S5 Phon (Equal Loudness); S6 Sone (Specific Loudness Sensation).
Rhythm: R1 Modulation Amplitude; R2 Fluctuation Strength; R3 Modified Fluctuation Strength.
Analysis: A1 Median vector (opt.); A2 Dimension Reduction (opt.); A3 GHSOM Clustering.
Visualization: Islands of Music and Weather Charts.]
(P3) We subsequently segment each piece into 6-second sequences.
The duration of 6 seconds ( samples) was chosen heuristically
because it is long enough for humans to get an impression of the
style of a piece of music while being short enough to optimize
the computations. However, analyses with various settings for the
segmentation have shown no significant differences with respect to
segment length. After removing the first and the last 2 segments
of each piece of music to eliminate lead-in and fade-out effects,
we retain only every third of the remaining segments for further
analysis. Again, the information lost by this type of reduction has
proven insignificant in various experimental settings.
We thus end up with several segments of 6 seconds of music every
18 seconds at 11kHz for each piece of music. The preprocessing
results in a data reduction by a factor of over 24 without losing
relevant information, i.e. a human listener is still able to identify the
genre or style of a piece of music given the few 6-second sequences
in lower quality.
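The preprocessing steps P2 and P3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the block averaging used for down-sampling is an assumed stand-in for a proper low-pass filter, and "the first and the last 2 segments" is read here as dropping two segments at each end. Reducing stereo to mono (factor 2), down-sampling from 44kHz to 11kHz (factor 4) and keeping every third segment (factor 3) gives the factor of 24, and trimming the ends pushes it slightly above that.

```python
def to_mono(left, right):
    """P2, first half: average the two stereo channels into one mono channel."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def downsample(signal, factor=4):
    """P2, second half: reduce the sampling rate by `factor` (44kHz -> 11kHz).

    Averaging each block of `factor` samples acts as a crude low-pass filter;
    this choice is an assumption of the sketch, not taken from the paper.
    """
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


def select_segments(signal, rate=11025, seconds=6, skip=2, step=3):
    """P3: cut into fixed-length segments, drop `skip` segments at each end
    (lead-in and fade-out), then keep every `step`-th remaining segment."""
    seg_len = rate * seconds
    count = len(signal) // seg_len
    segments = [signal[i * seg_len:(i + 1) * seg_len] for i in range(count)]
    trimmed = segments[skip:count - skip] if count > 2 * skip else []
    return trimmed[::step]
```

With these defaults, one retained 6-second segment starts every 18 seconds, matching the description above.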
3.2 Specific Loudness Sensation - Sone
Loudness belongs to the category of intensity sensations. The loud-
ness of a sound is measured by comparing it to a reference sound.
The 1kHz tone is a very popular reference tone in psychoacoustics,
and the loudness of the 1kHz tone at 40dB is defined to be 1 Sone.
A sound perceived to be twice as loud is defined to be 2 Sone and
so on. In the first stage of the feature extraction process, this specific
loudness sensation (Sone) per critical-band (Bark) in short time
intervals is calculated in 6 steps starting with the PCM data.
(S1) First the power spectrum of the audio signal is calculated. To
do this, the raw audio data is first decomposed into its frequencies
using a Fast Fourier Transform (FFT). We use a window
size of 256 samples, which corresponds to about 23ms at 11kHz,
and a Hanning window with 50% overlap. As the signal is sampled at
11kHz, the resulting spectrum covers frequencies up to 11 / 2 kHz,
i.e. 5.5 kHz.
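Step S1 can be illustrated with the short sketch below. It assumes a plain discrete Fourier transform written out directly (slow, but dependency-free) in place of an optimized FFT; the function names are hypothetical. Frames of 256 samples with a hop of 128 samples realize the 50% overlap, and each frame yields 129 power values covering 0 to about 5.5 kHz.

```python
import cmath
import math


def hann(n):
    """Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, size=256, hop=128):
    """Split the signal into frames of `size` samples with 50% overlap."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def power_spectrum(frame):
    """Squared magnitude of the windowed frame's DFT for bins 0 .. n/2."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

Bin k of a 256-sample frame at 11025Hz corresponds to a frequency of k * 11025 / 256 Hz, i.e. a spacing of roughly 43Hz between bins.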
(S2) The inner ear separates the frequencies and concentrates them
at certain locations along the basilar membrane. The inner ear
can thus be regarded as a complex system of a series of band-pass
filters with an asymmetrical shape of frequency response. The
center frequencies of these band-pass filters are closely related to the
critical-band rates, where frequencies are bundled into 24
critical-bands according to the Bark scale [37]. Where these bands should
be centered, or how wide they should be, has been analyzed through
several psychoacoustic experiments. Since our signal is limited to
5.5 kHz we use only the first 20 critical bands, summing up the
values of the power spectrum within the upper and lower frequency
limits of each band, obtaining a power spectrum of the 20 critical
bands for the segments.
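A sketch of the summation in step S2 is given below. The band limits are the commonly tabulated upper edges of the first 20 Bark bands; the paper does not list them, so they are an assumption here, as are the function name and the handling of bins that fall exactly on a band edge.

```python
# Upper frequency limits (Hz) of the first 20 critical bands (Bark scale),
# as commonly tabulated; assumed here, not quoted from the paper.
BARK_UPPER_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400]


def critical_band_spectrum(power, sample_rate=11025, n_fft=256):
    """Sum the power spectrum bins that fall into each critical band."""
    bands = [0.0] * len(BARK_UPPER_HZ)
    for k, value in enumerate(power):
        freq = k * sample_rate / n_fft
        for b, upper in enumerate(BARK_UPPER_HZ):
            if freq <= upper:
                bands[b] += value
                break
    return bands
```

With a 5.5 kHz signal the 20th band (5300Hz to 6400Hz) is only partly covered, which is consistent with using the first 20 of the 24 bands.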
(S3) Spectral Masking is the occlusion of a quiet sound by a louder
sound when both sounds are present simultaneously and have similar
frequencies. Spectral masking effects are calculated based on [31],
with a spreading function defining the influence of each critical
band on every other band being used to obtain a spreading matrix. Using
this matrix the power spectrum is spread across the critical bands
obtained in the previous step, where the masking influence of a
critical band is higher on bands above it than on those below it.
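The spreading step can be sketched as follows. The formula used is the widely cited Schroeder spreading function, which is assumed here to be the one meant by [31]; the matrix layout (row = affected band, column = masking band) and the function names are choices of this sketch.

```python
import math


def spreading_db(dz):
    """Spreading function in dB for a distance of dz Bark
    (affected band minus masking band); assumed form, see lead-in."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)


def spreading_matrix(n_bands=20):
    """matrix[i][j] is the linear-scale influence of band j on band i."""
    return [[10.0 ** (spreading_db(i - j) / 10.0) for j in range(n_bands)]
            for i in range(n_bands)]


def spread(bands, matrix):
    """Spread the critical-band power spectrum across neighboring bands."""
    return [sum(row[j] * bands[j] for j in range(len(bands))) for row in matrix]
```

The function falls off more slowly towards higher bands than towards lower ones, so a band masks the bands above it more strongly than those below it.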
(S4) The intensity unit of physical audio signals is sound pressure
and is measured in Pascal (Pa). The values of the PCM data
correspond to the sound pressure. Before calculating Sone values
it is necessary to transform the data into decibel. The decibel value
of a sound is calculated from the logarithm of the ratio between its
pressure and the pressure of the hearing threshold, also known as
dB-SPL, where SPL is the abbreviation for sound pressure level.
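As a small worked example of step S4, the conversion can be written as below. The reference pressure of 20 micropascal is the standard hearing-threshold reference; how PCM sample values are calibrated to Pascal is not stated in the text, so the input is simply assumed to be a pressure in Pa.

```python
import math

P_REF = 20e-6  # reference pressure at the hearing threshold, in Pascal


def db_spl(pressure_pa):
    """Sound pressure level in dB-SPL for an (assumed calibrated) pressure in Pa."""
    return 20.0 * math.log10(max(pressure_pa, 1e-12) / P_REF)
```

A tenfold increase in pressure thus corresponds to an increase of 20 dB.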
(S5) The relationship between the sound pressure level in decibel
and our hearing sensation measured in Sone is not linear. The
perceived loudness depends on the frequency of the tone. From the
dB-SPL values we thus calculate the equal loudness levels with their
unit Phon. The Phon levels are defined through the loudness in
dB-SPL of a tone with 1kHz frequency. A level of 40 Phon corresponds to
the loudness level of a 40dB-SPL tone at 1kHz. A pure tone at
any frequency with 40 Phon is perceived as loud as a pure tone
with 40dB-SPL at 1kHz. We are most sensitive to frequencies around
2kHz to 5kHz. The hearing threshold rapidly rises around the lower
and upper frequency limits, which are respectively about 20Hz and
16kHz. Although the values for the equal loudness contour matrix
are obtained from experiments with pure tones, they may be applied
to calculate the specific loudness of the critical band rate spectrum,
resulting in loudness level representations for the frequency ranges.
(S6) Finally, as the perceived loudness sensation differs for different
loudness levels, the specific loudness sensation in Sone is calculated
based on [3]. The loudness of the 1kHz tone at 40dB-SPL is defined
to be 1 Sone. A tone perceived twice as loud is defined to be 2 Sone
and so on. For values up to 40 Phon the sensation rises slowly,
increasing at a faster rate afterwards.
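The Phon-to-Sone mapping of step S6 can be sketched with a commonly cited two-part approximation. The text only states that the calculation is based on [3], so the exact formula and the exponent 2.642 below are assumptions; the sketch does reproduce the anchor points given above (40 Phon is 1 Sone, and a doubling of Sone every 10 Phon above that).

```python
def phon_to_sone(phon):
    """Specific loudness sensation (Sone) from loudness level (Phon).

    Assumed approximation: exponential growth above 40 Phon,
    a power law below 40 Phon.
    """
    if phon >= 40.0:
        return 2.0 ** ((phon - 40.0) / 10.0)
    return (phon / 40.0) ** 2.642
```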
Figure 2 illustrates the data after each of the feature extraction steps
using the first 6-second sequences extracted from Beethoven, Für Elise
and from Korn, Freak on a Leash.
[Figure 2: Feature extraction steps for Beethoven, Für Elise and Korn, Freak on a Leash.
Panels: PCM Audio Signal; Power Spectrum [dB] over Frequency [kHz];
Critical-Band Rate Spectrum [dB], Spread Critical-Band Rate Spectrum [dB],
Specific Loudness Level [phon] and Specific Loudness Sensation [sone],
each over Time [s] and Critical-band [bark].]
In Für Elise, the specific loudness sensation depicts each piano key played. On
the other hand, Freak on a Leash, which is classified as
, is quite aggressive. Melodic elements do not
play a major role and the specific loudness sensation is a rather
complex pattern spread over the whole frequency range, whereas
only the lower critical bands are active in Für Elise. Notice further
that the values of the patterns of Freak on a Leash are up to 18
times higher compared to those of Für Elise.
3.3 Rhythm Patterns
After the first preprocessing stage a piece of music is represented
by several 6-second sequences. Each of these sequences contains
information on how loud the piece is at a specific point in time in a
specific frequency band. Yet, the current data representation is not
time-invariant. It may thus not be used to compare two pieces of
music point-wise, as already a small time-shift of a few milliseconds
will usually result in completely different feature vectors. In the
second stage of the feature extraction process, we calculate a time-
invariant representation for each piece of music in 3 further steps,
namely the frequency-wise rhythm pattern. These rhythm patterns
contain information on how strong and fast beats are played within
the respective frequency bands.
[Figure 3: Rhythm pattern extraction for Beethoven, Für Elise and Korn, Freak on a Leash.
Panels: Specific Loudness Sensation over Time [s] and Critical-band [bark]
(panel labels 13.6Hz ± 1.5Hz and 16.9Hz ± 2.7Hz); Modulation Amplitude,
Fluctuation Strength and Modified Fluctuation Strength, each over
Modulation Frequency [Hz] and Critical-band [bark].]
(R1) The loudness of a critical-band usually rises and falls several
times resulting in a more or less periodical pattern, also known as
the rhythm. The loudness values of a critical-band over a certain
time period can be regarded as a signal that has been sampled at
discrete points in time. The periodical patterns of this signal can
then be assumed to originate from a mixture of sinusoids. These
sinusoids modulate the amplitude of the loudness, and can be
calculated by a Fourier transform. The modulation frequencies, which
can be analyzed using the 6-second sequences and time quanta of
12ms, are in the range from 0 to 43Hz with an accuracy of 0.17Hz.
Notice that a modulation frequency of 43Hz corresponds to almost
2600bpm. Thus, the amplitude modulation of the loudness sensation
per critical-band for each 6-second sequence is calculated using
an FFT of the 6-second sequence of each critical band.
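The figures quoted in step R1 follow from the framing used earlier: a hop of 128 samples at 11kHz gives one loudness value roughly every 12ms, i.e. about 86 values per second, so modulation frequencies up to half of that (43Hz) can be resolved, and a 6-second sequence gives a resolution of 1/6Hz, about 0.17Hz. The sketch below computes the modulation amplitudes of one critical band with a directly written DFT; the function name and the normalization by the sequence length are assumptions.

```python
import cmath
import math


def modulation_amplitude(loudness, frame_rate):
    """Return (modulation frequency in Hz, amplitude) pairs for the loudness
    curve of a single critical band, sampled at `frame_rate` values per second."""
    n = len(loudness)
    result = []
    for k in range(n // 2 + 1):
        acc = sum(loudness[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        result.append((k * frame_rate / n, abs(acc) / n))
    return result
```

Applying this function to each of the 20 critical bands of a segment yields the modulation amplitude per band and modulation frequency.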
(R2) The amplitude modulation of the loudness has different effects
on our sensation depending on the frequency. The sensation of
fluctuation strength is most intense at a modulation frequency
of around 4Hz and gradually decreases up to 15Hz. At 15Hz the
sensation of roughness starts to increase, reaches its maximum at
about 70Hz, and starts to decrease at about 150Hz. Above 150Hz
the sensation of hearing three separately audible tones increases.
It is the fluctuation strength, i.e. rhythm patterns up to 10Hz, which
corresponds to 600 beats per minute (bpm), that we are interested
in. For each of the 20 frequency bands we obtain 60 values for
modulation frequencies between 0 and 10Hz. This results in 1200
values representing the fluctuation strength.
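One way to turn modulation amplitudes into fluctuation strength is to weight them with a curve that peaks at 4Hz. The weighting 1 / (f/4 + 4/f) used below is a form found in the psychoacoustics literature and is an assumption of this sketch, since the text does not give the exact function; the function names are hypothetical. Restricting the result to modulation frequencies above 0 and up to 10Hz at a resolution of 1/6Hz leaves 60 values per band, i.e. 20 x 60 = 1200 values per segment.

```python
def fluctuation_weight(f_mod):
    """Assumed weighting of a modulation frequency (Hz), maximal at 4Hz."""
    if f_mod <= 0.0:
        return 0.0
    return 1.0 / (f_mod / 4.0 + 4.0 / f_mod)


def fluctuation_strength(mod_amps, max_hz=10.0):
    """Weight (frequency, amplitude) pairs and keep those in (0, max_hz]."""
    return [(f, a * fluctuation_weight(f)) for f, a in mod_amps if 0.0 < f <= max_hz]
```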
(R3) To distinguish certain rhythm patterns better and to reduce
irrelevant information, gradient and Gaussian filters are applied.
In particular, we use gradient filters to emphasize distinctive beats,
i.e. values that are high at a specific modulation frequency compared
to the values immediately below and above this specific frequency.
We further apply a Gaussian filter to increase the similarity between
two rhythm pattern characteristics which differ only slightly in the
sense of either being in similar frequency bands or having similar
modulation frequencies by spreading the according values. We thus
obtain modified fluctuation strength values that can be used as
feature vectors for subsequent cluster analysis.
The second part of the feature extraction process is summarized in
Figure 3. Looking at the modulation amplitude of Für Elise it
seems as though there is no beat. In the fluctuation strength subplot
the modulation frequencies around 4Hz are emphasized. Yet, there
are no clear vertical lines, as there are no periodic beats. On the other
hand, note the strong beat of around 7Hz in all frequency bands of
Freak on a Leash. For an in-depth discussion of the characteristics
of the feature extraction process, please refer to [23, 24].
Using the rhythm patterns we apply the Self-Organizing Map
(SOM) [13], as well as its extension, the Growing Hierarchical
Self-Organizing Map (GHSOM) [6] algorithm to organize the
pieces of music on a 2-dimensional map display in such a way that
similar pieces are grouped close together. In the following sections
we will briefly review the principles of the SOM and the GHSOM,
followed by a description of the last steps of the system,
i.e. the cluster analysis steps A1 to A3 in Figure 1.
4.1 Self-Organizing Maps
The Self-Organizing Map (SOM), as proposed in [12] and
described thoroughly in [13], is one of the most distinguished
unsupervised artificial neural network models. It basically provides
cluster analysis by producing a mapping of high-dimensional input
data onto a usually 2-dimensional output space while preserving the
topological relationships between the input data items as faithfully
as possible. In other words, the SOM produces a projection of the
data space onto a two-dimensional map space in such a way that
similar data items are located close to each other on the map.
More formally, the SOM consists of a set of units $i$, which are
arranged according to some topology, where the most common choice
is a two-dimensional grid. Each of the units $i$ is assigned a model
vector $m_i$ of the same dimension as the input data, $m_i \in \mathbb{R}^n$. In
the initial setup of the model prior to training, the model vectors
are frequently initialized with random values. However, more
sophisticated strategies such as, for example, Principal Component
Analysis, may be applied. During each learning step $t$, an input
pattern $x(t)$ is randomly selected from the set of input vectors and
presented to the map. Next, the unit showing the most similar model
vector with respect to the presented input signal is selected as the
winner $c$, where a common choice for similarity computation is the
Euclidean distance, cf. Expression 1.
$$c(t): \; \lVert x(t) - m_c(t) \rVert = \min_i \lVert x(t) - m_i(t) \rVert \qquad (1)$$
Adaptation takes place at each learning iteration and is performed
as a gradual reduction of the difference between the respective
components of the input vector and the model vector. The amount of
adaptation is guided by a monotonically decreasing learning-rate
$\alpha(t)$, ensuring large adaptation steps at the beginning of the training
process, followed by a fine-tuning-phase towards the end.
Apart from the winner, units in a time-varying and gradually
decreasing neighborhood around the winner are adapted as well. This
enables a spatial arrangement of the input patterns such that alike
inputs are mapped onto regions close to each other in the grid of
output units. Thus, the training process of the self-organizing map
results in a topological ordering of the input patterns. According
to [27] the self-organizing map can be viewed as a neural network
model performing a spatially smooth version of k-means clustering.
The neighborhood of units around the winner may be described
implicitly by means of a neighborhood-kernel $h_{ci}(t)$ taking into
account the distance – in terms of the output space – between unit $i$
under consideration and unit $c$, the winner of the current learning
iteration. A Gaussian may be used to define the neighborhood-kernel,
ensuring stronger adaptation of units close to the winner. It is
common practice that in the beginning of the learning process the
neighborhood-kernel is selected large enough to cover a wide area
of the output space. The spatial width of the neighborhood-kernel
is reduced gradually during the learning process such that towards
the end of the learning process just the winner itself is adapted.
[Figure 4: Architecture and learning process of a self-organizing map. Panels: Input Space, Output Space.]
In combining these principles of self-organizing map training, we
may write the learning rule as given in Expression (2), with $\alpha(t)$
representing the time-varying learning-rate, $h_{ci}(t)$ representing the
time-varying neighborhood-kernel, $x(t)$ representing the currently presented
input pattern, and $m_i$ denoting the model vector assigned to unit $i$.
$$m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t) - m_i(t)] \qquad (2)$$
A simple graphical representation of a self-organizing map’s
architecture and its learning process is provided in Figure 4. In this
figure the output space consists of a square of 36 units, depicted
as circles, forming a grid of $6 \times 6$ units. One input vector $x(t)$ is
randomly chosen and mapped onto the grid of output units. The
winner showing the highest activation is determined. Consider the
winner being the unit depicted as the black unit labeled $c$ in the
figure. The model vector of the winner, $m_c(t)$, is now moved towards
the current input vector. This movement is symbolized in the input
space in Figure 4. As a consequence of the adaptation, unit $c$ will
produce an even higher activation with respect to the input pattern
at the next learning iteration, $t+1$, because the unit’s model
vector, $m_c(t+1)$, is now nearer to the input pattern in terms of the
input space. Apart from the winner, adaptation is performed with
neighboring units, too. Units that are subject to adaptation are
depicted as shaded units in the figure. The shading of the various units
corresponds to the amount of adaptation, and thus, to the spatial
width of the neighborhood-kernel. Generally, units in close vicinity
of the winner are adapted more strongly, and consequently, they are
depicted with a darker shade in the figure.
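A single learning iteration as described above can be sketched as follows. This is a minimal illustration under assumptions: the map is stored as a dictionary from grid positions to model vectors, the neighborhood-kernel is a Gaussian of width sigma on the grid, and the learning-rate alpha and sigma are passed in by the caller, who would decrease them over time. All names are hypothetical.

```python
import math
import random


def make_map(rows, cols, dim, seed=0):
    """Create a rows x cols grid of randomly initialized model vectors."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}


def winner(models, x):
    """Grid position of the unit whose model vector is closest to x (Euclidean)."""
    return min(models,
               key=lambda u: sum((m - v) ** 2 for m, v in zip(models[u], x)))


def train_step(models, x, alpha, sigma):
    """Move the winner and its grid neighbors towards x; return the winner."""
    c = winner(models, x)
    for unit, m in list(models.items()):
        grid_dist_sq = (unit[0] - c[0]) ** 2 + (unit[1] - c[1]) ** 2
        h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))  # Gaussian kernel
        models[unit] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return c
```

For the winner itself the kernel value is 1, so with alpha = 0.5 its model vector moves exactly halfway towards the input pattern, while units further away on the grid move less.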
Being a decidedly stable and flexible model, the SOM has been
employed in a wide range of applications, ranging from financial data
analysis, via medical data analysis, to time series prediction,
industrial control, and many more [5, 13, 32]. It basically offers itself
to the organization and interactive exploration of high-dimensional
data spaces. One of its most prominent application areas is the
organization of large text archives [15, 19, 29], which, due to numerous
computational optimizations and shortcuts that are possible in this
NN model, scale up to millions of documents [11, 14].
However, due to its topological characteristics, the SOM not only
serves as the basis for interactive exploration, but may also be used as
an index structure to high-dimensional databases, facilitating
scalable proximity searches. A combination of SOMs and
R*-trees as an index to image databases has been reported, for
example, in [20, 21], whereas an index tree based on the SOM
is reported in [36]. Thus, the SOM offers itself in
a convenient way both for interactive exploration, as well as for
the indexing and retrieval, of information represented in the form
of high-dimensional feature spaces, where exact matches are either
impossible due to the fuzzy nature of data representation or
the respective type of query, or at least computationally prohibitive,
making it particularly suitable for image or music databases.
[Figure 5: Architecture of a GHSOM, showing layer 0 to layer 3.]
4.2 The GHSOM
The key idea of the Growing Hierarchical Self-Organizing Map (GHSOM) [6]
is to use a hierarchical structure of multiple layers, where each layer
consists of a number of independent SOMs. One SOM is used at
the first layer of the hierarchy. For every unit in this map a SOM
might be added to the next layer of the hierarchy, representing the
respective data in more detail. This principle is repeated with the
third and any further layers of the GHSOM.
Since one of the shortcomings of SOM usage is its fixed network
architecture we rather use an incrementally growing version of the
SOM. This relieves us from the burden of predefining the network’s
size which is rather determined during the unsupervised training
process. We start with a layer 0, which consists of only one single
unit. The weight vector of this unit is initialized as the average of
all input data. The training process basically starts with a small map
in layer 1, which is self-organized according to
the standard SOM training algorithm.
This training process is repeated for a fixed number of training
iterations. After every such round of training iterations the unit with
the largest deviation between its weight vector and the input vectors
represented by this very unit is selected as the error unit. In between
the error unit and its most dissimilar neighbor in terms of the input
space either a new row or a new column of units is inserted. The weight
vectors of these new units are initialized as the average of their neighbors.
An obvious criterion to guide the training process is the
quantization error, calculated as the sum of the distances between the
weight vector of a unit and the input vectors mapped onto this unit.
It is used to evaluate the mapping quality of a SOM based on the
mean quantization error (MQE) of all units in the map. A map
grows until its MQE falls below a certain fraction of the quantization
error of the unit in the preceding layer of the hierarchy. Thus, the
map now represents the data of the higher layer unit in more detail.
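The growth criterion can be sketched as follows, under stated assumptions: distances are Euclidean, the mean quantization error is averaged over the units that have input vectors mapped onto them, and the fraction controlling map growth is passed in as a hypothetical parameter `fraction`. The function names are illustrative only.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantization_error(weight, mapped_inputs):
    """Sum of distances between a unit's weight vector and its mapped inputs."""
    return sum(dist(weight, x) for x in mapped_inputs)


def mean_quantization_error(weights, mapping):
    """Average quantization error over all units with mapped inputs.

    `weights` maps unit -> weight vector, `mapping` maps unit -> list of inputs.
    """
    errors = [quantization_error(weights[u], xs) for u, xs in mapping.items() if xs]
    return sum(errors) / len(errors)


def error_unit(weights, mapping):
    """Unit with the largest quantization error, next to which the map grows."""
    return max(mapping, key=lambda u: quantization_error(weights[u], mapping[u]))


def keep_growing(mqe, parent_qe, fraction):
    """True while the map's MQE has not yet dropped below the given fraction
    of the quantization error of the parent unit in the preceding layer."""
    return mqe >= fraction * parent_qe
```

A smaller `fraction` demands a lower MQE before growth stops and therefore leads to larger maps.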
As outlined above the initial architecture of the GHSOM consists
of one SOM. This architecture is expanded by another layer in case
of dissimilar input data being mapped on a particular unit. These
units are identified by a rather high quantization error which is
above a given threshold. This threshold basically indicates the desired
granularity level of data representation as a fraction of the initial
quantization error at layer 0. In such a case, a new map will be
added to the hierarchy and the input data mapped on the respective
higher layer unit are self-organized in this new map, which again
grows until its mean quantization error (MQE) is reduced to a given
fraction of the respective higher layer unit’s quantization error. Note
that this does not necessarily lead to a balanced hierarchy. The depth
of the hierarchy will rather reflect the diversity in input data
distribution which should be expected in real-world data collections.
Depending on the desired fraction of MQE reduction we may end up with
either a very deep hierarchy with small maps, a flat structure with
large maps, or – in the extreme case – only one large map. The growth
of the hierarchy is terminated when no further units are available for
expansion.
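The decision which units receive a new map in the next layer can be sketched in a few lines. The parameter name `threshold` stands for the granularity fraction described above and is hypothetical, as is the function name; unit errors are assumed to have been computed beforehand.

```python
def units_to_expand(unit_errors, qe_layer0, threshold):
    """Return the units whose quantization error exceeds the given fraction
    of the layer-0 quantization error and which are therefore expanded."""
    return sorted(u for u, qe in unit_errors.items() if qe > threshold * qe_layer0)
```

Lowering `threshold` marks more units for expansion and thus yields a deeper hierarchy; when the returned list is empty for every map, the growth of the hierarchy stops.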
A graphical representation of a GHSOM is given in Figure 5. The
map in layer 1 provides a rough organization
of the main clusters in the input data. The six independent maps
in the second layer offer a more detailed view on the data. Two units
from one of the second layer maps have further been expanded into
third-layer maps to provide sufficiently granular input data
representation. By using a proper initialization of the maps added at each
layer in the hierarchy based on the parent unit’s neighbors, a global
orientation of the newly added maps can be reached [7]. Thus,
similar data will be found on adjoining borders of neighboring maps in
the hierarchy.
4.3 Cluster analysis of music data
The feature vectors extracted according to the process described in
Section 3 are used as input to the GHSOM. However, some further
intermediary processing steps may be applied in order to obtain
feature vectors for pieces of music, rather than music segments, as
well as to, optionally, compress the dimensionality of the feature
space as follows.
( ) Basically, each segment of music may be treated as an in-
dependent piece of music, thus allowing multiple assignment of a
given piece of music to multiple clusters of varying style if a piece of
music contains passages that may be attributed to different genres.
Also, a two-level clustering procedure may be applied to first group
the segments according to their overall similarity. In a second step,
the distribution of segments across clusters may be used as a kind
of to describe the characteristics of the whole piece
of music, using the resulting distribution vectors as an input to the
second-level clustering procedure [26].
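The distribution vector used in the second step can be illustrated as follows. This is a minimal sketch, assuming that each segment has already been assigned an integer cluster index in the first step; the function name and the normalization to relative frequencies are choices made for this example, not details taken from [26].

```python
from collections import Counter


def distribution_vector(segment_clusters, n_clusters):
    """Relative frequency of a piece's segments in each first-level cluster.

    segment_clusters: cluster index of every segment of one piece of music
    n_clusters:       number of first-level clusters
    """
    counts = Counter(segment_clusters)
    total = len(segment_clusters)
    return [counts.get(c, 0) / total for c in range(n_clusters)]


# Six segments of one hypothetical piece, spread over four clusters.
vector = distribution_vector([0, 0, 2, 2, 2, 3], n_clusters=4)
```

Each piece of music is then described by one such vector, whose length equals the number of first-level clusters regardless of how many segments the piece has.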
On the other hand, our research has shown that simply using the
median of all segment vectors belonging to a given piece of music
results in a stable representation of the characteristics of this piece of
music. We have evaluated several alternatives using Gaussian mixture
models, fuzzy c-means, and k-means, pursuing the assumption
that a piece of music contains significantly different rhythm patterns.
However, the median, despite being by far the simplest technique,
yielded comparable results to the more complex methods. Other
simple alternatives such as the mean proved to be too vulnerable
with respect to outliers.
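The difference between the two aggregates can be seen in a small sketch. The two-dimensional toy vectors below are invented for illustration (the actual feature vectors have 1200 dimensions), and the function names are hypothetical.

```python
from statistics import mean, median


def piece_median(segment_vectors):
    """Component-wise median over all segment vectors of one piece."""
    return [median(column) for column in zip(*segment_vectors)]


def piece_mean(segment_vectors):
    """Component-wise mean, shown only for comparison."""
    return [mean(column) for column in zip(*segment_vectors)]


# Five toy segments; the last one is an outlier in the first component.
segments = [
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [2.0, 11.0],
    [50.0, 10.0],
]
med = piece_median(segments)
avg = piece_mean(segments)
```

In this toy case the median stays at the value shared by most segments, whereas the single outlier pulls the first component of the mean far away from it.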
The rhythm patterns of all 6-second sequences extracted from
and from as well as their medians are
depicted in Figure 6. The vertical axis represents the critical-bands
from 1-20, the horizontal axis the modulation frequencies
from 0-10Hz, where 1 and 0Hz is located in the lower left
corner. Generally, the patterns of one piece of music have common
properties. While is characterized by a rather horizon-
tal shape with low values, has a characteristic
vertical line around 7Hz. To capture these common characteristics
within a piece of music the median is a suitable approach. The
median of indicates that there are common but weak ac-
tivities in the range of 3-10 with a modulation frequency of up
to 5Hz. The single sequences of have many more details,
for example, the first sequence has a minor peak around 5 and
5Hz modulation frequency. However, the main characteristics, e.g.
the vertical line at 7Hz for , as well as the generic
activity in the frequency bands are preserved.
() Furthermore, the 1200-dimensional feature space may be compressed
using Principal Component Analysis (PCA). Our experiments
have shown that a reduction down to 80 dimensions may be
performed without much loss in variance. Yet, for the experiments
presented in this paper we use the uncompressed feature space.
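Such a compression could look roughly as follows. This sketch uses NumPy and synthetic data, since the actual feature vectors are not reproduced here; the low-rank structure of the synthetic data is an assumption that merely makes the example behave sensibly, and the function name is hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its first principal components.

    Returns the reduced data and the fraction of variance retained.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center every feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # variance ratio per component
    Z = Xc @ Vt[:n_components].T                 # coordinates in the new basis
    return Z, float(explained[:n_components].sum())


# Synthetic stand-in: 359 vectors with 1200 features and low-rank structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(359, 20))
mixing = rng.normal(size=(20, 1200))
X = latent @ mixing + 0.01 * rng.normal(size=(359, 1200))

Z, retained = pca_reduce(X, n_components=80)
```

The returned fraction gives a direct way to check how much variance a chosen number of dimensions preserves on a given collection.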
() Following these optional steps, a GHSOM may be trained to
obtain a hierarchical map interface to the music archive. Apart from
obtaining hierarchical representations, the GHSOM may also be
applied to obtain flat maps similar to conventional SOMs, or grow
linear tree structures.
( ) The resulting maps offer themselves as interfaces
to explore a music archive. Yet advanced cluster visualization tech-
niques based on the , such as the [34], may be used
to assist in cluster identification. A specifically appealing visual-
ization based on [25] are the
, which use the metaphor of geographical maps,
where islands resemble styles of music, to provide an intuitive in-
terface to music archives. Furthermore, attribute aggregates are
used to create that help the user to understand the
sound characteristics of the various areas on the map. For a detailed
discussion and evaluation of these visualizations, see [24].
system based on a music archive made up of MP3-compressed files
of popular pieces of music from a variety of genres. Specifically,
we present in more detail the organization of a small subset of
the entire archive, consisting of 77 pieces of music, with a total
playing time of about 5 hours, using the GHSOM. This subset,
due to its limited size, lends itself to detailed discussion. We
furthermore present results using a larger collection of 359 pieces
of music, with a total playing length of about 23 hours. In both
cases, each piece is represented by 1200 features which describe
the dynamics of the loudness in frequency bands. The experiments,
including audio samples, are available for interactive exploration
at the project homepage.
5.1 A GHSOM of 77 pieces of music
Figure 7 depicts a GHSOM trained on the music data. On the first
level the training process has resulted in the creation of a 3 × 3
map, organizing the collection into 9 major styles of music. The
bottom right represents mainly classical music, while the upper left
mainly represents a mixture of Hip Hop, Electro, and House by
. The upper-right, center-right, and upper-
center represent mainly disco music such as by
, by , or
by . Please note that the organization does not
follow clean “conceptual” genre styles, splitting by definition, e.g.
and , but rather reflects the overall sound similarity.
Seven of these 9 first-level categories are further refined on the sec-
ond level. For example, the bottom right unit representing classical
music is divided into 4 further sub-categories. Of these 4 cate-
gories the lower-right represents slow and peaceful music, mainly
piano pieces such as and
by , or by
. The upper-right represents, for example,
pieces by (vm), which, in this case, are more dy-
namic interpretations of classical pieces played on the violin. In the
upper-left orchestral music is located, such as the end credits
of the film and the slow love song
by , exhibiting a more intensive
sound sensation, whereas the lower right corner unit represents the
by .
Generally speaking, we find the softer, more peaceful songs on this
second level map located in the lower half of the map, whereas the
more dynamic, intensive songs are located in the upper half. This
corresponds to the general organization of the map in the first layer,
where the unit representing Classic music is located in the lower right
corner, having more aggressive music as its upper and left neighbors.
This allows us, even on lower-level maps, to move across map
boundaries to find similar music on the neighboring map following
the same general trends of organization, thus alleviating the common
problem of cluster separation in hierarchical organizations.
Some interesting insights into the music collection which the
GHSOM reveals are, for example, that the song by
(center-left) is quite different than the other songs by
the same group. was the group’s biggest hit so far and,
unlike their other songs, has been appreciated by a broader audience.
Generally, the pieces of one group have similar sound characteristics
and thus are located within the same categories. This applies, for
example, to the songs of and ,
which are located in the center of the 9 first-level categories together
with other aggressive rock songs. However, another exception is
by , located in the lower-
left. Listening to this piece reveals that it is much slower than the
other pieces of the group, and that this song matches very well to,
for example, by .
5.2 A GHSOM of 359 pieces of music
In this section we present results from using the system
to structure a larger collection of 359 pieces of music. Due to space
constraints we cannot display or discuss the full hierarchy in detail.
We will thus pick a few examples to show the characteristics of the
resulting hierarchy, inviting the reader to explore and evaluate the
complete hierarchy via the project homepage.
The resulting GHSOM has grown to a size of units on the
top layer map. All 8 top-layer units were expanded onto a second
layer in the hierarchy, from which 25 units out of 64 units total
on this layer were further expanded into a third layer. None of
the branches required expansion into a fourth layer at the required
level-of-detail setting. An integrated view of the two top-layers of
the map is depicted in Figure 8. We will now take a closer look at
some branches of this map, and compare them to the respective areas
in the GHSOM of the smaller data collection depicted in Figure 7.
Generally, we find pieces of soft classical music in the upper right
corner, with the music becoming gradually more dynamic and ag-
gressive as we move towards the bottom left corner of the map.
Due to the characteristics of the training process of the GHSOM
we can find the same general tendency at the respective lower-layer
maps. The overall orientation of the map hierarchy is rotated when
compared to the smaller GHSOM, where the classical titles were
located in the bottom right corner, with the more aggressive titles
placed on the upper left area of the map. This rotation is due to
the unsupervised nature of the GHSOM training process. It can,
however, be avoided by using specific initialization techniques if a
specific orientation of the map is required.
The unit in the upper right corner of the top-layer map, representing
the softest classical pieces of music, is expanded onto a map
in the second layer (expanded to the upper right in Figure 8). Here
we again find the softest, most peaceful pieces in the upper right
corner, namely part of the sound-track of the movie ,
next to by ,
by , and by . Below this unit we find
further soft titles, yet somewhat more dynamic. We basically find
all titles that were mapped together in the bottom right corner unit
of the GHSOM of the smaller collection depicted in Figure 7 on
this unit, i.e. and the . Furthermore,
a few additional titles of the larger collection have been mapped
onto this unit, the most famous of which probably are
by , the by or the
from the Clarinet Concerto by Mozart.
Let us now take a look at the titles that were mapped onto the
The , located on the neighboring unit
to the right in the first example, can be found in the lower left corner
of this map, together with, for example,
by . Mapped onto the upper neighboring unit in the
smaller GHSOM we had titles like the
by , or the
by . We find these two titles in the upper left corner of the
2-layer map of this GHSOM, together with two of the three titles
mapped onto the diagonally neighboring unit in the first GHSOM,
i.e. by , and by
, which are again soft, mellow, but a bit more dynamic. The
third title mapped onto this unit in the smaller GHSOM, i.e. the
sound track of the movie is not mapped
into this branch of this GHSOM anymore. When we listen to this
title we find it to have mainly strong orchestral parts, which have a
different, more intense sound than the soft pieces mapped onto this
as more of them are available in the larger data collection. Instead,
we can find this title on the upper right corner in the neighboring
branch to the left, originating from the upper left corner unit of the
top-layer map. There it is mapped together with
and other orchestral pieces, such as by
. We thus find this branch of the GHSOM to be more or
less identical to the overall organization of the smaller GHSOM in
so far as the titles present in both collections are mapped in similar
relative positions to each other.
Due to the topology preservation provided by the GHSOM we can
move from the soft classical cluster map to the left to find somewhat
more dynamic classical pieces of music on the neighboring map
(expanded to the left in Figure 8). Thus, a typical disadvantage of
hierarchical clustering and structuring of datasets, namely the fact
that a cluster that might be considered conceptually very similar is
subdivided into two distinct branches, is alleviated in the GHSOM
concept, because these data points are typically located in the close
neighborhood. We thus find, on the right border of the neighboring
map, the more peaceful titles of this branch, yet more dynamic than
the classical pieces on the neighboring right branch discussed above.
Rather than continuing to discuss the individual units we shall now
take a look at the titles of a specific artist and their distribution in
this hierarchy. In total, there are 7 titles by Vanessa Mae in
this collection, all violin interpretations, yet of distinctly different
style. Her most “conventional” classical interpretations, such as
Brahms’ or Bach’s
are located in the classic-
cluster in the upper right corner branch on two neighboring units
on the left side of the second-layer map. These are definitely the
most “classical” of her interpretations in the given collection, yet
exhibiting strong dynamics. Further 3 pieces of Vanessa Mae (
by Vivaldi, in its symphonic version, and
) are found in the neighboring branch to the
left, the former two mapped together with by
. All of these titles are very dynamic violin pieces with
strong orchestral parts and percussion.
When we look for the remaining 2 titles by Vanessa Mae, we find
them on the unit expanded below the top right corner unit, thus
also neighboring the classical cluster. On the top-left corner unit
of this sub-map we find , which starts in a classical,
symphonic version, and gradually has more intensive percussion
being added, exhibiting a quite intense beat. Also on this map, on
by Bach, this time in the classical
interpretation of Vanessa Mae, also with a very intense beat. The
more “conventional” organ interpretation of this title, as we have
seen, is located in the classic cluster discussed before. Although
both are the same titles, the interpretations are very different in their
sound characteristic, with Vanessa Mae’s interpretation definitely
being more pop-like than the typical classical interpretation of this
title. Thus, two identical titles, yet played in different styles, end
up in their respective stylistic branches of the system.
We furthermore find that the system does not organize all titles
by a single artist into the same branch, but actually assigns them
according to their sound characteristics, which makes it particularly
suitable for localizing pieces according to one’s liking independent
of the typical assignment of an artist to any category, or of the
conventional assignment of titles to specific genres.
In spite of these desired characteristics, however, several weaknesses
remain, especially when titles that may be very similar in terms of
their beat characteristics in the various frequency bands are mapped
together, yet derive from very different genres and are immediately
associated with those genres. This refers, for example, to titles where
the language is a specific characteristic, such as several German-language
songs in our collection. Furthermore, in some cases like
the previously-mentioned by ,
which is mapped together with titles by , the rhyth-
mic properties might be similar, yet the perceived sound is still
distinctively different because of the strong vocal parts. Even if the
acoustic background shares some similarities over long distances
of the title, the rhythmic vocal parts are perceived much stronger.
This points towards the necessity to incorporate additional features
to better capture sound characteristics. Furthermore, in some cases
like these it might be advisable to use the two-stage clustering ap-
proach outlined in [26], as for some titles the variance of sound
characteristics of segments is rather large. When taking a look at
the mapping of the respective segments of in an-
other experiment we find 3 segments of it to be located in a more
classical sub-branch, whereas the other segments are located in the
more dynamic, aggressive branches of the hierarchy.
A further unit depicted in more detail in Figure 8 is the bottom right
unit representing the more aggressive, dynamic titles. We leave it to
the reader to analyze this sub-map and compare the titles with the
ones mapped onto the upper left corner map in Figure 7.
We have presented the , a
system for content-based organization and visualization of music
archives. Given pieces of music in raw audio format a hierarchical
organization is created where music of similar sound characteristics
is mapped together. Our system thus enables a user to browse
through the archive, searching for music representing a particular
style, without relying on manual genre classification.
Rhythm patterns in various frequency bands are extracted and used
as a descriptor of perceived sound similarity, incorporating psychoacoustic
models during the feature extraction stage. The GHSOM
automatically identifies the inherent structure of the music collection
and offers an intuitive interface for genre browsing. Furthermore,
by mapping a piece of music representing a “query” onto the map
structure, the user is pointed to a location within the map hierarchy,
where he or she will find similar pieces of music. We evaluated our
approach using a collection of about 23 hours of music and obtained
encouraging results. Future work will mainly deal with improving
the feature extraction process. While the presented features offer
a simple but powerful way of describing the music, additional in-
formation is required to better capture sound characteristics that go
beyond frequency-specific beat patterns, focusing e.g. on the tim-
bre and instrumentation. Furthermore, more abstract features are
necessary to explain the organization principles to the user.
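The query mapping mentioned above can be pictured as a descent through the map hierarchy: on each map the unit closest to the query vector is selected, and if that unit has been expanded the search continues on its child map. The nested-dictionary layout, the function names and the toy weight vectors below are assumptions made purely for illustration.

```python
import math


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def locate(query, root):
    """Path of best-matching units from the top-layer map downwards.

    Each map is a dict with 'weights' (one vector per unit) and
    'children' (unit index -> child map for expanded units).
    """
    path = []
    node = root
    while node is not None:
        bmu = best_matching_unit(node["weights"], query)
        path.append(bmu)
        node = node["children"].get(bmu)  # None if the unit was not expanded
    return path


# Toy hierarchy: a two-unit top map whose second unit is expanded.
hierarchy = {
    "weights": [[0.0, 0.0], [10.0, 10.0]],
    "children": {
        1: {"weights": [[9.0, 9.0], [12.0, 12.0]], "children": {}},
    },
}
path = locate([11.5, 12.0], hierarchy)
```

The returned path identifies the unit on the lowest reachable map, where the pieces most similar to the query under this distance measure would be found.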
While the current evaluation allows for an intuitive analysis of the
system’s performance, a more formal evaluation is desired. We thus
plan to perform a user study allowing us to evaluate both users’
expectations towards such a system as well as to obtain feedback on
the perceived quality of the current approach.
Part of this research has been carried out in the project Y99-INF,
sponsored by the Austrian Federal Ministry of Education, Science
and Culture (BMBWK) in the form of a START Research Prize. The
BMBWK also provides financial support to the Austrian Research
Institute for Artificial Intelligence. The authors wish to thank Simon
Dixon, Markus Frühwirth, and Werner Göbel for valuable discus-
sions and contributions.
[1] D. Bainbridge, C. Nevill-Manning, H. Witten, L. Smith, and
R. McNab. Towards a digital library of popular music. In ,
pages 161–169, Berkeley, CA, August 11-14 1999. ACM.
[2] W. Birmingham, R. Dannenberg, G. Wakefield, M. Bartsch,
D. Bykowski, D. Mazzoni, C. Meek, M. Mellody, and W. Rand.
MUSART: Music retrieval via aural queries. In
, Bloomington, IN, October 15-17 2001.
[3] R. Bladon. Modeling the judgement of vowel quality dif-
ferences. ,
69:1414–1422, 1981.
[4] J. Daniels and E. Rissland. Finding legally relevant passages
in case opinions. ,
pages 39–46, 1997.
[5] G. DeBoeck and T. Kohonen, editors.
. Springer Verlag, Berlin, Germany, 1998.
[6] M. Dittenbach, D. Merkl, and A. Rauber. The growing hierar-
chical self-organizing map. In , pages 15 – 19, Como,
Italy, July 24-27 2000. IEEE Computer Society.
[7] M. Dittenbach, A. Rauber, and D. Merkl. Recent advances
with the growing hierarchical self-organizing map. In
, Advances in
Self-Organizing Maps, pages 140–145, Lincoln, England, June
13-15 2001. Springer.
[8] B. Feiten and S. Günzel. Automatic indexing of a sound
database using self-organizing neural nets.
, 18(3):53–65, 1994.
[9] J. Foote. An overview of audio information retrieval.
, 7(1):2–10, 1999.
[10] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by
humming: Musical information retrieval in an audio database.
In , pages 231–
236, San Francisco, CA, November 5 - 9 1995. ACM.
[11] S. Kaski. Fast winner search for SOM-based monitoring and
retrieval of high-dimensional data. In , pages 940–
945. IEE, September 7-10 1999.
[12] T. Kohonen. Self-organized formation of topologically correct
feature maps. , 43:59–69, 1982.
[13] T. Kohonen. . Springer-Verlag, Berlin,
[14] T. Kohonen. Self-organization of very large document collec-
tions: State of the art. In
, pages 65–74, Skövde, Sweden, 1998.
[15] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela,
V. Paatero, and A. Saarela. Self-organization of a massive document
collection. ,
11(3):574–585, May 2000.
[16] N. Kosugi, Y. Nishihara, T. Sakata, M. Yamamuro, and
K. Kushima. A practical query-by-humming system for a large
music database. In
, pages 333–342, Marina del Rey, CA, 2000. ACM.
[17] C. Liu and P. Tsai. Content-based retrieval of mp3 music
objects. In , pages 506 – 511,
Atlanta, Georgia, 2001. ACM.
[18] M. Liu and C. Wan. A study of content-based classification and
retrieval of audio database. In ,
Grenoble, France, 2001. IEEE.
[19] D. Merkl and A. Rauber. Document classification with unsu-
pervised neural networks. In F. Crestani and G. Pasi, editors,
, pages 102–121.
Physica Verlag, 2000.
[20] K. Oh, Y. Feng, K. Kaneko, A. Makinouchi, and S. Bae. SOM-
based R*-tree for similarity retrieval. In ,
pages 182–189, Hong-Kong, China, April 18-21 2001. IEEE.
[21] K. Oh, K. Kaneko, and A. Makinouchi. Image classifica-
tion and retrieval based on wavelet-som. In
, pages 164–167, Kyoto, Japan, Novem-
ber 28-30 1999. IEEE.
[22] F. Pachet and D. Cazaly. A taxonomy of musical genres. In
, Paris, France, 2000.
[23] E. Pampalk . Master’s thesis, Vi-
enna University of Technology, 2001.
[24] E. Pampalk, A. Rauber, and D. Merkl. Content-based organi-
zation and visualization of music archives. In
, Juan-les-Pins, France, December 1-6 2002.
[25] E. Pampalk, A. Rauber, and D. Merkl. Using smoothed data
histograms for cluster visualization in self-organizing maps.
In , Madrid, Spain, August 27-30 2002. Springer.
[26] A. Rauber and M. Frühwirth. Automatically analyzing and
organizing music archives. In
, Darmstadt, Germany, Sept. 4-8 2001.
[27] B. Ripley. .
Cambridge University Press, Cambridge, UK, 1996.
[28] J. Rolland, G. Raskinis, and J. Ganascia. Musical content-
based retrieval: An overview of the Melodiscov approach and
system. In ,
pages 81–84, Orlando, FL, 1999. ACM.
[29] D. Roussinov and H. Chen. Information navigation on the web
by clustering and summarizing query results.
, 37:789 – 816, 2001.
[30] E. Scheirer. . PhD thesis, MIT Me-
dia Laboratory, 2000.
[31] M. Schröder, B. Atal, and J. Hall. Optimizing digital speech
coders by exploiting masking properties of the human ear.
, 66:1647–
1652, 1979.
[32] O. Simula, P. Vasara, J. Vesanto, and R. Helminen. The self-
organizing map in industry analysis. In L. Jain and V. Ve-
muri, editors, ,
Washington, DC., 1999. CRC Press.
[33] G. Tzanetakis, G. Essl, and P. Cook. Automatic musical genre
classification of audio signals. In , Bloomington, In-
diana, October 15-17 2001.
[34] A. Ultsch and H. Siemon. Kohonen’s self-organizing feature
maps for exploratory data analysis. In
, pages 305–308, Dordrecht,
Netherlands, 1990. Kluwer.
[35] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based
classification search and retrieval of audio.
, 3(3):27–36, Fall 1996.
[36] H. Zhang and D. Zhong. A scheme for visual feature based
image indexing. In , pages
36–46, San Jose, CA, February 4-10 1995.
[37] E. Zwicker and H. Fastl.
, volume 22 of . Springer,
Berlin, 2nd edition, 1999.
... There have been a vast amount of proposals since music classification is one of the most popular MIR tasks. The most common are using hidden Markov models (Shao et al. [2004]), self-organizing maps (Rauber et al. [2002]), k-nearest neighbor (Tzanetakis and Cook [2002]), support vector machines (Xu et al. [2003]), or neural networks of different kinds (Soltau et al. [1998]). The state-of-the-art techniques include compressive sampling (Chang et al. [2010]), or low-rank semantic mapping (Panagakis and Kotropoulos [2013]). ...
Full-text available
Analysis of digital music and its retrieval based on the audio features is one of the popular topics within the music information retrieval (MIR) field. Every musical piece has its characteristic harmony structure, but harmony analysis is seldom used for retrieval. Retrieval systems that do not focus on similarities in harmony progressions may consider two versions of the same song different, even though they differ only in instrumentation or a singing voice. This thesis takes various paths in exploring, how music harmony can be used in MIR, and in particular, the cover song identification (CSI) task. We first create a music harmony model based on the knowledge of music theory. We define novel concepts: a harmonic complexity of a musical piece, as well as the chord and chroma distance features. We show how these concepts can be used for retrieval, complexity analysis, and how they compare with the state-of-the-art of music harmony modeling. An extensive comparison of harmony features is then performed, using both the novel features and the traditional MIR features. Based on this comparison, the best features are proposed for the final experiments of the CSI task, with a result of 88.9% retrieval accuracy for a dataset of 2,000 songs using chroma features. The two methods used in our experiments are dynamic time warping and machine learning, for both feature comparison and our experimental results. To facilitate our research, a stand-alone application harmony-analyser was created and published online. Capable of music processing of WAV audio files, this application is proposed to the MIR community for feature extraction and harmony analysis. We have also published a dataset of karaoke songs Kara1k, which has been used for our experiments and contains a unique selection of features and annotations for future work.
... There have been a great amount of proposals, since music classification is one of the recent challenges. The most common are using hidden Markov models [18], self-organizing maps [14], knearest neighbor [20], support vector machines [22], or neural networks [19] of different kinds. The state-of-theart techniques include compressive sampling [3], or lowrank semantic mapping [13]. ...
Conference Paper
Full-text available
Publicly available multimedia systems provide users with plenty of music files offered in different genres. These systems should process the music as fast as possible while satisfying the needs of their users as well. In this context, reliable music classification represents one of the major challenges. Classification systems without deeper knowledge of music structure and composition yield to considerable errors. In some cases, music can not be classified clearly due to an overlap in genres. However, in other cases, we can clarify the classification simply by using the approach of a skilled musician. In this paper, we develop a new approach to automatic music classification inspired by the theory of neural networks, enhanced by deeper knowledge of tonal harmony. Based on a new measure derived from harmonic movements, harmonic complexity , our supporting experiments proved a significant improvement in classification accuracy.
... From 2002, when Tzanetakis and Cook [1] introduced music genre classification as a pattern recognition task, many other works has been developed for this purpose [2], [3], [4], [5], [6], [7]. According to Lidy et al. [8], most of the works rely on the content-based approach, which extracts representative features from the digital audio signal. ...
... As defined by Mitrović et al. [17], this feature is a two-dimensional representation of acoustic versus modulation frequency that is built upon a specific loudness sensation, and it is obtained by Fourier analysis of the critical bands over time and incorporating a weighting stage that is inspired by the human auditory system. This feature has shown to be useful in music similarity retrieval (Pampalk et al. [162], Rauber et al. [163]). ...
Full-text available
Endowing machines with sensing capabilities similar to those of humans is a prevalent quest in engineering and computer science. In the pursuit of making computers sense their surroundings, a huge effort has been conducted to allow machines and computers to acquire, process, analyze and understand their environment in a human-like way. Focusing on the sense of hearing, the ability of computers to sense their acoustic environment as humans do goes by the name of machine hearing. To achieve this ambitious aim, the representation of the audio signal is of paramount importance. In this paper, we present an up-to-date review of the most relevant audio feature extraction techniques developed to analyze the most usual audio signals: speech, music and environmental sounds. Besides revisiting classic approaches for completeness, we include the latest advances in the field based on new domains of analysis together with novel bio-inspired proposals. These approaches are described following a taxonomy that organizes them according to their physical or perceptual basis, being subsequently divided depending on the domain of computation (time, frequency, wavelet, image-based, cepstral, or other domains). The description of the approaches is accompanied with recent examples of their application to machine hearing related problems.
... The system introduced consisted of 2-dimensional SOM representation that could be generated for any music set. More complex variation involved Growing Hierarchical Self-Organizing Maps (GHSOM) with a 3-layer architecture (Rauber et al., 2002b). GHSOM was fed with 1200 psychoacoustic loudness and rhythm descriptors. ...
Full-text available
Due to an increasing amount of music being made available in digital form in the Internet, an automatic organization of music is sought. The paper presents an approach to graphical representation of mood of songs based on Self-Organizing Maps. Parameters describing mood of music are proposed and calculated and then analyzed employing correlation with mood dimensions based on the Multidimensional Scaling. A map is created in which music excerpts with similar mood are organized next to each other on the two-dimensional display.
... approximates the specific loudness sensation per critical band of the human auditory system [Pampalk et al. 2002]. A Bark-scaled spectrogram is firstly computed and then spectral masking and equal-loudness contours are applied. ...
Digitalized music production exploded in the past decade. Huge amount of data drives the development of effective and efficient methods for automatic music analysis and retrieval. This thesis focuses on performing semantic analysis of music, in particular mood and genre classification, with low level and mid level features since the mood and genre are among the most natural semantic concepts expressed by music perceivable by audiences. In order to delve semantics from low level features, feature modeling techniques like K-means and GMM based BoW and Gaussian super vector have to be applied. In this big data era, the time and accuracy efficiency becomes a main issue in the low level feature modeling. Our first contribution thus focuses on accelerating k-means, GMM and UBM-MAP frameworks, involving the acceleration on single machine and on cluster of workstations. To achieve the maximum speed on single machine, we show that dictionary learning procedures can elegantly be rewritten in matrix format that can be accelerated efficiently by high performance parallel computational infrastructures like multi-core CPU, GPU. In particular with GPU support and careful tuning, we have achieved two magnitudes speed up compared with single thread implementation. Regarding data set which cannot fit into the memory of individual computer, we show that the k-means and GMM training procedures can be divided into map-reduce pattern which can be executed on Hadoop and Spark cluster. Our matrix format version executes 5 to 10 times faster on Hadoop and Spark clusters than the state-of-the-art libraries. Beside signal level features, mid-level features like harmony of music, the most natural semantic given by the composer, are also important since it contains higher level of abstraction of meaning beyond physical oscillation. Our second contribution thus focuses on recovering note information from music signal with musical knowledge. 
This contribution relies on two levels of musical knowledge: instrument note sound and note co-occurrence/transition statistics. At the instrument note sound level, a note dictionary is first built from Logic Pro 9. With the musical dictionary in hand, we propose a positive constraint matching pursuit (PCMP) algorithm to perform the decomposition. At the inter-note level, we propose a two-stage sparse decomposition approach integrated with note statistical information. In the frame-level decomposition stage, note co-occurrence probabilities are embedded to guide atom selection and to build a sparse multiple candidate graph providing backup choices for later selections. In the global optimal path searching stage, note transition probabilities are incorporated. Experiments on multiple data sets show that our proposed approaches outperform the state of the art in terms of accuracy and recall for note recovery and music mood/genre classification.
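The abstract above mentions rewriting dictionary learning procedures such as k-means in matrix format. A minimal sketch of what one such iteration could look like is given below, assuming NumPy; the function name `kmeans_step` and the handling of empty clusters are illustrative choices, not the thesis's actual implementation.

```python
import numpy as np


def kmeans_step(X, C):
    """One k-means iteration expressed with matrix operations.

    X: (n, d) float array of samples; C: (k, d) float array of centroids.
    Returns the cluster label of each sample and the updated centroids.
    """
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2; the cross term is
    # a single matrix product, which is what parallel hardware speeds up.
    d2 = (X ** 2).sum(axis=1, keepdims=True) - 2.0 * (X @ C.T) + (C ** 2).sum(axis=1)
    labels = d2.argmin(axis=1)

    # One-hot assignment matrix A (n, k): A.T @ X sums the samples per cluster.
    A = np.zeros((X.shape[0], C.shape[0]))
    A[np.arange(X.shape[0]), labels] = 1.0
    counts = A.sum(axis=0)

    # Empty clusters keep their previous centroid (an assumption of this sketch).
    new_C = C.astype(float).copy()
    nonempty = counts > 0
    new_C[nonempty] = (A.T @ X)[nonempty] / counts[nonempty][:, None]
    return labels, new_C
```

Repeating `kmeans_step` until the labels stop changing gives the usual Lloyd iteration; the same sums per cluster are what a map-reduce version would accumulate across workers.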
... From each of the frames a feature vector v is extracted. We tried various types of features: a simple spectrogram, a timbre-related feature set (Mel Frequency Cepstrum Coefficients), two rhythm-based feature sets (Rhythm Patterns [RPM02], Statistical Spectrum Descriptors [LR05]) and harmony-related features ...
The number of music pieces is growing rapidly with the advancement of the internet, but only a few studies have focused on popular music and its social contexts. Musical patterns can enhance our knowledge of popular music markets. Therefore, this study aims to investigate the prevalent styles of Chinese popular music from 2001 to 2017 and to explore the trends. A convolutional neural network was used to analyze four music features (timbre, rhythm, pitch, and mode) extracted from the mel-spectrogram and chromagram. Results indicated that (i) timbre and rhythm are critical among the four music features, and (ii) compared with expression styles far from realism, music related to audiences' behaviors and lifestyles, such as lyricism and implicit expression, is prevalent. The proposed method accelerates the understanding of the evolution and prevalence of Chinese popular music. This study might also contribute to future music markets, since musicians and marketers could comprehend trends efficiently and be inspired by it.
Music is spatial in many ways. Musical concepts and music perception are described by spatial terms in many cultures. This spatial thinking is reflected in music from spatial compositions to stereophonic recording and mixing techniques. Consequently, traditional music theories as well as modern music information retrieval approaches leverage spatial concepts and operations to gain a deeper understanding of music. This chapter reviews concepts of spaciousness in music psychology, provides the state of the art in spatial music composition and mixing in the recording studio, and gives an overview of spaciousness in music theory and music information retrieval. The prominence of spatial concepts in all these theoretical and practical disciplines underlines the significance of space in music. This deep relationship becomes obvious when music is considered as a creative art, an acoustical signal, and a psychological phenomenon.
Implementation of a Music Mood Player Applied at a Daycare Using the Self-Organizing Map Method. Abstract. Music is an art, an entertainment and a human activity that involves organized sounds. Music is closely related to human psychology. A piece of music is often associated with certain adjectives such as happy, sad or romantic. The link between music and a certain mood has been widely used by people on various occasions; therefore, music classification based on its relevance to a particular emotion is important. A daycare is one example of an institution that uses music as therapy or as a supporting tool in its childcare activities. This research concerns the implementation of a music mood player using a Self-Organizing Map applied at a daycare. The features used in this music mood player are the rhythm patterns of the music. The mood parameters used in this system are based on Robert Thayer's energy-stress model, which consists of exuberance (happy), contentment (relaxed), anxious and depression. The system is tested using a set of songs of various genres, and the classification results are compared with the moods obtained from a child psychology expert. The mood of the songs played by the system can be set automatically according to the activities at the daycare. Keywords: Music Information Retrieval, Mood Classification, Music Classification, Self Organizing Map, Rhythm Patterns.
Conference Paper
Mixed reality (MR) systems which integrate the virtual world and the real world have become a major topic in the research area of multimedia. As a practical application of these MR systems, we propose an efficient method for making a 3D map from real-world ...
Conference Paper
With Islands of Music we present a system which facilitates exploration of music libraries without requiring manual genre classification. Given pieces of music in raw audio format we estimate their perceived sound similarities based on psychoacoustic models. Subsequently, the pieces are organized on a 2-dimensional map so that similar pieces are located close to each other. A visualization using a metaphor of geographic maps provides an intuitive interface where islands resemble genres or styles of music. We demonstrate the approach using a collection of 359 pieces of music.
Conference Paper
This paper presents a hybrid case-based reasoning (CBR) and information retrieval (IR) system, called SPIRE, that locates passages likely to contain information about legally relevant features of cases found in full-text court opinions. SPIRE uses an example base of excerpts from past opinions to form queries, which are run by the INQUERY IR text retrieval engine on individual case opinions. These opinions can be those found by SPIRE in a prior stage of processing, which also employs a hybrid CBR-IR approach to retrieve relevant texts from large document corpora. (This aspect of SPIRE was reported on at ICAIL95.) We present an overview of SPIRE, run through an extended example, and give results comparing SPIRE's with human performance. 1 Introduction There is an enormous amount of legal text available on-line and it is growing every day. While this is a decided benefit for legal research, it also presents a problem of how to search it effectively. In particular, it is no easy task to...
In any speech coding system that adds noise to the speech signal, the primary goal should not be to reduce the noise power as much as possible, but to make the noise inaudible or to minimize its subjective loudness. "Hiding" the noise under the signal spectrum is feasible because of human auditory masking: sounds whose spectrum falls near the masking threshold of another sound are either completely masked by the other sound or reduced in loudness. In speech coding applications, the "other sound" is, of course, the speech signal itself. In this paper we report new results of masking and loudness reduction of noise and describe the design principles of speech coding systems exploiting auditory masking.
This work contains a theoretical study and computer simulations of a new self-organizing process. The principal discovery is that in a simple network of adaptive physical elements which receives signals from a primary event space, the signal representations are automatically mapped onto a set of output responses in such a way that the responses acquire the same topological order as that of the primary events. In other words, a principle has been discovered which facilitates the automatic formation of topologically correct maps of features of observable events. The basic self-organizing system is a one- or two-dimensional array of processing units resembling a network of threshold-logic units, and characterized by short-range lateral feedback between neighbouring units. Several types of computer simulations are used to demonstrate the ordering process as well as the conditions under which it fails.
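To make the idea of a topology-preserving map more concrete, here is a minimal sketch of the self-organizing map in its commonly used algorithmic form: find the best-matching unit, then pull it and its grid neighbours toward the input. This is an assumption-laden simplification, since the abstract above describes a network with lateral feedback; the Gaussian neighbourhood, the linear decay schedules and all names and parameters below are illustrative choices.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # One weight vector per grid position, initialised uniformly in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` for each item gives its position on the map, so that items with similar feature vectors tend to land on the same or neighbouring units.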
We report our experience with a novel approach to interactive information seeking that is grounded in the idea of summarizing query results through automated document clustering. We went through a complete system development and evaluation cycle: designing the algorithms and interface for our prototype, implementing them and testing with human users. Our prototype acted as an intermediate layer between the user and a commercial Internet search engine (AltaVista), thus allowing searches of a significant portion of the World Wide Web. In our final evaluation, we processed data from 36 users and concluded that our prototype improved search performance over using the same search engine (AltaVista) directly. We also analyzed the effects of various demographic and task-related parameters.
Conference Paper
In recent years, searching and indexing techniques for multimedia data have been receiving more attention in the area of multimedia databases. While much research has been done on the content-based retrieval of image and video data, less attention has been paid to the content-based retrieval of audio data. In this paper, we propose an approach to retrieve MP3 music objects based on their content. In our approach, the coefficients extracted from the output of the polyphase filters are used to compute the MP3 features for indexing the MP3 objects. We also propose an MP3 similarity measuring function that gives users the ability to approximately retrieve the desired MP3 objects. Experiments are performed and analyzed to show the efficiency and the effectiveness of the proposed method.
Conference Paper
A music retrieval system that accepts hummed tunes as queries is described in this paper. This system uses similarity retrieval because a hummed tune may contain errors. The retrieval result is a list of song names ranked according to the closeness of the match. Our ultimate goal is that the correct song should be first on the list. This means that eventually our system's similarity retrieval should allow for only one correct answer. The most significant improvement our system has over general query-by-humming systems is that all processing of musical information is done based on beats instead of notes. This type of query processing is robust against queries generated from erroneous input. In addition, acoustic information is transcribed and converted into relative intervals and is used for making feature vectors. This increases the resolution of the retrieval system compared with other general systems, which use only pitch direction information. The database currently holds over 10,000 songs, and the retrieval time is at most one second. This level of performance is mainly achieved through the use of indices for retrieval. In this paper, we also report on the results of music analyses of the songs in the database. Based on these results, new technologies for improving retrieval accuracy, such as partial feature vectors and or'ed retrieval among multiple search keys, are proposed. The effectiveness of these technologies is evaluated quantitatively, and it is found that the retrieval accuracy increases by more than 20% compared with the previous system [9]. Practical user interfaces for the system are also described.
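The difference in resolution between relative intervals and pitch direction, as contrasted in the abstract above, can be illustrated with a small sketch. It assumes the hummed tune has already been transcribed into MIDI note numbers; the function names are hypothetical, and the beat-based processing the abstract describes is not modeled.

```python
def to_intervals(pitches):
    """Relative intervals in semitones between consecutive notes."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def to_contour(pitches):
    """Pitch direction only: U (up), D (down) or S (same)."""
    return "".join("U" if d > 0 else "D" if d < 0 else "S"
                   for d in to_intervals(pitches))


melody_a = [60, 62, 64, 60]   # assumed transcription of one tune
melody_b = [60, 61, 72, 71]   # a different tune with the same up/up/down shape

# Both representations ignore the key the user hums in ...
assert to_intervals(melody_a) == to_intervals([p + 5 for p in melody_a])
# ... but only the intervals tell the two tunes apart.
assert to_contour(melody_a) == to_contour(melody_b)
assert to_intervals(melody_a) != to_intervals(melody_b)
```

Both encodings are invariant to transposition, but the interval sequence keeps the size of each step, which is why it can separate tunes that share the same contour.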
Conference Paper
We are experiencing a tremendous increase in the amount of music being made available in digital form. With the creation of large multimedia collections, however, we need to devise ways to make those collections accessible to the users. While music repositories exist today, they mostly limit access to their content to query-based retrieval of their items based on textual meta-information, with some advanced systems supporting acoustic queries. What we would like to have additionally, is a way to facilitate exploration of musical libraries. We thus need to automatically organize music according to its sound characteristics in such a way that we find similar pieces of music grouped together, allowing us to find a classical section, or a hard-rock section etc. in a music repository. In this paper we present an approach to obtain such an organization of music data based on an extension to our SOMLib digital library system for text documents. Particularly, we employ the Self-Organizing Map to create a map of a musical archive, where pieces of music with similar sound characteristics are organized next to each other on the two-dimensional map display. Locating a piece of music on the map then leaves you with related music next to it, allowing intuitive exploration of a music archive.