2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel
Musical Features Extraction for Audio-based Search
Ofir Lindenbaum, Shai Maskit, Ophir Kutiel and Gideon Nave
Signal and Image Processing Laboratory (SIPL), Department of Electrical Engineering, Technion - IIT
Technion City, 32000, Haifa, Israel
email: gidi@tx.technion.ac.il
web: sipl.technion.ac.il
Abstract—Mixing unrelated musical sessions is a new form of
music creation, driven by the increasing popularity of media-
sharing web sites such as YouTube and MySpace. The basic
questions addressed in the present paper are which musical
features are required for matching two musical pieces, how to
decide if two musical pieces are compatible and how to measure
the degree of compatibility. We present a system designed for
content-based audio search that extracts musical features and
finds audio tracks that are compatible with a given musical query
in a music database.
I. INTRODUCTION
The increasing data transfer rates and storage capabilities,
as well as the vast growth in user accessibility in recent years,
enable internet users around the globe to share their musical
creation. As a result, a new form of musical creation was born:
compositions made by combining unrelated samples of music.
ThruYOU (www.thru-you.com) by Ophir Kutiel ("Kutiman") is an online music video project mixed from samples
of unrelated amateur YouTube music videos. The project was
chosen by Time Magazine as one of the 50 best inventions of
2009 [1]. In his creation, Kutiman sampled footage posted on
YouTube by amateur musicians (drums, piano, guitar, vocals,
etc.) and mixed it together into musical jams, as illustrated in
Fig. 1 (a screenshot of a ThruYOU videoclip). While searching
for musical samples in order to create a mix, Kutiman queried
the YouTube database for videos that were indexed to have
specific musical attributes. His search results were based on
someone manually tagging the videos, rather than on digital
analysis of the video soundtracks.
Music is a complex form of information that consists of
various musical features. In order to extract, recognize, and
search by these features, a solution other than the standard
metadata/tag search is required. The goal of the current study
is to develop a method for content-based search that will
assist musicians in exploring large music databases and thus
broaden the spectrum of music with which they can work.
We present a system for musical search based on musical
features, designed for assessing music compatibility.
This paper is organized as follows. We discuss related work
in the context of music similarity measures in Section II.
Section III gives details about the musical features used in
our music similarity system, which is presented in Section
IV. Experimental results are presented in Section V, followed
by conclusions and discussion in Section VI.
Fig. 1. A screenshot of a ThruYOU video clip, illustrating the mix of
unrelated musical tracks
II. RELATED WORK
Previous methods for measuring musical similarity use
different levels of feature extraction. Low-level approaches
concentrate mostly on attributes of timbre and rhythm, such
as MFCC [2], which are easily and accurately computed but lack
the semantic audio information that is essential for matching
two pieces. For example, Dixon et al. [3] successfully
characterized music according to rhythm by adding higher-level
descriptors to a low-level feature set.
High-level representations, such as music transcription (e.g.,
MIDI) and chord extraction, hold the semantic audio information,
but lack the desired accuracy, and are therefore considered
unsolved problems. For example, Pickens et al. [4] succeeded
in identifying harmonic similarities between a polyphonic
audio query and symbolic polyphonic scores. Their approach
relied on automatic transcription, which is only partially effective
within a highly constrained subset of musical recordings (e.g.,
mono-timbral, no drums or vocals, small polyphonies). To
overcome transcription errors, the symbolic data was converted
to harmonic distributions, and the similarity measure was
computed using these distributions over the time intervals.
Mid-level representations contain semantic audio information
without the temporal resolution obtained by transcriptions;
they overcome the miscalculations inherent in high-level
representations and create a meaningful musical description
that suits our goal of matching two pieces. For example, Ellis
and Poliner [5] presented a system that attempts to identify
when different musicians perform the same underlying song,
also known as 'cover songs'. To overcome variability in
tempo, beat tracking was used to describe each piece
with one feature vector per beat. To deal with variation in
instrumentation, they suggested using 12-dimensional 'Chroma'
feature vectors that collect the spectral energy supporting each
semitone of the octave.
Zils and Pachet [6] proposed a sequence generation mechanism
called musical mosaicing, which automatically generates
sequences of sound samples from a specification of only
high-level properties of the desired sequence. The properties
specified by the user are automatically translated into
constraints on descriptors of the samples.
While we are by no means the first to use mid-level
musical features for music similarity measures, it is important
to understand the essential difference between the current
study and previous approaches in this context. In this paper,
we present a system for measuring similarity that enables,
for example, matching vocals and piano, a task that music
similarity tools are not designed for.
III. MUSICAL FEATURES EXTRACTION
In this section, we discuss the essential musical attributes
and define measures for matching them. The attributes are
extracted from WAV files, each of which is split into 10-second
segments. Throughout this paper, we regard two
musical segments as vectors x1 and x2. The distance function
associated with feature a is denoted Da(x1, x2).
A. Beats per Minute (BPM)
BPM describes the rhythm at which a tune is played.
The term 'beats' refers to a repeated musical structure, and
we focus on the number of repetitions per minute. Disk
Jockeys (DJs) often use BPM for song mixing, as two musical
pieces with similar BPM usually sound good when played
together. While BPM matching can be obtained by manual
manipulations such as precise slicing, such approaches may
lead to unwanted distortion effects. Hence, we use the native
BPM of songs for determining a distance that reflects their
similarity. BPM is extracted by detecting peaks in the
segment's autocorrelation function [15]. The distance measure
for BPM matching is defined as

    D_{tempo}(x_1, x_2) = \frac{|tempo(x_1) - tempo(x_2)|}{\max[tempo(x_1), tempo(x_2)]},    (1)

where tempo(x) ∈ [0, 200] represents the extracted BPM of x.
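As a minimal sketch, the distance of Eq. (1) reduces to a few lines; the BPM values themselves are assumed to come from an autocorrelation-based extractor such as the MIR toolbox [15], which is not reproduced here:

```python
def tempo_distance(tempo1, tempo2):
    """Normalized BPM distance (Eq. 1): 0 for identical tempi,
    approaching 1 as the tempi diverge."""
    return abs(tempo1 - tempo2) / max(tempo1, tempo2)
```

For example, 120 BPM against 90 BPM gives a distance of 30/120 = 0.25.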
B. Chromatic Scale and Chromagram
The chromatic scale is a 12-note musical scale, spaced
at equal distances on a logarithmic frequency scale starting from a base
note [7]. The chromagram, also known as the harmonic pitch
class profile, is a histogram of the notes of a given musical piece,
showing the distribution of energy along the pitch classes [8].
It corresponds to the chromatic scale, in which frequencies
are mapped onto a limited set of 12 chroma values (i.e., all
octaves are wrapped into one). A common method for computing
a chromagram is the constant Q transform (CQT) [9], which
computes a discrete spectral analysis over logarithmically spaced
bins. The CQT is defined as

    X^{cq}[k] = \sum_{n=0}^{N(k)-1} w[n,k] \cdot x[n] \cdot e^{-j 2\pi n f_k},    (2)

where the k-th frequency bin is calculated using

    f_k = 2^{k/\beta} \cdot f_{min},    (3)

where in our case the number of bins per octave β equals
12 and f_min is the lowest frequency analyzed. The CQT can
be viewed as a DFT with varying window size, and thus
varying frequency resolution. The window w[n, k] and length N(k)
are functions of the computed frequency bin k. Finally, using
X^{cq}, we compute the chromagram of x by summing all
corresponding bins from different octaves into a 12-length
vector C_x, whose b-th bin is calculated by

    C_x(b) = \sum_{m=0}^{M} |X^{cq}(b + m\beta)|,    (4)

where b ∈ [1, 12] is the chroma bin number, and M is the total
number of octaves in the constant Q spectrum. In the current
study, the chromagram is normalized so that its maximal
entry is set to 1.
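The octave folding of Eq. (4), together with the normalization step, can be sketched as follows (using 0-based chroma indexing), assuming the constant-Q magnitudes |X^{cq}[k]| have already been computed:

```python
import numpy as np

def chromagram(cqt_mag, beta=12):
    """Fold constant-Q magnitude bins into `beta` chroma bins (Eq. 4),
    then normalize so the maximal bin equals 1."""
    chroma = np.zeros(beta)
    for k, mag in enumerate(cqt_mag):
        chroma[k % beta] += mag      # bin b + m*beta contributes to chroma b
    peak = chroma.max()
    if peak > 0:
        chroma /= peak               # normalization used in the current study
    return chroma
```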
While analyzing various monophonic musical pieces, we
encountered a major difference between the chromagrams
of chromatic instruments (e.g., piano, guitar, flute), which
are characterized by high variance and a low average, and
those of non-chromatic instruments (e.g., vocals, drums),
which are characterized by low variance and a high average,
as evident in Fig. 2. This observation can be explained by the
fact that a chromatic instrument's spectrum is concentrated
in the 12 chromatic pitch notes, creating a sparser chromagram
with high variance and a low average, whereas a non-chromatic
instrument's spectral energy is spread across all 12 chromatic
bins. Using this observation, we created a user-optional filter
for limiting search results to chromatic or non-chromatic
instruments.
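The filter described above can be sketched as a simple threshold test on the chromagram's statistics; the threshold values below are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

def is_chromatic(chroma, var_threshold=0.05, mean_threshold=0.5):
    """Heuristic instrument filter: chromatic instruments yield a sparse
    chromagram (high variance, low average), non-chromatic instruments a
    flat one (low variance, high average). Thresholds are illustrative."""
    return chroma.var() > var_threshold and chroma.mean() < mean_threshold
```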
C. Cyclic Harmonic Cross-correlation
Musical instruments are often based on an approximate
harmonic oscillator (such as a string or a column of air),
oscillating at numerous frequencies simultaneously. These
frequencies are the harmonics of a basic frequency representing
the pitch note. Two notes played simultaneously with a large
number of common harmonics will sound pleasant to the
listener [10]. Specifically, notes separated by 4 or 7 semitones
(e.g., C and E, or C and G) share many common harmonics,
and are therefore harmonically compatible [11].
Fig. 2. Variance and average of 54 non-chromatic and 66 chromatic pieces.
It is evident that the distributions of variance and average are concentrated in
two different corners, indicating the nature of the playing instrument.
A musical key is a defined series of notes, and each key has unique
characteristics. For instance, it is customary to attribute to
Major keys a sense of infinity or suspense, whereas Minor
keys are attributed a sense of sadness or deep emotion [12]. The
maximum key-profile correlation (MKC) [13] is an algorithm
for finding the most prominent key in a music sample. The
MKC algorithm is based on key profiles [14] representing
typical chromagrams of common musical keys. The algorithm
computes the correlation between the chromagram of a musical
sample and all 24 common western key profiles (Major
and Minor), and the key profile that provides the maximum
correlation is taken as the most probable key of the musical
sample.
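The MKC step described above can be sketched as follows: the chromagram is correlated against all 24 rotations of the Major and Minor key profiles. The profile values below are the widely cited Krumhansl-Kessler ratings [14], quoted here as an assumption to be checked against the source:

```python
import numpy as np

# Krumhansl-Kessler key profiles (C Major / C minor), assumed values from [14].
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def mkc_key(chroma):
    """Return (tonic_shift, mode) of the rotated key profile with the
    maximum correlation to the 12-bin chromagram."""
    best = (-2.0, 0, 'major')
    for mode, profile in (('major', MAJOR), ('minor', MINOR)):
        for shift in range(12):
            r = np.corrcoef(chroma, np.roll(profile, shift))[0, 1]
            if r > best[0]:
                best = (r, shift, mode)
    return best[1], best[2]
```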
Key matching by the MKC method is commonly used in
music matching applications. However, this approach does
not take into account music played in keys that differ from
the common Major and Minor keys (e.g., Arabic or pentatonic
scales). To overcome this, our approach measures harmonic
similarity using direct cross-correlation of the pieces'
chromagrams, maintaining the original musical characteristics.
We define the cyclic harmonic cross-correlation as

    R_{1,2}(p) = \frac{E\left[C_{x_1}(l) \cdot C_{x_2}\left((l-p) \bmod 12\right)\right]}{\sqrt{var(C_{x_1}) \, var(C_{x_2})}}.    (5)
High cross-correlation values of R_{1,2}(0), R_{1,2}(4), and R_{1,2}(7)
indicate that the correlated segments share common harmonics,
and are therefore harmonically compatible. Accordingly,
the chroma distance is calculated using the maximal correlation
achieved, where the shifted versions are weighted by 0.8:

    D_c(x_1, x_2) = \frac{1}{2}\left[1 - R_{max}(x_1, x_2)\right],    (6)

where

    R_{max} = \max\left[R_{1,2}(0),\; 0.8\,R_{1,2}(4),\; 0.8\,R_{1,2}(7)\right].    (7)
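Eqs. (5)-(7) can be sketched as follows, interpreting the expectation in Eq. (5) as a mean-centered (Pearson-style) correlation over the 12 chroma bins:

```python
import numpy as np

def cyclic_corr(c1, c2, p):
    """Normalized cyclic cross-correlation of two 12-bin chromagrams
    at lag p (Eq. 5)."""
    shifted = np.roll(c2, p)                     # element l holds c2[(l - p) mod 12]
    num = np.mean((c1 - c1.mean()) * (shifted - shifted.mean()))
    return num / np.sqrt(c1.var() * c2.var())

def chroma_distance(c1, c2):
    """Eqs. (6)-(7): distance from the best of the unison/third/fifth
    lags, with the shifted lags weighted by 0.8."""
    r_max = max(cyclic_corr(c1, c2, 0),
                0.8 * cyclic_corr(c1, c2, 4),
                0.8 * cyclic_corr(c1, c2, 7))
    return 0.5 * (1.0 - r_max)
```

A segment correlated with itself yields r_max = 1 at lag 0 and hence a distance of 0.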
Fig. 5. An exemplary 2-dimensional output graph of our search system. The
query segment is located at the origin, the horizontal axis represents harmonic
distance and the vertical axis corresponds to tempo distance.
IV. CONTENT-BASED MUSICAL SEARCH SYSTEM
Based on the musical features discussed in Section III, we
have developed an audio-based musical search system. The
system’s information flow is described in Fig. 4, and consists
of three main stages.
Initially, all audio tracks of the musical database are loaded
into the system. Each track is split into 10-second segments,
from each of which the feature vector (BPM and chroma) is
extracted using the MIR Matlab toolbox [15].
A musical search is conducted by loading a query audio
segment and specifying the desired weight of each musical
feature. After calculating the feature-specific distances between
the query segment and all database instances ("matching"), the
results are ranked by the weighted distance,

    D(x_1, x_2) = \sum_{i \in A} w_i D_i(x_1, x_2),    (8)

where A is the group of all features and w_i are the user-defined
weights, whose default value is 1/|A|. The compatibility
measure for feature a is calculated by

    Comp_a(x_1, x_2) = 100\left[1 - D_a(x_1, x_2)\right],    (9)

where Comp_a ∈ [0, 100].
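Eqs. (8) and (9) amount to a weighted sum and a rescaling; a minimal sketch, assuming the per-feature distances have already been computed:

```python
def weighted_distance(distances, weights=None):
    """Eq. (8): D(x1, x2) = sum_i w_i * D_i(x1, x2).

    `distances` maps each feature name to its precomputed distance
    (e.g. from Eqs. (1) and (6)); default weights are 1/|A|."""
    if weights is None:
        weights = {a: 1.0 / len(distances) for a in distances}
    return sum(weights[a] * d for a, d in distances.items())

def compatibility(distance):
    """Eq. (9): map a per-feature distance in [0, 1] to a 0-100 score."""
    return 100.0 * (1.0 - distance)
```

With the default weights, a tempo distance of 0.25 and a chroma distance of 0.1 combine to a weighted distance of 0.175.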
Results can be classified as chromatic/non-chromatic tracks
by setting a threshold on the chromagram's variance and
average, as discussed in Section III-B. The system finds the
most compatible audio segments for the given query and
presents them in a ranked table, as illustrated in Table I, as
well as in a 2-dimensional graph, where the query segment is
located at the origin and the horizontal and vertical axes
measure the harmonic and tempo distances of the search
results, as illustrated in Fig. 5.
V. EXPERIMENTS AND RESULTS
As the motivation for the current study stemmed from the
ThruYOU project, its database was chosen as our data set. The
database consists of unrelated video soundtracks from which
Fig. 3. (a) Two highly correlated chromagrams, indicating harmonic compatibility (b) Two chromagrams with low correlation, indicating low harmonic
compatibility
Fig. 4. Block diagram of proposed system information flow. Each instance of the musical library and the query segment are converted to a feature vector.
When a query segment is loaded, search results are based on the weighted sum of the distances between the feature vectors, and plotted in a table and a 2D
graph.
Kutiman sampled audio for his project, some of which are
monophonic while others are polyphonic. The audio quality
varies; as most of the videos were recorded by low-end
equipment in non-studio settings, some of the soundtracks
suffer at times from background noise (e.g. air conditioner).
Kutiman created 7 compilations, 3-6 minutes long, out of over
130 video clips that accumulate to over 6 hours of video
material. The audio raw data contains a multitude of music
genres such as Classic, Rock, and Latin, as well as music
that does not follow a specific genre. Links to all of the
original videos used in ThruYOU are available online on the
project's web site. Our system's performance was tested by
experimenting with two types of search queries on the ThruYOU
database, described as follows.
A. Experiment I
In order to test our system’s output in relation to Kutiman’s
musical selection, we loaded an original track which is a part
of a ThruYOU compilation as a search query on the ThruYOU
database. This experiment imitated Kutiman’s manual search
that was now automated by our application. We expected
segments that were originally mixed together with our query
to appear as highly compatible search results.
In one of our experiments, we used the track ”Chopin
Track name                               Seg.   Com. %
Beethoven String Quartet Op.18 No.4        31    90.65
Piano Sonata in C minor I - Beethoven       3    90.37
Playing the Juno-60                         5    90.29
Beethoven String Quartet Op.18 No.4        28    89.86
Steelphon S900 synthesizer demo             1    89.48
Roland RS-09 like Arp Solina                9    89.11
Bach Cello Suite No.5 - Gigue               1    89.08
Tenor Saxophone F Major Scale               1    88.94
Piano Sonata in C minor I - Beethoven      19    87.94
J.C. Bach Concert in C Minor mvt.2          3    87.22
TABLE I
Top 10 results of Exp. I. The 6 marked segments were found and
used by Kutiman in track 3 ("I'm New").
Nocturne”, used in Kutiman’s compilation ”I’m New”, as a
query. In the ThruYOU compilation, the original piece is
almost unchanged and is repeated throughout the track with
various pieces mixed in simultaneously. Among the top 10
search results (Table I), 6 segments that were used by Kutiman
in track 3 of ThruYOU (marked) were found out of a 6-hour
database. We have repeated this experiment for selected
footage contained in each of the 7 ThruYOU compilations,
and found that, on average, more than 60% of Kutiman's mixed
pieces appeared within the top 10 search results of our system.
B. Experiment II
In the second experiment, we tried to simulate the first
creative step made by Kutiman, which is choosing musical
pieces that should be mixed together. Following the results of
a search query, we tried to create a new compilation based
on the system’s suggestions of highly compatible segments
from its database. In one of the experiments, we queried the
ThruYOU database with a vocal piece (”An original song by
Mandy”), and used segments that appeared in the system’s top
15 results for creating our new mix. In order to create the mix,
only elementary editing was done (Trim and fade in/out), and
the segments were used ”as is”. While assessing the quality
of our compilation is subjective, it was regarded as successful
by numerous listeners, including Kutiman himself. Three of
our compilations can be found at www.gidinave.com. The
application was installed on Kutiman's computer for further
research.
VI. CONCLUSION
In this paper, we have presented an audio-based search
algorithm that uses Music Information Retrieval (MIR) tools,
and have suggested the ThruYOU project as a data set for
experiments in content-based musical search. We tested the
system on the soundtracks of seemingly unrelated YouTube
videos whose quality is typically low, as they were recorded
using low-end equipment. ThruYOU is a classic example of
how the tools of MIR may simplify problems that musicians
face when compiling different musical pieces into a new
composition. Our system is a tool that can help musicians
in the composition process by dramatically reducing the time
spent exploring large musical databases, thus broadening the
scope of music with which they can work. However, while
computerized music analysis can aid the musician, it cannot
take the musician's place. We can identify the following
possible avenues for improving our system:
1) The harmonic segmentation tools we tested were insufficient
for providing end users (musicians) with material
to work with, as the output segments were too short.
2) The process of content-based search can be accelerated
by initial filtering of the tracks by their names. For
example, if a musician is looking for a piano piece, all
pieces that contain the word ”guitar” will not even be
searched.
3) Machine learning methods could provide the system with
relevant user feedback for a better understanding of
the user's musical taste.
4) Search may be supported by additional criteria such as
Genre and time signature.
REFERENCES
[1] J. Kluger, ”The Best Inventions of 2009,” Time Magazine 2009, Novem-
ber 12 [Online]. Available: http://www.time.com.
[2] F. Zheng, G. Zhang and Z. Song, "Comparison of Different Implementations
of MFCC," J. Computer Science & Technology, Vol. 16(6), pp.
582-589, 2001.
[3] S. Dixon, F. Gouyon, and G. Widmer, "Towards Characterisation of Music
via Rhythmic Patterns," Proceedings of the 5th ISMIR, Barcelona, Spain,
pp. 509-516, 2004.
[4] J. Pickens, J.P. Bello, G. Monti, T. Crawford, M. Dovey, M. Sandler, and
D. Byrd, "Polyphonic Score Retrieval Using Polyphonic Audio Queries: A
Harmonic Modeling Approach," Proceedings of the 3rd ISMIR, Paris,
France, pp. 140-149, 2002.
[5] D. Ellis and G. Poliner, ”Identifying ’Cover Songs’ with Chroma Features
and Dynamic Programming Beat Tracking,” ICASSP, Vol. 4, pp. 1429-
1432, 2007.
[6] A. Zils and F. Pachet, ”Musical Mosaicing,” Proceedings of DAFX 01,
Limerick (Ireland), 2001.
[7] Benward and Saker, "Music: In Theory and Practice," Vol. I, p. 47,
Seventh Edition, 2009.
[8] E. Gomez, ”Tonal Description of Polyphonic Audio for Music Content
Processing,” INFORMS Journal on Computing, Vol. 18, no. 3, pp. 294-
304, 2006.
[9] J. Brown, "Calculation of a Constant Q Spectral Transform," Journal of
the Acoustical Society of America, 89(1): 425-434, 1991.
[10] W.F. Thompson, ”Music, Thought, and Feeling: Understanding the
Psychology of Music,” 2008.
[11] W. Piston and M. DeVoto, ”Harmony,” 5th ed. New York: W. W. Norton,
1987.
[12] W. Apel, ”Harvard Dictionary of Music,” Cambridge: Harvard Univer-
sity Press, 1969.
[13] C. Krumhansl, ”Cognitive Foundations of Musical Pitch,” Oxford Psy-
chological Series, no. 17, Oxford University Press, New York, 1990.
[14] C. Krumhansl and E.J. Kessler, ”Tracing the Dynamic Changes in
Perceived Tonal Organization in a Spatial Representation of Musical
Keys,” Psychological Review, Vol. 89, pp. 334-368, 1982.
[15] O. Lartillot and P. Toiviainen, ”A Matlab Toolbox for Musical Feature
Extraction from Audio,” Proc. of the 10th Int. Conference on Digital
Audio Effects (DAFx-07), Bordeaux, France, 2007.