2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel
Musical Features Extraction for Audio-based Search
Ofir Lindenbaum, Shai Maskit, Ophir Kutiel and Gideon Nave
Signal and Image Processing Laboratory (SIPL), Department of Electrical Engineering, Technion - IIT
Technion City, 32000, Haifa, Israel
email: gidi@tx.technion.ac.il
web: sipl.technion.ac.il
Abstract—Mixing unrelated musical sessions is a new form of
music creation, driven by the increasing popularity of media-
sharing web sites such as YouTube and MySpace. The basic
questions addressed in the present paper are which musical
features are required for matching two musical pieces, how to
decide if two musical pieces are compatible and how to measure
the degree of compatibility. We present a system designed for
content-based audio search that extracts musical features and
finds audio tracks which are compatible to a given musical query
in a music database.
I. INTRODUCTION
The increasing data transfer rates and storage capabilities,
as well as the vast growth in user accessibility in recent years,
enable internet users around the globe to share their musical creations. As a result, a new form of musical creation was born:
compositions made by combining unrelated samples of music.
ThruYOU (www.thru-you.com) by Ophir Kutiel ("Kutiman") is an online music video project mixed from samples of unrelated amateur YouTube music videos. The project was
chosen by Time Magazine as one of the 50 best inventions of
2009 [1]. In his creation, Kutiman sampled footage posted on
YouTube by amateur musicians (drums, piano, guitar, vocals,
etc.) and mixed it together into musical jams, as illustrated in
Fig. 1 (a screenshot of a ThruYOU videoclip). While searching
for musical samples in order to create a mix, Kutiman queried
the YouTube database for videos that were indexed to have
specific musical attributes. His search results were based on
someone manually tagging the videos, rather than on digital
analysis of the video soundtracks.
Music is a complex form of information which consists of
various musical features. In order to extract, recognize and
search by these features, a solution other than the standard
metadata/tag search is required. The goal of the current study is to develop a method for content-based search that will
assist musicians in exploring large music databases and thus
will broaden the spectrum of music with which they can work.
We present a system for musical search based on musical
features, designed for music compatibility.
This paper is organized as follows. We discuss related work
in the context of music similarity measures in Section II.
Section III gives details about the musical features used in
our music similarity system, which are presented in Section
IV. Experimental results are presented in Section V, followed
by conclusion and discussion in Section VI.
Fig. 1. A screenshot of a ThruYOU video clip, illustrating the mix of unrelated musical tracks
II. RELATED WORK
Previous methods for measuring musical similarity use different levels of feature extraction. Low-level approaches concentrate mostly on attributes of timbre and rhythm, such as MFCCs [2], which are easily and accurately computed but lack the semantic audio information that is essential for match-
ing two pieces. For example, Dixon et al. [3] successfully
characterized music according to rhythm by adding higher-
level descriptors to a low-level feature set.
High-level representations, such as music transcription (i.e. MIDI) and chord extraction, hold the semantic audio information but lack the desired accuracy, and are therefore considered unsolved problems. For example, Pickens et al. [4] succeeded
in identifying harmonic similarities between a polyphonic
audio query and symbolic polyphonic scores. Their approach
relied on automatic transcription, which is partially effective
within a highly constrained subset of musical recordings (e.g.
mono-timbral, no drums or vocals, small polyphonies). To
overcome transcription errors, the symbolic data was converted to harmonic distributions, and the similarity measure was computed using these distributions over time intervals.
Mid-level representations, which contain semantic audio information without the temporal resolution obtained by transcription, overcome the miscalculations inherent in high-level representations and create a meaningful musical description
that suits our goal of matching two pieces. For example, Ellis
and Poliner [5] present a system that attempts to identify
when different musicians perform the same underlying song
- also known as ’cover songs’. To overcome variability in
tempo, beat tracking was used for describing each piece
with one feature vector per beat. To deal with variation in instrumentation, they suggest using 12-dimensional 'Chroma'
feature vectors that collect spectral energy supporting each
semitone of the octave.
Zils and Pachet [6] proposed a sequence generation mechanism called musical mosaicing, which enables the automatic generation of sequences of sound samples by specifying only high-level properties of the desired sequence. The properties of the sequence specified by the user are automatically translated into constraints on descriptors of the samples.
While we are by no means the first to use mid-level
musical features for music similarity measures, it is important
to understand the essential difference between the current
study and previous approaches in this context. In this paper,
we present a system for measuring similarity that enables,
for example, matching vocals and piano, a task that music
similarity tools are not designed for.
III. MUSICAL FEATURES EXTRACTION
In this section, we discuss the essential musical attributes
and define measures for matching them. The attributes are
extracted from WAV files, each of which is split into 10-second segments. Throughout this paper, we regard two musical segments as vectors $x_1$ and $x_2$. The distance function associated with feature $a$ is denoted $D_a(x_1, x_2)$.
A. Beats per Minute (BPM)
BPM describes the tempo at which the tune is played.
The term ’beats’ refers to repeated musical structure, and
we focused on the number of repetitions per minute. Disk
Jockeys (DJs) often use BPM for song mixing, as two musical
pieces with similar BPM usually sound well when played
together. While BPM matching can be obtained by manual
manipulations such as precise slicing, such approaches may
lead to unwanted distorting effects. Hence, we use the native
BPM of songs for determining a distance that reflects their
similarity. BPM is extracted by detection of peaks in the
segment’s autocorrelation function [15]. The distance measure
for BPM matching is defined as

$$D_{tempo} = \frac{|tempo(x_1) - tempo(x_2)|}{\max[tempo(x_1),\, tempo(x_2)]}, \qquad (1)$$

where $tempo(x) \in [0, 200]$ represents the extracted BPM of $x$.
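As a concrete illustration of Eq. (1), the following Python sketch computes the tempo distance for two segments whose BPM values have already been estimated (in our system this is done by autocorrelation peak picking via the MIR toolbox [15]); the function name and example values are illustrative only.

```python
def tempo_distance(tempo1: float, tempo2: float) -> float:
    """Normalized tempo distance of Eq. (1); 0 means identical BPM,
    values close to 1 mean very different tempi."""
    return abs(tempo1 - tempo2) / max(tempo1, tempo2)

# Example (illustrative values): 120 BPM vs. 100 BPM -> 20/120 = 0.167
print(tempo_distance(120.0, 100.0))
```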
B. Chromatic Scale and Chromagram
The chromatic scale is a 12-note musical scale whose notes are equally spaced on a logarithmic frequency scale, starting from a basic note [7]. The chromagram, also known as the harmonic pitch
class profile, is a histogram of notes of a given musical piece
showing the distribution of energy along the pitch classes [8].
It corresponds to the chromatic scale, in which the frequencies
are mapped onto a limited set of 12 chroma values (i.e. all oc-
taves are wrapped into one). A common method for computing
a chromagram is the constant Q transform (CQT) [9], which
computes a discrete spectral analysis of logarithmically spaced
bins. The CQT is defined as
$$X_{cq}[k] = \sum_{n=0}^{N(k)-1} w[n,k] \cdot x[n] \cdot e^{-j 2\pi n f_k}, \qquad (2)$$

where the $k$-th frequency bin is calculated using

$$f_k = 2^{k/\beta} \cdot f_{min}, \qquad (3)$$

where in our case, the number of bins per octave $\beta$ equals 12 and $f_{min}$ is the lowest frequency analyzed. The CQT can be viewed as a DFT with a varying window size, and thus varying frequency resolution. The window $w[n,k]$ and $N(k)$ are functions of the computed frequency bin $k$. Finally, using $X_{cq}$, we compute the chromagram of $x$ by summing all corresponding bins from different octaves into a 12-length vector, $C_x$, whose $b$-th bin is calculated by

$$C_x(b) = \sum_{m=0}^{M} |X_{cq}(b + m\beta)|, \qquad (4)$$

where $b \in [1, 12]$ is the chroma bin number, and $M$ is the total number of octaves in the constant Q spectrum. In the current study, the chromagram is normalized so that the value of its maximal bin is set to 1.
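To make Eqs. (2)-(4) concrete, the sketch below folds a precomputed constant-Q magnitude matrix (12 bins per octave) into a normalized 12-bin chromagram. It is a minimal illustration rather than the toolbox implementation used in our system: the folding is additionally summed over time frames to obtain one vector per 10-second segment, and the CQT itself is assumed to come from an external routine (e.g. librosa.cqt).

```python
import numpy as np

def chromagram_from_cqt(cqt_mag: np.ndarray, bins_per_octave: int = 12) -> np.ndarray:
    """Fold constant-Q magnitudes (n_bins x n_frames) into a 12-bin chroma
    vector as in Eq. (4), summing bins that are an integer number of octaves
    apart (and, additionally, summing over time frames), then normalizing so
    that the maximal bin equals 1."""
    chroma = np.zeros(bins_per_octave)
    for k in range(cqt_mag.shape[0]):
        chroma[k % bins_per_octave] += np.abs(cqt_mag[k]).sum()
    return chroma / (chroma.max() + 1e-12)

# Illustrative usage with a CQT obtained from an external library:
#   C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))
#   chroma = chromagram_from_cqt(C)
```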
While analyzing various monophonic musical pieces, we have encountered a major difference between the chromagrams of chromatic instruments (e.g. piano, guitar, flute), which are characterized by high variance and a low average, and those of non-chromatic instruments (e.g. vocals, drums), which are characterized by low variance and a high average, as evident in Fig. 2. This observation can be explained by the fact that a chromatic instrument's spectrum is concentrated in the 12 chromatic pitch notes, creating a sparser chromagram with high variance and a low average, whereas non-chromatic instruments' spectral energy is spread across all 12 chromatic bins. Using this observation, we have created an optional user filter for limiting search results to chromatic or non-chromatic instruments.
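A minimal sketch of this optional filter is given below; the threshold values are illustrative placeholders rather than the calibrated settings used in our system.

```python
import numpy as np

def is_chromatic(chroma: np.ndarray,
                 var_threshold: float = 0.05,
                 mean_threshold: float = 0.5) -> bool:
    """Classify a normalized 12-bin chromagram as coming from a chromatic
    instrument (sparse profile: high variance, low average) or from a
    non-chromatic source such as vocals or drums (flat profile: low
    variance, high average). Threshold values are hypothetical."""
    return bool(chroma.var() > var_threshold and chroma.mean() < mean_threshold)
```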
C. Cyclic Harmonic Cross-correlation
Musical instruments are often based on an approximate
harmonic oscillator (such as a string or a column of air),
oscillating at numerous frequencies simultaneously. These fre-
quencies are the harmonics of a basic frequency representing
the pitch note. Two notes played simultaneously with a large
amount of common harmonics will sound pleasant to the
listener [10]. Specifically, notes separated by 4 or 7 semitones (e.g. C and E, or C and G) share many common harmonics, and are therefore harmonically compatible [11].
Fig. 2. Variance and average of 54 non-chromatic and 66 chromatic pieces.
It is evident that the distributions of variance and average are concentrated in
two different corners, indicating the nature of the playing instrument.
A musical key is a defined series of notes, and each key has unique characteristics. For instance, Major keys are customarily attributed a sense of infinity or suspense, whereas Minor keys are attributed a sense of sadness or deep emotion [12]. The
maximum key-profile correlation (MKC) [13] is an algorithm
for finding the most prominent key in a music sample. The
MKC algorithm is based on key profiles [14] representing
typical chromagrams of common musical keys. The algorithm
computes the correlation between the chromagram of a mu-
sical sample and all 24 common western key profiles (Major
and Minor), and the key profile that provides the maximum
correlation is taken as the most probable key of the musical
sample.
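For reference, a minimal sketch of this MKC baseline is shown below, using the Krumhansl-Kessler key profiles [14]; it is not the method adopted in this paper, which instead uses the direct chromagram cross-correlation defined next.

```python
import numpy as np

# Krumhansl-Kessler key profiles [14] for C major / C minor; the other
# 22 keys are obtained by circular rotation of these templates.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def mkc_key(chroma: np.ndarray):
    """Maximum key-profile correlation [13]: return (tonic pitch class, mode)
    of the key profile most correlated with the 12-bin chromagram."""
    best_r, best_key = -2.0, (0, "major")
    for mode, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
        for tonic in range(12):
            r = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best_r, best_key = r, (tonic, mode)
    return best_key
```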
Key matching by the MKC method is commonly used in
music matching applications. However, this approach does
not take into account music played in keys that differ from the common major and minor keys (e.g. Arabic or pentatonic scales). To overcome this limitation, our approach mea-
sures harmonic similarity using direct cross correlation of
the pieces’ chromagrams, maintaining the original musical
characteristics. We define the cyclic harmonic cross-correlation as

$$R_{1,2}(p) = \frac{E[C_{x_1}(l) \cdot C_{x_2}((l-p) \bmod 12)]}{\sqrt{var(C_{x_1})\, var(C_{x_2})}}. \qquad (5)$$
High cross-correlation values of $R_{1,2}(0)$, $R_{1,2}(4)$ and $R_{1,2}(7)$ indicate that the correlated segments share common harmonics, and therefore are harmonically compatible. Accordingly, the chroma distance is calculated using the maximal correlation achieved, where the shifted versions are weighted by 0.8:

$$D_c(x_1, x_2) = \frac{1}{2}\,[1 - R_{max}(x_1, x_2)], \qquad (6)$$

where

$$R_{max} = \max[R_{1,2}(0),\; 0.8\,R_{1,2}(4),\; 0.8\,R_{1,2}(7)]. \qquad (7)$$
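The sketch below implements Eqs. (5)-(7) for two normalized 12-bin chromagrams. Following the normalized (Pearson-style) reading of Eq. (5), the means are subtracted in the numerator so that the correlation lies in [-1, 1]; the sign convention of the cyclic shift is an assumption.

```python
import numpy as np

def cyclic_chroma_correlation(c1: np.ndarray, c2: np.ndarray, p: int) -> float:
    """Normalized cyclic cross-correlation of Eq. (5) at a lag of p semitones."""
    shifted = np.roll(c2, p)          # shifted[l] = c2[(l - p) mod 12]
    num = np.mean((c1 - c1.mean()) * (shifted - shifted.mean()))
    return float(num / np.sqrt(c1.var() * c2.var()))

def chroma_distance(c1: np.ndarray, c2: np.ndarray) -> float:
    """Chroma distance of Eqs. (6)-(7): the unison lag is weighted 1, the
    4- and 7-semitone lags are weighted 0.8."""
    r_max = max(cyclic_chroma_correlation(c1, c2, 0),
                0.8 * cyclic_chroma_correlation(c1, c2, 4),
                0.8 * cyclic_chroma_correlation(c1, c2, 7))
    return 0.5 * (1.0 - r_max)
```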
Fig. 5. An exemplary 2-dimensional output graph of our search system. The query segment is located at the origin, the horizontal axis represents harmonic distance and the vertical axis corresponds to tempo distance.
IV. CONTENT-BASED MUSICAL SEARCH SYSTEM
Based on the musical features discussed in Section III, we
have developed an audio-based musical search system. The
system’s information flow is described in Fig. 4, and consists
of three main stages.
Initially, all audio tracks of the musical database are loaded into the system. Each track is split into 10-second segments, from each of which the feature vector (BPM and chroma) is extracted using the MIR Matlab toolbox [15].
Musical search is conducted by loading a query audio
segment and specifying the desired weight of each musical fea-
ture. After calculating the feature-specific distances between
the query segment and all database instances ("matching"), the results are sorted in increasing order of the weighted distance,
$$D(x_1, x_2) = \sum_{i \in A} w_i D_i(x_1, x_2), \qquad (8)$$

where $A$ is the group of all features and $w_i$ are the user-defined weights, whose default value is $\frac{1}{|A|}$. The compatibility measure for feature $a$ is calculated by

$$Comp_a(x_1, x_2) = 100\,[1 - D_a(x_1, x_2)], \qquad (9)$$

where $Comp_a \in [0, 100]$.
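A minimal sketch of Eqs. (8) and (9): the per-feature distances (here keyed by feature name) are combined with user-defined weights defaulting to 1/|A|, and each distance maps to a 0-100 compatibility score. Names and values are illustrative.

```python
def weighted_distance(distances: dict, weights: dict = None) -> float:
    """Weighted sum of per-feature distances, Eq. (8); with no weights
    given, every feature receives the default weight 1/|A|."""
    if weights is None:
        weights = {a: 1.0 / len(distances) for a in distances}
    return sum(weights[a] * d for a, d in distances.items())

def compatibility(distance: float) -> float:
    """Per-feature compatibility score in [0, 100], Eq. (9)."""
    return 100.0 * (1.0 - distance)

# Example with illustrative values and equal default weights (1/2 each):
d = weighted_distance({"tempo": 0.10, "chroma": 0.20})   # -> 0.15
print(d, compatibility(d))                                # -> 0.15 85.0
```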
Results can be classified as chromatic/non-chromatic tracks,
by setting a threshold for the chromagram’s variance and
average, as discussed in Section III-B. The system finds the
most compatible audio segments for the given query, and
presents them in a ranked table, as illustrated in Table I, as well as in a 2-dimensional graph, where the query segment is located at the origin and the horizontal and vertical axes measure the harmonic and tempo distances of the search results, as illustrated in Fig. 5.
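Putting the pieces together, the following sketch ranks database segments against a query using the helper functions sketched above (tempo_distance, chroma_distance, is_chromatic, compatibility); the Segment record, the parameter names, and the mapping of the overall weighted distance to a 0-100 score for display are illustrative assumptions, not the exact implementation of our system.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Segment:
    track: str            # source track name
    index: int            # 10-second segment index within the track
    tempo: float          # extracted BPM
    chroma: np.ndarray    # normalized 12-bin chromagram

def search(query, database, w_tempo=0.5, w_chroma=0.5,
           chromatic_only=False, top_k=10):
    """Rank database segments by the weighted distance of Eq. (8),
    optionally keeping only chromatic instruments (Section III-B),
    and report a 0-100 score per result in the spirit of Eq. (9)."""
    results = []
    for seg in database:
        if chromatic_only and not is_chromatic(seg.chroma):
            continue
        d = (w_tempo * tempo_distance(query.tempo, seg.tempo)
             + w_chroma * chroma_distance(query.chroma, seg.chroma))
        results.append((d, seg))
    results.sort(key=lambda pair: pair[0])       # smallest distance first
    return [(seg, compatibility(d)) for d, seg in results[:top_k]]
```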
V. EXPERIMENTS AND RESULTS
As the motivation of the current study stemmed from the ThruYOU project, its database was chosen as our data set. The
database consists of unrelated video soundtracks from which
Fig. 3. (a) Two highly correlated chromagrams, indicating harmonic compatibility. (b) Two chromagrams with low correlation, indicating low harmonic compatibility.
Fig. 4. Block diagram of proposed system information flow. Each instance of the musical library and the query segment are converted to a feature vector.
When a query segment is loaded, search results are based on the weighted sum of the distances between the feature vectors, and plotted in a table and a 2D
graph.
Kutiman sampled audio for his project, some of which are
monophonic while others are polyphonic. The audio quality
varies; as most of the videos were recorded by low-end
equipment in non-studio settings, some of the soundtracks
suffer at times from background noise (e.g. air conditioner).
Kutiman created 7 compilations, 3-6 minutes long, out of over
130 video clips that accumulate to over 6 hours of video
material. The audio raw data contains a multitude of music
genres such as Classic, Rock, and Latin, as well as music
that does not follow a specific genre. Links to all of the
original videos used in ThruYOU are available online on the
project's web site. Our system's performance was tested by experimenting with two types of search queries on the ThruYOU database, described as follows.
A. Experiments I
In order to test our system’s output in relation to Kutiman’s
musical selection, we loaded an original track which is a part
of a ThruYOU compilation as a search query on the ThruYOU
database. This experiment imitated Kutiman’s manual search
that was now automated by our application. We expected
segments that were originally mixed together with our query
to appear as highly compatible search results.
In one of our experiments, we used the track "Chopin Nocturne", used in Kutiman's compilation "I'm New", as a query. In the ThruYOU compilation, the original piece is almost unchanged and is repeated throughout the track with various pieces mixed simultaneously. Viewing the top 10 search results (Table I), 6 segments that were used by Kutiman in track 3 of ThruYOU (marked with * in the table) were found out of a 6-hour database. We have repeated this experiment for selected footage contained in each of the 7 ThruYOU compilations, and found that on average, more than 60% of Kutiman's mixed pieces appeared within the top 10 search results of our system.

TABLE I
TOP 10 RESULTS OF EXP. I. THE 6 SEGMENTS MARKED WITH * WERE FOUND AND USED BY KUTIMAN IN TRACK 3 ("I'M NEW").

  Track name                             Seg.   Comp. %
  Beethoven String Quartet Op.18 No.4     31     90.65
  Piano Sonata in c minor I Beethoven      3     90.37
  Playing the Juno 60 *                    5     90.29
  Beethoven String Quartet Op.18 No.4     28     89.86
  Steelphon S900 synthesizer demo *        1     89.48
  Roland RS09 like Arp Solina *            9     89.11
  Bach Cello Suite No.5 Gigue *            1     89.08
  Tenor Saxophone F Major Scale *          1     88.94
  Piano Sonata in c minor I Beethoven     19     87.94
  J.C. Bach Concert in C Minor mvt.2 *     3     87.22
B. Experiments II
In the second experiment, we tried to simulate the first
creative step made by Kutiman, which is choosing musical
pieces that should be mixed together. Following the results of
a search query, we tried to create a new compilation based
on the system’s suggestions of highly compatible segments
from its database. In one of the experiments, we queried the
ThruYOU database with a vocal piece (”An original song by
Mandy”), and used segments that appeared in the system’s top
15 results for creating our new mix. In order to create the mix,
only elementary editing was done (trim and fade in/out), and the segments were used "as is". While assessing the quality of our compilation is subjective, it was regarded as successful by numerous listeners, including Kutiman himself. Three of our compilations can be found at www.gidinave.com. The application was installed on Kutiman's computer for further research.
VI. CONCLUSION
In this paper, we have presented an audio-based search
algorithm that uses Musical Information Retrieval (MIR) tools
and have suggested the ThruYOU project as a data set for experiments in content-based musical search. We tested the
system on the soundtracks of seemingly unrelated YouTube
videos whose quality is typically low, as they were recorded
using low-end equipment. ThruYOU is a classic example of how the tools of MIR may simplify problems that musicians face when compiling different musical pieces into a new
composition. Our system is a tool that can help musicians
in the composition process, by dramatically reducing the time
spent exploring large musical databases, thus broadening the scope of music with which musicians can work. However, while the computerized processes of music analysis can aid the musician, they cannot take the musician's place. We can identify the
following possible avenues for improving our system:
1) The harmonic segmentation tools we have tested were insufficient in providing end users (musicians) with material to work with, as the output segments were too short.
2) The process of content-based search can be accelerated
by initial filtering of the tracks by their names. For
example, if a musician is looking for a piano piece, all
pieces that contain the word ”guitar” will not even be
searched.
3) Machine learning methods could provide the system with relevant user feedback for a better understanding of the user's musical taste.
4) Search may be supported by additional criteria such as
Genre and time signature.
REFERENCES
[1] J. Kluger, ”The Best Inventions of 2009,” Time Magazine 2009, Novem-
ber 12 [Online]. Available: http://www.time.com.
[2] F. Zheng, Gu. Zhang and Z. Song, ”Comparison of Different Implemen-
tations of MFCC," J. Computer Science & Technology, Vol. 16(6), pp. 582-589, 2001.
[3] S. Dixon, F. Gouyon, and G. Widmer, ”Towards Characterisation of Music
via Rhythmic Patterns," Proceedings of the 5th ISMIR, Barcelona, Spain, pp. 509-516, 2004.
[4] J. Pickens, J.P. Bello, G. Monti, T. Crawford, M. Dovey, M. Sandler, and
D. Byrd., ”Polyphonic score retrieval using polyphonic audio queries: A
harmonic modeling approach," In Proceedings of the 3rd ISMIR, Paris, France, pp. 140-149, 2002.
[5] D. Ellis and G. Poliner, ”Identifying ’Cover Songs’ with Chroma Features
and Dynamic Programming Beat Tracking,” ICASSP, Vol. 4, pp. 1429-
1432, 2007.
[6] A. Zils and F. Pachet, ”Musical Mosaicing,” Proceedings of DAFX 01,
Limerick (Ireland), 2001.
[7] Benward and Saker, "Music: In Theory and Practice," Vol. I, p. 47,
Seventh Edition, 2009.
[8] E. Gomez, ”Tonal Description of Polyphonic Audio for Music Content
Processing,” INFORMS Journal on Computing, Vol. 18, no. 3, pp. 294-
304, 2006.
[9] J. Brown, ”Calculation of a Constant Q spectral Transform,” Journal of
the Acoustical Society of America, 89(1): 425-434, 1991.
[10] W.F. Thompson, ”Music, Thought, and Feeling: Understanding the
Psychology of Music,” 2008.
[11] W. Piston and M. DeVoto, ”Harmony,” 5th ed. New York: W. W. Norton,
1987.
[12] W. Apel, ”Harvard Dictionary of Music,” Cambridge: Harvard Univer-
sity Press, 1969.
[13] C. Krumhansl, ”Cognitive Foundations of Musical Pitch,” Oxford Psy-
chological Series, no. 17, Oxford University Press, New York, 1990.
[14] C. Krumhansl and E.J. Kessler, ”Tracing the Dynamic Changes in
Perceived Tonal Organization in a Spatial Representation of Musical
Keys,” Psychological Review, Vol. 89, pp. 334-368, 1982.
[15] O. Lartillot and P. Toiviainen, ”A Matlab Toolbox for Musical Feature
Extraction from Audio,” Proc. of the 10th Int. Conference on Digital
Audio Effects (DAFx-07), Bordeaux, France, 2007.