All content in this area was uploaded by Ayush Kumar Shah on Sep 12, 2022
Chroma Feature Extraction
A. K. Shah, M. Kattel, A. Nepal, D. Shrestha
Department of Computer Science and Engineering, School of Engineering
Kathmandu University, Nepal
ayush.kumar.shah@gmail.com, manasikattel1@gmail.com, araju7nepal@gmail.com,
deepeshshrestha@outlook.com
Abstract
The chroma feature is a descriptor,
which represents the tonal content of a
musical audio signal in a condensed
form. Chroma features are therefore an
important prerequisite for high-level
semantic analysis tasks such as chord
recognition or harmonic similarity
estimation, and higher-quality chroma
features lead to better results in these
tasks. The Short-Time Fourier Transform
and the Constant-Q Transform are used
for chroma feature extraction.
Keywords: Fourier transform,
spectrogram, chroma representation,
chroma vector
1. Introduction
Over the past few years, the need
for music information retrieval and
classification systems has become more
urgent, which has given rise to a
research area called Music Information
Retrieval (MIR). Chroma feature
extraction is an important task in the
analysis and transcription of music, and
it can contribute to applications such as
key detection, structural segmentation,
music similarity measures, and other
semantic analysis tasks.
Pitch
Pitch is a perceptual property of
sounds that allows their ordering on a
frequency-related scale, or more
commonly, pitch is the quality that
makes it possible to judge sounds as
"higher" and "lower" in the sense
associated with musical melodies. Pitch
can be determined only in sounds that
have a frequency that is clear and stable
enough to distinguish from noise. Pitch
is a major auditory attribute of musical
tones, along with duration, loudness, and
timbre [1].
Chroma
Chroma is a quality of a pitch
class that refers to the "color" of a
musical pitch. A pitch can be
decomposed into an octave-invariant
value called "chroma" and a "pitch
height" that indicates the octave the
pitch is in [2].
Chroma Vector
A chroma vector is typically a
12-element feature vector indicating how
much energy of each pitch class, {C, C#,
D, D#, E, ..., B}, is present in the signal.
The Chroma vector is a perceptually
motivated feature vector. It uses the
concept of chroma in the cyclic helix
representation of musical pitch
perception. The Chroma vector thus
represents magnitudes in twelve pitch
classes in a standard chromatic scale [3].
Chroma features
In music, the term chroma
feature or chromagram closely relates to
the twelve different pitch classes.
Chroma-based features, which are also
referred to as "pitch class profiles", are a
powerful tool for analyzing music whose
pitches can be meaningfully categorized
(often into twelve categories) and whose
tuning approximates to the
equal-tempered scale. One main
property of chroma features is that they
capture harmonic and melodic
characteristics of music, while being
robust to changes in timbre and
instrumentation.
Chroma features aim at
representing the harmonic content
(e.g. keys, chords) of a short-time window
of audio. The feature vector is extracted
from the magnitude spectrum obtained
with, for example, a short-time Fourier
transform (STFT) or a constant-Q
transform (CQT); variants such as Chroma
Energy Normalized Statistics (CENS)
build on these representations [5].
Harmonic Pitch Class Profile (HPCP)
Harmonic pitch class profiles
(HPCP) are a group of features that a
computer program extracts from an
audio signal, based on the pitch class
profile, a descriptor proposed in the
context of a chord recognition system.
HPCP is an enhanced pitch distribution
feature: a sequence of feature
vectors that, to a certain extent, describe
tonality by measuring the relative
intensity of each of the 12 pitch classes
of the equal-tempered scale within an
analysis frame. The twelve pitch spelling
attributes are often also referred to as
chroma, and the HPCP features are
closely related to chroma features or
chromagrams [4].
2. Background
Features of audio
● Frequency
● Amplitude
Frequency is the speed of the
vibration, and this determines the pitch
of the sound. It is most meaningful for
musical sounds, where there is a
strongly regular waveform. It is
measured as the number of wave
cycles that occur in one second. The
unit of frequency is the hertz (Hz).
Amplitude is the size of the
vibration, and this determines how loud
the sound is: larger vibrations make a
louder sound. Amplitude is important
when balancing and controlling the
loudness of sounds, such as with the
volume control on a CD player.
Fig 2.1: Features of audio
2.1 Chroma Features
The underlying observation is
that humans perceive two musical
pitches as similar in color if they differ
by an octave. Based on this observation,
a pitch can be separated into two
components, which are referred to as
tone height and chroma. Assuming the
equal-tempered scale, one considers
twelve chroma values represented by the
set
{C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯,
B}
that consists of the twelve pitch spelling
attributes as used in Western music
notation. Note that in the equal-tempered
scale, different pitch spellings such as C♯
and D♭ refer to the same chroma.
Enumerating the chroma values, one can
identify the set of chroma values with
the set of integers {1,2,...,12}, where 1
refers to chroma C, 2 to C♯, and so on. A
pitch class is defined as the set of all
pitches that share the same chroma. For
example, using the scientific pitch
notation, the pitch class corresponding to
the chroma C is the set
{..., C−2, C−1, C0, C1, C2, C3 ...}
consisting of all pitches separated by an
integer number of octaves. Given a
music representation (e.g. a musical
score or an audio recording), the main
idea of chroma features is to aggregate
for a given local time window (e.g.
specified in beats or in seconds) all
information that relates to a given
chroma into a single coefficient. Shifting
the time window across the music
representation results in a sequence of
chroma features each expressing how the
representation's pitch content within the
time window is spread over the twelve
chroma bands. The resulting
time-chroma representation is also
referred to as chromagram. The figure
below shows chromagrams for a
C-major scale, once obtained from a
musical score and once from an audio
recording [5].
Fig 2.2 : (a) Musical score of a C-major scale.
(b) Chromagram obtained from the score. (c)
Audio recording of the C-major scale played on
a piano. (d) Chromagram obtained from the
audio recording.
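The decomposition of a pitch into chroma and tone height described above can be sketched in a few lines of Python. The sketch assumes that pitches are given as MIDI note numbers with the common convention that pitch 60 is C4, and it uses a 0-based chroma index rather than the 1-based enumeration of the text. The names `split_pitch` and `pitch_class` are hypothetical.

```python
CHROMA_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def split_pitch(midi_pitch):
    """Split a MIDI pitch number into (chroma name, octave number).

    Assumes the common convention in which MIDI pitch 60 is C4.
    """
    chroma_index = midi_pitch % 12   # octave-invariant part (0 = C, 1 = C#, ...)
    octave = midi_pitch // 12 - 1    # tone height
    return CHROMA_NAMES[chroma_index], octave


def pitch_class(chroma_index, lowest=0, highest=127):
    """Return all MIDI pitches in [lowest, highest] that share one chroma."""
    return [p for p in range(lowest, highest + 1) if p % 12 == chroma_index]


print(split_pitch(60))      # ('C', 4)
print(split_pitch(72))      # ('C', 5): same chroma, one octave higher
print(pitch_class(0)[:4])   # [0, 12, 24, 36]
```

Pitches 60 and 72 share the chroma C and differ only in tone height, which is the property that chroma features exploit.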
2.1.1 Types of Chroma Features [7]
2.1.1.1 CP Feature
From the Pitch representation,
one can obtain a chroma representation
by simply adding up the corresponding
values that belong to the same chroma.
To achieve invariance in dynamics, we
normalize each chroma vector with
respect to the Euclidean norm. The
resulting features are referred to as
Chroma-Pitch denoted by CP.
Fig 2.3 : CP Feature
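A minimal sketch of the CP computation is given below, assuming that the pitch representation of one frame is a list of 128 non-negative energies indexed by MIDI pitch. The function name `chroma_pitch` and the handling of silent frames are choices of this sketch, not taken from the cited toolbox.

```python
import math


def chroma_pitch(pitch_energy):
    """Fold per-pitch energies into a 12-bin chroma vector of unit Euclidean norm.

    pitch_energy[p] is the local energy of MIDI pitch p (illustrative input).
    """
    chroma = [0.0] * 12
    for p, energy in enumerate(pitch_energy):
        chroma[p % 12] += energy
    norm = math.sqrt(sum(c * c for c in chroma))
    if norm == 0.0:
        return chroma  # silent frame: returned unchanged (a choice of this sketch)
    return [c / norm for c in chroma]


# The same C major triad (C4, E4, G4) played softly and loudly.
soft = [0.0] * 128
loud = [0.0] * 128
for p in (60, 64, 67):
    soft[p] = 1.0
    loud[p] = 100.0

cp_soft = chroma_pitch(soft)
cp_loud = chroma_pitch(loud)
print([round(v, 3) for v in cp_soft])
print([round(v, 3) for v in cp_loud])
```

Both calls yield the same vector, which illustrates why the normalization makes the feature invariant to changes in dynamics.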
2.1.1.2 CLP Features
To account for the logarithmic
sensation of sound intensity, one often
applies a logarithmic compression when
computing audio features [11]. To this
end, the local energy values e of the
pitch representation are logarithmized
before deriving the chroma
representation.
Here, each entry e is replaced by the
value
log(η · e+1), where η is a suitable
positive constant. The resulting features,
which depend on the compression
parameter η, are referred to as
Chroma-Log-Pitch denoted by CLP[η].
Fig 2.4 : CLP Features
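The effect of the compression can be illustrated with a short sketch. It assumes the natural logarithm and an arbitrarily chosen value η = 100; neither choice is prescribed by the text, and `log_compress` is a hypothetical name.

```python
import math


def log_compress(pitch_energy, eta=100.0):
    """Replace each energy value e by log(eta * e + 1)."""
    return [math.log(eta * e + 1.0) for e in pitch_energy]


# Two notes whose energies differ by a factor of 1000.
energies = [1.0, 1000.0]
compressed = log_compress(energies, eta=100.0)
ratio_before = energies[1] / energies[0]
ratio_after = compressed[1] / compressed[0]
print(ratio_before, round(ratio_after, 2))
```

After compression the louder note is only about 2.5 times as strong as the softer one, so weak but perceptually relevant components contribute more to the subsequent chroma vector.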
2.1.1.3 CENS Features
Adding a further degree of
abstraction by considering short-time
statistics over energy distributions within
the chroma bands, one obtains CENS
(Chroma Energy Normalized Statistics)
features, which constitute a family of
scalable and robust audio features. These
features have turned out to be very
useful in audio matching and retrieval
applications. In computing CENS
features, a quantization is applied based
on logarithmically chosen thresholds.
This introduces some kind of
logarithmic compression similar to the
CLP[η] features. Furthermore, these
features allow for introducing a temporal
smoothing. Here, feature vectors are
averaged using a sliding window
technique depending on a window size
denoted by w (given in frames) and a
downsampling factor denoted by d. In
the following, we do not change the
feature rate and consider only the case d
= 1 (no downsampling). Therefore, the
resulting feature only depends on the
parameter w and is denoted by
CENS[w].
Fig 2.5 : CENS Features
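The CENS[w] computation for d = 1 can be sketched as follows. The thresholds 0.05, 0.1, 0.2 and 0.4, the Hann-shaped smoothing window and the final Euclidean normalization are assumptions based on common descriptions of CENS and are not stated in the text above. The input is assumed to be a list of 12-bin chroma frames.

```python
import math

# Assumed quantization thresholds (logarithmically spaced); not given in the text.
THRESHOLDS = (0.05, 0.1, 0.2, 0.4)


def quantize(chroma):
    """Map each chroma band's share of the frame energy to an integer in 0..4."""
    total = sum(chroma)
    if total == 0.0:
        return [0] * len(chroma)
    return [sum(1 for t in THRESHOLDS if c / total >= t) for c in chroma]


def cens(chromagram, w=3):
    """Compute CENS[w] (d = 1, no downsampling) from a list of 12-bin frames."""
    quantized = [quantize(frame) for frame in chromagram]
    # Hann-shaped smoothing window of length w with non-zero end points.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (w + 1)) for i in range(w)]
    half = w // 2
    features = []
    for m in range(len(quantized)):
        smoothed = [0.0] * 12
        for i, weight in enumerate(window):
            j = m + i - half
            if 0 <= j < len(quantized):  # frames outside the recording count as zero
                for c in range(12):
                    smoothed[c] += weight * quantized[j][c]
        norm = math.sqrt(sum(v * v for v in smoothed))
        features.append([v / norm for v in smoothed] if norm > 0.0 else smoothed)
    return features


frame = [6.0, 2.5, 1.2, 0.3] + [0.0] * 8
print(quantize(frame))   # [4, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A larger window size w averages over more frames and therefore yields smoother, more tempo-tolerant features at the cost of temporal detail.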
2.1.1.4 CRP Features
To boost the degree of timbre
invariance, a novel family of
chroma-based audio features has been
introduced. The general idea is to
discard timbre-related information in a
similar fashion as pitch-related
information is discarded in the
computation of mel-frequency cepstral
coefficients (MFCCs). Starting with the
Pitch features, one first applies a
logarithmic compression and transforms
the logarithmized pitch representation
using a DCT. Then, one only keeps the
upper coefficients of the resulting
pitch-frequency cepstral coefficients
(PFCCs), applies an inverse DCT, and
finally projects the resulting pitch
vectors onto 12-dimensional chroma
vectors.
These vectors are referred to as CRP
(Chroma DCT Reduced log Pitch)
features. The upper coefficients to be
kept are specified by a parameter
p ∈ [1 : 120].
Fig 2.6 : CRP Features
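The CRP pipeline can be sketched as below. Several details are assumptions of this sketch rather than statements of the text: the pitch vector is taken to have 120 entries with index i corresponding to MIDI pitch i, the DCT is the orthonormal DCT-II, the defaults p = 55 and η = 1000 are illustrative, and the final chroma vector is normalized to unit Euclidean norm. The transforms are written out in plain Python so that the example has no external dependencies.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(total * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out


def crp(pitch_energy, p=55, eta=1000.0):
    """CRP-style chroma vector from 120 pitch energies (index i <-> MIDI pitch i)."""
    compressed = [math.log(eta * e + 1.0) for e in pitch_energy]
    coeffs = dct(compressed)
    # Keep only the upper coefficients p..120 (1-based), i.e. zero the first p - 1.
    kept = [0.0] * (p - 1) + coeffs[p - 1:]
    pitch_vector = idct(kept)
    chroma = [0.0] * 12
    for index, value in enumerate(pitch_vector):
        chroma[index % 12] += value
    norm = math.sqrt(sum(c * c for c in chroma))
    return [c / norm for c in chroma] if norm > 0.0 else chroma


triad = [0.0] * 120
for pitch in (60, 64, 67):   # C4, E4, G4
    triad[pitch] = 1.0
print([round(v, 3) for v in crp(triad)])
```

With p = 1 no coefficient is discarded and the result reduces to a normalized log-compressed chroma vector; larger values of p remove more of the slowly varying (timbre-related) part of the pitch spectrum.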
3. Methodology
General chroma feature extraction
procedure
The block diagram of the
procedure is shown in Fig 3.1.
Fig 3.1: HPCP block diagram
The general HPCP (chroma) feature
extraction procedure is summarized as
follows:
1. Input the musical signal.
2. Perform spectral analysis to obtain the
frequency components of the music
signal.
3. Use a short-time Fourier transform to
convert the signal into a spectrogram,
which is a time-frequency
representation.
4. Perform frequency filtering. A frequency
range between 100 and 5000 Hz is
used.
5. Perform peak detection. Only the local
maxima of the spectrum are
considered.
6. Compute the reference frequency by
estimating the deviation with respect
to 440 Hz.
7. Perform pitch class mapping with respect
to the estimated reference frequency.
This is a procedure for determining
the pitch class value from frequency
values. A weighting scheme based on a
cosine function is used. It considers
the presence of harmonic frequencies
(harmonic summation procedure),
taking into account a total of 8 harmonics
for each frequency. To obtain a
resolution of one-third of a semitone,
the pitch class distribution vectors
must have a size of 36.
8. Normalize the features frame by
frame, dividing by the
maximum value, to eliminate the
dependency on global loudness. The
result is an HPCP sequence [4].
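Steps 4, 7 and 8 of the procedure can be sketched for a single frame as follows. The input is assumed to be a list of already detected spectral peaks as (frequency, magnitude) pairs. The squared-cosine weighting, the window width of 4/3 semitones, the harmonic decay factor of 0.6 and the use of squared magnitudes are assumptions based on common descriptions of HPCP, not details given in the text, and the function name `hpcp` is hypothetical.

```python
import math


def hpcp(peaks, f_ref=440.0, size=36, window=4.0 / 3.0, n_harmonics=8, decay=0.6):
    """Map spectral peaks [(frequency_hz, magnitude), ...] to a pitch class profile.

    Bin 0 corresponds to the pitch class of f_ref (A for 440 Hz).
    """
    bins = [0.0] * size
    for freq, mag in peaks:
        if not 100.0 <= freq <= 5000.0:          # frequency filtering
            continue
        for h in range(1, n_harmonics + 1):
            # Treat the peak as the h-th harmonic of a lower fundamental.
            fundamental = freq / h
            harmonic_weight = decay ** (h - 1)
            for n in range(size):
                center = f_ref * 2.0 ** (n / size)
                d = 12.0 * math.log2(fundamental / center)   # distance in semitones
                d -= 12.0 * round(d / 12.0)                  # wrap into [-6, 6]
                if abs(d) <= window / 2.0:
                    w = math.cos(math.pi / 2.0 * d / (window / 2.0)) ** 2
                    bins[n] += w * harmonic_weight * mag ** 2
    peak = max(bins)
    return [b / peak for b in bins] if peak > 0.0 else bins


profile = hpcp([(261.63, 1.0)])          # a single peak near C4
print(profile.index(max(profile)))       # 9: three semitones above A, i.e. C
```

With 36 bins, each semitone is covered by three bins, so small tuning deviations from the reference frequency remain visible in the profile.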
There are many ways of
converting an audio recording into a
chromagram. For example, the
conversion may be performed either by
using short-time Fourier transforms in
combination with binning strategies or
by employing suitable multirate filter
banks. Furthermore, the properties of
chroma features can be significantly
changed by introducing suitable pre- and
post-processing steps that modify
spectral, temporal, and dynamical
aspects. This leads to a large number of
chroma variants, which may behave
quite differently in the context of
a specific music analysis scenario [5].
3.1 Performing Fourier Transform
3.1.1 Short Time Fourier Transform
The Fourier transform maps a
time-dependent signal to a
frequency-dependent function which
reveals the spectrum of frequency
components that compose the original
signal. Loosely speaking, a signal and its
Fourier transform are two sides of the
same coin. On the one side, the signal
displays the time information and hides
the information about frequencies. On
the other side, the Fourier transform
reveals information about frequencies
and hides the time information [8].
To recover the hidden time
information, Dennis Gabor introduced in
1946 a modified Fourier
transform, now known as the short-time
Fourier transform or simply STFT. This
transform is a compromise between a
time- and a frequency-based
representation: it determines the
sinusoidal frequency and phase content
of local sections of a signal as it changes
over time. In this way, the STFT
not only tells which frequencies are
“contained” in the signal but also at
which points in time or, to be more
precise, in which time intervals these
frequencies appear.
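The time-localizing behavior of the STFT can be illustrated with a small, dependency-free sketch. It assumes a periodic Hann window, a direct (slow) evaluation of the DFT for each segment, and a synthetic test signal sampled at 8000 Hz; all names and parameter values are illustrative.

```python
import cmath
import math


def stft(signal, window_length, hop):
    """Return DFT coefficients of successive Hann-windowed segments of signal."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / window_length)
              for n in range(window_length)]
    frames = []
    for start in range(0, len(signal) - window_length + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(window_length)]
        # DFT of the windowed segment, keeping the non-negative frequencies.
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_length)
                for n in range(window_length))
            for k in range(window_length // 2 + 1)
        ])
    return frames


def spectrogram(frames):
    """Squared magnitude of every STFT coefficient."""
    return [[abs(c) ** 2 for c in frame] for frame in frames]


# Toy signal: a 500 Hz tone followed by a 1500 Hz tone.
sample_rate = 8000
half = 512
signal = ([math.sin(2.0 * math.pi * 500 * n / sample_rate) for n in range(half)]
          + [math.sin(2.0 * math.pi * 1500 * n / sample_rate) for n in range(half)])
spec = spectrogram(stft(signal, window_length=128, hop=64))
strongest_bin = [frame.index(max(frame)) for frame in spec]
print(strongest_bin[0], strongest_bin[-1])   # 8 24
```

With a window of 128 samples, bin k corresponds to k · 8000 / 128 = 62.5 · k Hz, so the first frame peaks at 500 Hz and the last frame at 1500 Hz. A plain Fourier transform of the whole signal would show both frequencies but not the order in which they occur.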
The Short-Time Fourier
Transform (STFT) is a powerful
general-purpose tool for audio signal
processing. It defines a particularly
useful class of time-frequency
distributions which specify complex
amplitude versus time and frequency for
any signal. The Fourier transform is a
well-known tool for analyzing the
frequency distribution of a signal. Let us
denote the uniformly sampled functions
f(t) and g(t) by f[n] and g[n]. The
discrete STFT (DSTFT) of f is then
defined with respect to a compactly
supported window function g, where
M is the window length of g and N is the
number of samples in f. This algorithm
can be interpreted as a successive
evaluation of Fourier transforms over
short segments of the whole signal.
Additionally, the frequencies can be
visually represented by displaying the
squared magnitude of the Fourier
coefficients at each section. This
diagram is called the spectrogram of
the signal f [15].
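For reference, one common way to write the discrete STFT in terms of the quantities named above (f, g, M and N) is sketched below. The index convention, with frame position m, frequency index k and a window supported on [0 : M − 1], is an assumption and may differ from the form used in the cited sources.

```latex
\mathrm{STFT}\{f\}[m, k] \;=\; \sum_{n=0}^{M-1} f[m + n]\, g[n]\, e^{-2\pi i\, k n / M},
\qquad 0 \le m \le N - M, \quad 0 \le k \le M - 1,
\qquad
\mathrm{Spectrogram}\{f\}[m, k] \;=\; \bigl|\mathrm{STFT}\{f\}[m, k]\bigr|^{2}.
```

In practice the frame position m is usually advanced by a hop size larger than one sample, which reduces the number of frames without changing the form of the expression.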
Fig 3.2 : Short Time Fourier Transform [9]
3.1.2 Constant-Q Transform
Like the Fourier transform, a
constant Q transform is a bank of filters,
but in contrast to the former it has
geometrically spaced center frequencies:
$f_k = f_0 \cdot 2^{k/b}$ $(k = 0, 1, \ldots)$,
where b dictates the number of filters per
octave.
What makes the constant Q transform so
useful is that by an appropriate choice
for f0 (minimal center frequency) and b
the center frequencies directly
correspond to musical notes.
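The correspondence between center frequencies and musical notes can be checked with a short sketch. It assumes the usual geometric spacing f_k = f0 · 2^(k/b), chooses f0 as the frequency of the note C1 (MIDI pitch 24) and b = 12, and defines the quality factor Q as the ratio of a center frequency to the distance to the next one; these choices and the function names are illustrative.

```python
def cqt_center_frequencies(f0, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies f_k = f0 * 2 ** (k / b)."""
    return [f0 * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def quality_factor(bins_per_octave):
    """Ratio f_k / (f_{k+1} - f_k), which is the same for every bin k."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


# Assumed setup: f0 is the note C1 (MIDI pitch 24) and b = 12 bins per octave,
# so bin k corresponds to MIDI pitch 24 + k.
f0 = 440.0 * 2.0 ** ((24 - 69) / 12.0)
freqs = cqt_center_frequencies(f0, bins_per_octave=12, n_bins=84)

print(round(freqs[0], 2))            # 32.7  (C1)
print(round(freqs[45], 2))           # 440.0 (A4, MIDI pitch 69)
print(round(quality_factor(12), 2))  # 16.82
```

Choosing b = 36 instead would give three bins per semitone, at the cost of a higher Q and therefore longer analysis windows at low frequencies.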
Another nice feature of the
constant Q transform is its increasing
time resolution towards higher
frequencies. This resembles the situation
in our auditory system. It is not only the
digital computer that needs more time to
perceive the frequency of a low tone but
also our auditory sense. This is related to
music usually being less agitated in the
lower registers. The constant Q transform
can be viewed as a wavelet transform.
There are at least three reasons why the
CQT has not widely replaced the DFT in
audio signal processing. Firstly, it is
computationally more intensive than the
DFT. Secondly, the CQT lacks an
inverse transform that would allow
perfect reconstruction of the original
signal from its transform coefficients.
Thirdly, CQT produces a data structure
that is more difficult to work with than
the time-frequency matrix (spectrogram)
obtained by using short-time Fourier
transform in successive time frames. The
last problem is due to the fact that in
the CQT, the time resolution varies for
different frequency bins, which in effect
means that the "sampling" of different
frequency bins is not synchronized [20].
3.2 Log-Frequency Spectrogram
We now derive some audio
features from the STFT by converting
the frequency axis (given in Hertz) into
an axis that corresponds to musical
pitches. In Western music, the
equal-tempered scale is most often used,
where the pitches of the scale correspond
to the keys of a piano keyboard. In this
scale, each octave (which is the distance
between two frequencies that differ by a
factor of two) is split up into twelve
logarithmically spaced units. In MIDI
notation, one considers 128 pitches,
which are serially numbered starting
with 0 and ending with 127. The MIDI
pitch p = 69 corresponds to the pitch A4
(having a center frequency of 440 Hz),
which is often used as standard for
tuning musical instruments. In general,
the center frequency $F_{\mathrm{pitch}}(p)$ (in Hz) of a pitch p
∈ [0 : 127] is given by the formula
$F_{\mathrm{pitch}}(p) = 2^{(p-69)/12} \cdot 440$.
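The formula translates directly into code. The following sketch evaluates the center frequency for a few MIDI pitches; the function name `f_pitch` is chosen here to mirror the notation of the text.

```python
def f_pitch(p):
    """Center frequency in Hz of MIDI pitch p (A4 = pitch 69 = 440 Hz)."""
    return 2.0 ** ((p - 69) / 12.0) * 440.0


for p in (57, 60, 69, 81):
    print(p, round(f_pitch(p), 2))
# 57 220.0
# 60 261.63
# 69 440.0
# 81 880.0
```

Moving up by 12 pitches doubles the frequency, and the function also accepts fractional arguments such as p − 0.5 and p + 0.5, which are used as band edges when spectral coefficients are assigned to pitches.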
The logarithmic perception of frequency
motivates the use of a time-frequency
representation with a logarithmic
frequency axis labeled by the pitches of
the equal-tempered scale. To derive such
a representation from a given
spectrogram representation, the basic
idea is to assign each spectral coefficient
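X(m, k) to the pitch with center
frequency that is closest to the frequency
$F_{\mathrm{coef}}(k)$. More precisely, we define for
each pitch p ∈ [0 : 127] the set
$P(p) := \{k \in [0 : K] : F_{\mathrm{pitch}}(p - 0.5) \le F_{\mathrm{coef}}(k) < F_{\mathrm{pitch}}(p + 0.5)\}$.
From this, we obtain a log-frequency
spectrogram $Y_{\mathrm{LF}} : \mathbb{Z} \times [0 : 127] \to \mathbb{R}_{\ge 0}$
defined by
$Y_{\mathrm{LF}}(m, p) := \sum_{k \in P(p)} |X(m, k)|^2$.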
By this definition, the frequency axis is
partitioned logarithmically and labeled
linearly according to MIDI pitches [8].
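The pooling of spectral coefficients into pitch bands can be sketched as follows. The sketch assumes that the frequency of coefficient k is k · Fs / N for a sampling rate Fs and a transform length N, and that the log-frequency spectrogram sums the squared magnitudes within each band; neither detail is spelled out in the text above. The parameter values Fs = 22050 Hz and N = 4096 are illustrative.

```python
def f_pitch(p):
    """Center frequency in Hz of MIDI pitch p."""
    return 2.0 ** ((p - 69) / 12.0) * 440.0


def f_coef(k, sample_rate, n_fft):
    """Assumed frequency in Hz of the k-th spectral coefficient."""
    return k * sample_rate / n_fft


def pitch_bins(p, sample_rate, n_fft):
    """Indices k in [0 : K] whose frequency falls into the band of pitch p."""
    lo, hi = f_pitch(p - 0.5), f_pitch(p + 0.5)
    return [k for k in range(n_fft // 2 + 1)
            if lo <= f_coef(k, sample_rate, n_fft) < hi]


def log_frequency_spectrogram(power_frames, sample_rate, n_fft):
    """Pool power_frames[m][k] = |X(m, k)|^2 into Y[m][p] for p in 0..127."""
    sets = [pitch_bins(p, sample_rate, n_fft) for p in range(128)]
    return [[sum(frame[k] for k in sets[p]) for p in range(128)]
            for frame in power_frames]


print(pitch_bins(69, 22050, 4096))   # [80, 81, 82, 83, 84]
print(pitch_bins(0, 22050, 4096))    # []
```

The empty set for the lowest pitch shows a practical limitation: with a linear frequency resolution of about 5.4 Hz, the bands of low pitches are narrower than the spacing of the spectral coefficients and therefore receive no coefficient at all.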
3.3 Chroma Features Extraction
The human perception of pitch is
periodic in the sense that two pitches are
perceived as similar in “color” (playing a
similar harmonic role) if they differ by
one or several octaves (where, in our
scale, an octave is defined as the
distance of 12 pitches). For example, the
pitches p = 60 and p = 72 are one octave
apart, and the pitches p = 57 and p = 81
are two octaves apart. A pitch can be
separated into two components, which
are referred to as tone height and
chroma. The tone height refers to the
octave number and the chroma to the
respective pitch spelling attribute. In
Western music notation, the 12 pitch
attributes are given by the set
{C, C♯, D, . . . , B} [8].
Enumerating the chroma values,
we identify this set with [0 : 11] where c
= 0 refers to chroma C, c = 1 to C♯, and
so on. A pitch class is defined as the set
of all pitches that share the same
chroma. For example, the pitch class that
corresponds to the chroma c = 0 (C)
consists of the set
{0, 12, 24, 36, 48, 60, 72, 84, 96, 108,
120}
(which are the musical notes
{. . . , C0, C1, C2, C3 . . .}).
The main idea of chroma features
is to aggregate all spectral information
that relates to a given pitch class into a
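single coefficient. Given a pitch-based
log-frequency spectrogram
$Y_{\mathrm{LF}} : \mathbb{Z} \times [0 : 127] \to \mathbb{R}_{\ge 0}$, a chroma
representation or chromagram
$C : \mathbb{Z} \times [0 : 11] \to \mathbb{R}_{\ge 0}$
can be derived by summing up all pitch
coefficients that belong to the same
chroma:
$C(m, c) := \sum_{\{p \in [0 : 127] \,:\, p \bmod 12 = c\}} Y_{\mathrm{LF}}(m, p)$.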
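The summation over all pitches that share a chroma can be written in a few lines. The sketch assumes that the log-frequency spectrogram is given as a list of frames with 128 non-negative values each, indexed by MIDI pitch; the function name `chromagram` is hypothetical.

```python
def chromagram(log_freq_spec):
    """Sum the pitch coefficients of each frame into 12 chroma bands."""
    result = []
    for frame in log_freq_spec:
        chroma = [0.0] * 12
        for p, value in enumerate(frame):
            chroma[p % 12] += value   # pitches p with p mod 12 = c share chroma c
        result.append(chroma)
    return result


# One frame containing the notes C3, C4, E4, G4 and C5 with unit energy.
frame = [0.0] * 128
for p in (48, 60, 64, 67, 72):
    frame[p] = 1.0
print(chromagram([frame])[0])
# [3.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The three C notes from different octaves end up in the same band c = 0, which illustrates how the tone height is discarded while the chroma is kept.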
4. Applications
Identifying pitches that differ by
an octave, chroma features show a high
degree of robustness to variations in
timbre and closely correlate to the
musical aspect of harmony. This is the
reason why chroma features are a
well-established tool for processing and
analyzing music data. For example,
basically every chord recognition
procedure relies on some kind of chroma
representation. Also, chroma features
have become the de facto standard for
tasks such as music alignment and
synchronization as well as audio
structure analysis. Finally, chroma
features have turned out to be a powerful
mid-level feature representation in
content-based audio retrieval such as
cover song identification or audio
matching [5].
Example
Chord recognition:
We carried out a machine learning
project called Guitar Chord Recognition,
which recognizes guitar chords. For
chord recognition, chroma feature
extraction was applied to a guitar chord
dataset. In this work, we investigated
Convolutional Neural Networks (CNNs)
for learning chroma features in the
context of chord recognition.
A large set of data was collected,
and a metadata file, Chords.csv, was
created containing information such as
the class_id, classname, and file_name of
all the audio files in the guitar chord
dataset.
The Librosa library was used to
extract chromagrams from the audio
dataset. A new dataset was built with the
mel spectrograms of the chroma features
of all the audio files in the guitar chord
dataset as input and the corresponding
class_id as output. Since the
audio files in the dataset are of varying
duration (up to 4 s), the size of
the input was fixed to 2 seconds (128
frames), i.e. $X \in \mathbb{R}^{128 \times 87}$.
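The fixed input size can be illustrated with a short sketch. It assumes a sampling rate of 22050 Hz and a hop length of 512 samples with centered frames (the default settings of Librosa), under which 2 seconds of audio yield 1 + floor(44100 / 512) = 87 frames; whether these settings were used in the project is not stated in the text. The helper `fix_length` is hypothetical and operates on plain nested lists.

```python
def n_frames(duration_s, sample_rate=22050, hop_length=512):
    """Number of centered analysis frames for a clip of the given duration."""
    return 1 + int(duration_s * sample_rate) // hop_length


def fix_length(features, target_frames, pad_value=0.0):
    """Truncate or right-pad every feature row to target_frames columns."""
    fixed = []
    for row in features:
        row = list(row[:target_frames])
        row.extend([pad_value] * (target_frames - len(row)))
        fixed.append(row)
    return fixed


target = n_frames(2.0)
print(target)   # 87

long_clip = [[1.0] * 150 for _ in range(128)]    # longer than 2 seconds
short_clip = [[1.0] * 50 for _ in range(128)]    # shorter than 2 seconds
print(len(fix_length(long_clip, target)[0]), len(fix_length(short_clip, target)[0]))
```

Both clips end up with 87 columns, so every example presented to the network has the same shape regardless of the original recording length.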
The new dataset was then
shuffled and divided into training and
test datasets. It was further preprocessed
and fed to a convolutional neural
network, which was trained, and
performance metrics were evaluated on
the basis of the chord classes predicted
from the chroma features of the test
audio dataset.
Hence, chroma feature extraction
was useful for training on the audio
dataset for guitar chord recognition, since
the raw audio data cannot be used for
training directly.
Fig 4.1: Mel Spectrogram Chroma features of
audio files am89.wav, em69.wav, g200.wav and
dm100.wav respectively from guitar chord
dataset
The figure above shows the
results obtained after extracting the
chroma features of the audio files using
Librosa.
5. Acknowledgement
This paper was written to fulfill
the course requirement of COMP-407
Digital Signal Processing (DSP) offered
by the Department of Computer Science
and Engineering (DOCSE). The authors
would like to thank Mr. Satyendra Nath
Lohani, Assistant Professor in the
Department of Computer Science and
Engineering, for providing this wonderful
opportunity to explore and gain new
ideas and experience in a new topic.
6. Conclusion
This paper presents the details of
chroma feature extraction from
audio files and explains the different
types of chroma features and extraction
methods. For our purposes, the short-time
Fourier transform proved to be better
suited than the constant Q transform for
chroma feature extraction, mainly because
the constant Q transform does not have an
inverse transform through which the
original audio could be recovered. This
was the main reason for using the
short-time Fourier transform in
the chord recognition project.
Experimental results using STFT chroma
feature extraction are presented.
References:
[1] "Pitch (music)", En.wikipedia.org, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Pitch_(music). [Accessed: 23- Jan- 2019].
[2] "Chroma", En.wikipedia.org, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Chroma. [Accessed: 23- Jan- 2019].
[3] "chroma", Musicinformationretrieval.com, 2019. [Online]. Available: https://musicinformationretrieval.com/chroma.html. [Accessed: 23- Jan- 2019].
[4] L. Revolvy, ""Harmonic pitch class profiles" on Revolvy.com", Revolvy.com, 2019. [Online]. Available: https://www.revolvy.com/page/Harmonic-pitch-class-profilesAW. [Accessed: 23- Jan- 2019].
[5] "Chroma feature", En.wikipedia.org, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Chroma_feature. [Accessed: 18- Jan- 2019].
[6] Cs.uu.nl, 2019. [Online]. Available: http://www.cs.uu.nl/docs/vakken/msmt/lectures/SMT_B_Lecture5_DSP_2017.pdf. [Accessed: 19- Jan- 2019].
[7] Pdfs.semanticscholar.org, 2019. [Online]. Available: https://pdfs.semanticscholar.org/6432/19014e8aa48dda060cecf4ff413dd3ee1e3a.pdf. [Accessed: 23- Jan- 2019].
[8] Audiolabs-erlangen.de, 2019. [Online]. Available: https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/02-teaching/2016s_apl/LabCourse_STFT.pdf. [Accessed: 23- Jan- 2019].
[9] 2019. [Online]. Available: http://mac.citi.sinica.edu.tw/~yang/tigp/lecture02_stft_yhyang_mir_2018.pdf. [Accessed: 18- Jan- 2019].
[10] Mirlab.org, 2019. [Online]. Available: http://www.mirlab.org/conference_papers/international_conference/ICASSP%202014/papers/p5880-kovacs.pdf. [Accessed: 19- Jan- 2019].
[11] Scholarship.claremont.edu, 2019. [Online]. Available: https://scholarship.claremont.edu/cgi/viewcontent.cgi?referer=https://www.google.com.np/&httpsredir=1&article=1575&context=cmc_theses. [Accessed: 18- Jan- 2019].
[12] Atkison.cs.ua.edu, 2019. [Online]. Available: http://atkison.cs.ua.edu/papers/FT_as_FE.pdf. [Accessed: 17- Jan- 2019].
[13] "Advances in Music Information Retrieval", Google Books, 2019. [Online]. Available: https://books.google.com.np/books?id=gy5qCQAAQBAJ&pg=PA340&lpg=PA340&dq=Chroma+Feature+Extraction+using+Short+Time+Fourier+Transform(STFT)&source=bl&ots=JSxWIB9vm9&sig=ACfU3U1oZHnD8Ohu-YGLaTycJXZLB895DA&hl=ne&sa=X&ved=2ahUKEwj0uY-B6YDgAhVZaCsKHXBYAhcQ6AEwCXoECAEQAQ#v=onepage&q=Chroma%20Feature%20Extraction%20using%20Short%20Time%20Fourier%20Transform(STFT)&f=false. [Accessed: 17- Jan- 2019].
[14] 2019. [Online]. Available: https://www.researchgate.net/figure/Feature-extraction-process-using-short-time-Fourier-transform-STFT-Spectrograms-of-A_fig1_236264209. [Accessed: 22- Jan- 2019].
[15] "The Short-Time Fourier Transform | Spectral Audio Signal Processing", Dsprelated.com, 2019. [Online]. Available: https://www.dsprelated.com/freebooks/sasp/Short_Time_Fourier_Transform.html. [Accessed: 22- Jan- 2019].
[16] Kingma, D., & Ba, J. (2019). Adam: A Method for Stochastic Optimization. Retrieved from https://arxiv.org/abs/1412.6980v8
[17] Arxiv.org, 2019. [Online]. Available: https://arxiv.org/pdf/1811.01222.pdf. [Accessed: 22- Jan- 2019].
[18] 2019. [Online]. Available: https://www.researchgate.net/publication/290632086_Evaluation_and_comparison_of_audio_chroma_feature_extraction_methods. [Accessed: 22- Jan- 2019].
[19] Iem.kug.ac.at, 2019. [Online]. Available: https://iem.kug.ac.at/fileadmin/media/iem/projects/2010/smc10_schoerkhuber.pdf. [Accessed: 22- Jan- 2019].
[20] Doc.ml.tu-berlin.de, 2019. [Online]. Available: http://doc.ml.tu-berlin.de/bbci/material/publications/Bla_constQ.pdf. [Accessed: 22- Jan- 2019].