Chroma Feature Extraction
A. K. Shah, M. Kattel, A. Nepal, D. Shrestha
Department of Computer Science and Engineering, School of Engineering
Kathmandu University, Nepal
ayush.kumar.shah@gmail.com, manasikattel1@gmail.com, araju7nepal@gmail.com,
deepeshshrestha@outlook.com
Abstract
The chroma feature is a descriptor that represents the tonal content of a musical audio signal in a condensed form. Chroma features can therefore be considered an important prerequisite for high-level semantic analysis, such as chord recognition or harmonic similarity estimation. Higher-quality chroma features enable much better results in these high-level tasks. Short-Time Fourier Transforms and Constant-Q Transforms are used for chroma feature extraction.
Keywords: Fourier transform,
spectrogram, chroma representation,
chroma vector
1. Introduction
Over the past few years, the need for music information retrieval and classification systems has become more urgent, which gave rise to a research area called Music Information Retrieval (MIR). Chroma feature extraction is an important task in the analysis of music and in music transcription in general, and it can contribute to applications such as key detection, structural segmentation, music similarity measures, and other semantic analysis tasks.
Pitch
Pitch is a perceptual property of
sounds that allows their ordering on a
frequency-related scale, or more
commonly, pitch is the quality that
makes it possible to judge sounds as
"higher" and "lower" in the sense
associated with musical melodies. Pitch
can be determined only in sounds that
have a frequency that is clear and stable
enough to distinguish from noise. Pitch
is a major auditory attribute of musical
tones, along with duration, loudness, and
timbre [1].
Chroma
Chroma is a quality of a pitch class that refers to the "color" of a musical pitch. A pitch can be decomposed into an octave-invariant value called "chroma" and a "pitch height" that indicates the octave the pitch is in [2].
Chroma Vector
A chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, ..., B}, is present in the signal. The chroma vector is a perceptually motivated feature vector: it uses the concept of chroma in the cyclic helix representation of musical pitch perception. The chroma vector thus represents the magnitudes of the twelve pitch classes of a standard chromatic scale [3].
Chroma features
In music, the term chroma
feature or chromagram closely relates to
the twelve different pitch classes.
Chroma-based features, which are also
referred to as "pitch class profiles", are a
powerful tool for analyzing music whose
pitches can be meaningfully categorized
(often into twelve categories) and whose
tuning approximates the
equal-tempered scale. One main
property of chroma features is that they
capture harmonic and melodic
characteristics of music, while being
robust to changes in timbre and
instrumentation.
Chroma features aim at representing the harmonic content (e.g., keys, chords) of a short-time window of audio. The feature vector is extracted from the magnitude spectrum by using a short-time Fourier transform (STFT), a constant-Q transform (CQT), Chroma Energy Normalized Statistics (CENS), etc. [5].
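As a minimal illustration, these chroma variants can be computed with the librosa library in Python; the file name example.wav is a placeholder and the default parameters are assumptions:

import librosa

# load a mono signal at librosa's default sampling rate
y, sr = librosa.load("example.wav", sr=22050)

# chroma variants computed from different time-frequency representations
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)  # STFT-based chroma
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)    # constant-Q-based chroma
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)  # chroma energy normalized statistics

print(chroma_stft.shape)  # (12, number of frames)

Each result is a 12 x N matrix with one chroma vector per analysis frame.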
Harmonic Pitch Class Profile (HPCP)
Harmonic pitch class profiles (HPCP) are a group of features extracted from an audio signal, based on the pitch class profile, a descriptor proposed in the context of a chord recognition system. HPCP features are an enhanced pitch distribution representation: sequences of feature vectors that, to a certain extent, describe tonality by measuring the relative intensity of each of the 12 pitch classes of the equal-tempered scale within an analysis frame. Often, the twelve pitch spelling attributes are also referred to as chroma, and HPCP features are closely related to chroma features or chromagrams [4].
2. Background
Features of audio
Two basic features of an audio signal are frequency and amplitude.
Frequency is the speed of the
vibration, and this determines the pitch
of the sound. It is only useful or
meaningful for musical sounds, where
there is a strongly regular waveform. It
is measured as the number of wave
cycles that occur in one second. The
unit of frequency measurement is Hertz.
Amplitude is the size of the
vibration, and this determines how loud
the sound is. We have already seen that
larger vibrations make a louder sound.
Amplitude is important when balancing
and controlling the loudness of sounds,
such as with the volume control on your
CD player.
Fig 2.1: Features of audio
2.1 Chroma Features
The underlying observation is
that humans perceive two musical
pitches as similar in color if they differ
by an octave. Based on this observation,
a pitch can be separated into two
components, which are referred to as
tone height and chroma. Assuming the
equal-tempered scale, one considers
twelve chroma values represented by the
set
{C, C♯, D, D♯, E ,F, F♯, G, G♯, A, A♯,
B}
that consists of the twelve pitch spelling
attributes as used in Western music
notation. Note that in the equal-tempered scale different pitch spellings such as C♯ and D♭ refer to the same chroma.
Enumerating the chroma values, one can
identify the set of chroma values with
the set of integers {1,2,...,12}, where 1
refers to chroma C, 2 to C♯, and so on. A
pitch class is defined as the set of all
pitches that share the same chroma. For
example, using the scientific pitch
notation, the pitch class corresponding to
the chroma C is the set
{..., C−2, C−1, C0, C1, C2, C3 ...}
consisting of all pitches separated by an
integer number of octaves. Given a
music representation (e.g. a musical
score or an audio recording), the main
idea of chroma features is to aggregate
for a given local time window (e.g.
specified in beats or in seconds) all
information that relates to a given
chroma into a single coefficient. Shifting
the time window across the music
representation results in a sequence of
chroma features each expressing how the
representation's pitch content within the
time window is spread over the twelve
chroma bands. The resulting
time-chroma representation is also
referred to as chromagram. The figure
below shows chromagrams for a
C-major scale, once obtained from a
musical score and once from an audio
recording [5].
Fig 2.2 : (a) Musical score of a C-major scale.
(b) Chromagram obtained from the score. (c)
Audio recording of the C-major scale played on
a piano. (d) Chromagram obtained from the
audio recording.
2.1.1 Types of Chroma Features [7]
2.1.1.1 CP Feature
From the pitch representation, one can obtain a chroma representation by simply adding up the corresponding values that belong to the same chroma. To achieve invariance in dynamics, we normalize each chroma vector with respect to the Euclidean norm. The resulting features are referred to as Chroma-Pitch, denoted by CP.
Fig 2.3 : CP Feature
2.1.1.2 CLP Features
To account for the logarithmic
sensation of sound intensity, one often
applies a logarithmic compression when
computing audio features [11]. To this
end, the local energy values e of the
pitch representation are logarithmized
before deriving the chroma
representation.
Here, each entry e is replaced by the value log(η · e + 1), where η is a suitable positive constant. The resulting features, which depend on the compression parameter η, are referred to as Chroma-Log-Pitch, denoted by CLP[η].
Fig 2.4 : CLP Features
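As a hedged sketch (not a reference implementation), the CLP computation can be written in Python as follows; pitch_energy stands for a 128 x T pitch representation and the value of η is only an example:

import numpy as np

def clp(pitch_energy, eta=100.0):
    # logarithmic compression: replace each energy value e by log(eta * e + 1)
    v = np.log(eta * pitch_energy + 1.0)
    # fold the 128 MIDI pitches into the 12 chroma bands
    chroma = np.zeros((12, v.shape[1]))
    for p in range(v.shape[0]):
        chroma[p % 12] += v[p]
    # normalize each chroma vector with respect to the Euclidean norm
    norms = np.linalg.norm(chroma, axis=0, keepdims=True)
    return chroma / np.maximum(norms, 1e-12)

Omitting the compression step reduces this sketch to the CP feature described above.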
2.1.1.3. CENS Features
Adding a further degree of
abstraction by considering short-time
statistics over energy distributions within
the chroma bands, one obtains CENS
(Chroma Energy Normalized Statistics)
features, which constitute a family of
scalable and robust audio features. These
features have turned out to be very
useful in audio matching and retrieval
applications. In computing CENS
features, a quantization is applied based
on logarithmically chosen thresholds.
This introduces some kind of
logarithmic compression similar to the
CLP[η] features. Furthermore, these
features allow for introducing a temporal
smoothing. Here, feature vectors are
averaged using a sliding window
technique depending on a window size
denoted by w (given in frames) and a
downsampling factor denoted by d. In
the following, we do not change the
feature rate and consider only the case d
= 1 (no downsampling). Therefore, the
resulting feature only depends on the
parameter w and is denoted by
CENS[w].
Fig 2.5 : CENS Features
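The following hedged sketch illustrates the CENS idea on top of a chroma matrix; the quantization thresholds and the Hann smoothing window are illustrative assumptions, not the exact values used in the literature:

import numpy as np

def cens(chroma, w=41):
    # normalize each frame so that its chroma values form an energy distribution
    chroma = chroma / np.maximum(chroma.sum(axis=0, keepdims=True), 1e-12)
    # quantization based on logarithmically chosen thresholds
    quantized = np.zeros_like(chroma)
    for t in (0.05, 0.1, 0.2, 0.4):
        quantized += (chroma > t).astype(float)
    # temporal smoothing with a sliding window of w frames (d = 1, no downsampling)
    kernel = np.hanning(w)
    smoothed = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 1, quantized)
    # normalize each frame with respect to the Euclidean norm
    return smoothed / np.maximum(np.linalg.norm(smoothed, axis=0, keepdims=True), 1e-12)

In practice, librosa.feature.chroma_cens provides a ready-made CENS implementation.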
2.1.1.4 CRP Features
To boost the degree of timbre
invariance, a novel family of
chroma-based audio features has been
introduced. The general idea is to
discard timbre-related information in a
similar fashion as pitch-related
information is discarded in the
computation of mel-frequency cepstral
coefficients (MFCCs). Starting with the
Pitch features, one first applies a
logarithmic compression and transforms
the logarithmized pitch representation
using a DCT. Then, one only keeps the
upper coefficients of the resulting
pitch-frequency cepstral coefficients
(PFCCs), applies an inverse DCT, and
finally projects the resulting pitch
vectors onto 12-dimensional chroma
vectors.
These vectors are referred to as CRP
(Chroma DCT Reduced log Pitch)
features. The upper coefficients to be kept are specified by a parameter p ∈ [1 : 120].
Fig 2.6: CRP Features
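A hedged sketch of this DCT-based reduction is given below; pitch_energy is assumed to be a 128 x T pitch representation, the compression constant η and the cut-off p are example values, and the lowest p coefficients are simply zeroed out:

import numpy as np
from scipy.fftpack import dct, idct

def crp(pitch_energy, eta=1000.0, p=55):
    v = np.log(eta * pitch_energy + 1.0)                 # logarithmic compression
    pfcc = dct(v, type=2, axis=0, norm="ortho")          # DCT along the pitch axis
    pfcc[:p] = 0.0                                       # discard the lower coefficients
    reduced = idct(pfcc, type=2, axis=0, norm="ortho")   # inverse DCT
    chroma = np.zeros((12, reduced.shape[1]))
    for q in range(reduced.shape[0]):                    # project onto 12 chroma bins
        chroma[q % 12] += reduced[q]
    # normalize each frame with respect to the Euclidean norm
    return chroma / np.maximum(np.linalg.norm(chroma, axis=0, keepdims=True), 1e-12)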
3. Methodology
General chroma feature extraction
procedure
The block diagram of the procedure is shown in Fig. 3.1.
Fig 3.1: HPCP block diagram
The General HPCP (chroma) feature
extraction procedure is summarized as
follows:
1. Input musical signal.
2. Do spectral analysis to obtain the
frequency components of the music
signal.
3. Use Fourier transform to convert the
signal into a spectrogram. (The
Fourier transform is a type of
time-frequency analysis.)
4. Do frequency filtering. A frequency range between 100 and 5000 Hz is used.
5. Do peak detection. Only the local
maximum values of the spectrum are
considered.
6. Do the reference frequency computation procedure. Estimate the deviation with respect to 440 Hz.
7. Do pitch class mapping with respect to the estimated reference frequency. This is a procedure for determining the pitch class value from frequency values. A weighting scheme with a cosine function is used. It considers the presence of harmonic frequencies (harmonic summation procedure), taking into account a total of 8 harmonics for each frequency. To map the values onto one third of a semitone, the size of the pitch class distribution vectors must be equal to 36.
8. Normalize the feature frame by frame, dividing by the maximum value, to eliminate dependency on global loudness. The result is an HPCP sequence such as the one shown in [4]. A compact sketch of these steps is given after this list.
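The sketch below follows steps 1-4, 7, and 8 in Python; peak detection (step 5) and the reference frequency estimation (step 6) are omitted, and the FFT size, hop size, and harmonic weighting are simplifying assumptions rather than the exact scheme described above:

import numpy as np
import librosa

def simple_hpcp(path, n_bins=36, f_ref=440.0, fmin=100.0, fmax=5000.0, n_harmonics=8):
    # steps 1-3: load the signal and compute a magnitude spectrogram via the STFT
    y, sr = librosa.load(path, sr=22050)
    S = np.abs(librosa.stft(y, n_fft=4096, hop_length=2048))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=4096)
    pcp = np.zeros((n_bins, S.shape[1]))
    # step 4: frequency filtering between fmin and fmax
    for k in np.flatnonzero((freqs >= fmin) & (freqs <= fmax)):
        # step 7: harmonic summation with a decaying weight per harmonic
        for h in range(1, n_harmonics + 1):
            f0 = freqs[k] / h                    # candidate fundamental of this component
            if f0 < fmin:
                break
            b = int(round(n_bins * np.log2(f0 / f_ref))) % n_bins   # pitch class mapping
            pcp[b] += (S[k] ** 2) * 0.8 ** (h - 1)
    # step 8: frame-wise normalization by the maximum value
    return pcp / np.maximum(pcp.max(axis=0, keepdims=True), 1e-12)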
There are many ways to convert an audio recording into a
chromagram. For example, the
conversion of an audio recording into a
chroma representation (or chromagram)
may be performed either by using
short-time Fourier transforms in
combination with binning strategies or
by employing suitable multirate filter
banks. Furthermore, the properties of
chroma features can be significantly
changed by introducing suitable pre- and
post-processing steps modifying
spectral, temporal, and dynamical
aspects. This leads to a large number of
chroma variants, which may show a
quite different behavior in the context of
a specific music analysis scenario [5].
3.1 Performing Fourier Transform
3.1.1 Short Time Fourier Transform
The Fourier transform maps a
time-dependent signal to a
frequency-dependent function which
reveals the spectrum of frequency
components that compose the original
signal. Loosely speaking, a signal and its
Fourier transform are two sides of the
same coin. On the one side, the signal
displays the time information and hides
the information about frequencies. On
the other side, the Fourier transform
reveals information about frequencies
and hides the time information [8].
To recover the hidden time information, Dennis Gabor introduced in 1946 a modified Fourier transform, now known as the short-time Fourier transform or simply STFT. This
transform is a compromise between a
time- and a frequency-based
representation by determining the
sinusoidal frequency and phase content
of local sections of a signal as it changes
over time. In this way, the STFT not only tells which frequencies are "contained" in the signal but also at which points in time or, to be more precise, in which time intervals these frequencies appear.
The Short-Time Fourier
Transform (STFT) is a powerful
general-purpose tool for audio signal
processing. It defines a particularly
useful class of time-frequency
distributions which specify complex
amplitude versus time and frequency for
any signal. The Fourier transform is a well-known tool for analyzing the frequency distribution of a signal. Let us denote the uniformly sampled f(t) and g(t) functions by f[n] and g[n]. Then the discrete STFT (DSTFT) over a compactly supported window function g can be written as

DSTFT{f}[m, k] = Σ_n f[n] · g[n − m] · e^(−j2πkn/N), k = 0, ..., N − 1,

where M is the window length of g and N is the number of samples in f. This algorithm can be interpreted as a successive evaluation of Fourier transforms over short segments of the whole signal. Additionally, the frequencies can be visually represented by displaying the squared magnitude of the Fourier coefficients at each section. This diagram is called the spectrogram of the signal f [15].
Fig 3.2 : Short Time Fourier Transform [9]
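A minimal numpy sketch of this successive evaluation is shown below; the Hann window and the hop size are assumptions:

import numpy as np

def dstft(f, window_length=2048, hop=512):
    g = np.hanning(window_length)                # compactly supported window g
    n_frames = 1 + (len(f) - window_length) // hop
    X = np.empty((window_length, n_frames), dtype=complex)
    for m in range(n_frames):
        # Fourier transform of a short, windowed segment of the signal
        segment = f[m * hop : m * hop + window_length] * g
        X[:, m] = np.fft.fft(segment)
    return X

# the spectrogram displays the squared magnitude of the coefficients:
# spectrogram = np.abs(dstft(signal)) ** 2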
3.1.2 Constant-Q Transform
Like the Fourier transform, a constant-Q transform is a bank of filters, but in contrast to the former it has geometrically spaced center frequencies

f_k = f0 · 2^(k/b), k = 0, 1, 2, ...,

where b dictates the number of filters per octave.
What makes the constant Q transform so
useful is that by an appropriate choice
for f0 (minimal center frequency) and b
the center frequencies directly
correspond to musical notes.
Another nice feature of the
constant Q transform is its increasing
time resolution towards higher
frequencies. This resembles the situation
in our auditory system. It is not only the
digital computer that needs more time to
perceive the frequency of a low tone but
also our auditory sense. This is related to
music usually being less agitated in the
lower registers. The constant-Q transform can be viewed as a wavelet transform.
There are at least three reasons why the
CQT has not widely replaced the DFT in
audio signal processing. Firstly, it is
computationally more intensive than the
DFT. Secondly, the CQT lacks an
inverse transform that would allow
perfect reconstruction of the original
signal from its transform coefficients.
Thirdly, CQT produces a data structure
that is more difficult to work with than
the time-frequency matrix (spectrogram)
obtained by using short-time Fourier
transform in successive time frames. The
last problem is due to the fact that in
CQT, the time resolution varies for
different frequency bins, in effect
meaning that the "sampling" of different frequency bins is not synchronized [20].
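For illustration, a constant-Q spectrum whose bins line up with musical notes can be obtained with librosa as sketched below; the choices f0 = C1 and b = 12 bins per octave over 7 octaves are assumptions:

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050)
C = np.abs(librosa.cqt(y, sr=sr,
                       fmin=librosa.note_to_hz("C1"),  # minimal center frequency f0
                       bins_per_octave=12,             # b filters per octave
                       n_bins=84))                     # 7 octaves of semitone bins
# a chromagram can also be derived directly from the constant-Q representation
chroma_cq = librosa.feature.chroma_cqt(y=y, sr=sr)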
3.2 Log-Frequency Spectrogram
We now derive some audio
features from the STFT by converting
the frequency axis (given in Hertz) into
an axis that corresponds to musical
pitches. In Western music, the
equal-tempered scale is most often used,
where the pitches of the scale correspond
to the keys of a piano keyboard. In this
scale, each octave (which is the distance
of two frequencies that differ a factor of
two) is split up into twelve
logarithmically spaced units. In MIDI
notation, one considers 128 pitches,
which are serially numbered starting
with 0 and ending with 127. The MIDI
pitch p = 69 corresponds to the pitch A4
(having a center frequency of 440 Hz),
which is often used as standard for
tuning musical instruments. In general, the center frequency Fpitch(p) of a pitch p ∈ [0 : 127] is given by the formula

Fpitch(p) = 2^((p−69)/12) · 440 Hz.
The logarithmic perception of frequency
motivates the use of a time-frequency
representation with a logarithmic
frequency axis labeled by the pitches of
the equal-tempered scale. To derive such
a representation from a given
spectrogram representation, the basic
idea is to assign each spectral coefficient
X (m, k) to the pitch with center
frequency that is closest to the frequency
Fcoef(k). More precisely, we define for each pitch p ∈ [0 : 127] the set

P(p) := {k ∈ [0 : K] : Fpitch(p − 0.5) ≤ Fcoef(k) < Fpitch(p + 0.5)}.

From this, we obtain a log-frequency spectrogram YLF : Z × [0 : 127] → R≥0 defined by

YLF(m, p) := Σ_{k ∈ P(p)} |X(m, k)|².

By this definition, the frequency axis is partitioned logarithmically and labeled linearly according to MIDI pitches [8].
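The binning defined above can be sketched in Python as follows; the FFT size and sampling rate are assumptions, and X is a complex STFT matrix:

import numpy as np

def log_frequency_spectrogram(X, sr=22050, n_fft=4096):
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)       # F_coef(k) for every bin k
    Y = np.zeros((128, X.shape[1]))
    for p in range(128):
        lower = 440.0 * 2 ** ((p - 0.5 - 69) / 12)   # F_pitch(p - 0.5)
        upper = 440.0 * 2 ** ((p + 0.5 - 69) / 12)   # F_pitch(p + 0.5)
        bins = np.flatnonzero((freqs >= lower) & (freqs < upper))   # the set P(p)
        Y[p] = (np.abs(X[bins]) ** 2).sum(axis=0)    # pool the spectral energy over P(p)
    return Y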
3.3 Chroma Features Extraction
The human perception of pitch is
periodic in the sense that two pitches are
perceived as similar in “color” (playing a
similar harmonic role) if they differ by
one or several octaves (where, in our
scale, an octave is defined as the
distance of 12 pitches). For example, the pitches p = 60 and p = 72 are one octave apart, and the pitches p = 57 and p = 81 are two octaves apart. A pitch can be
separated into two components, which
are referred to as tone height and
chroma. The tone height refers to the
octave number and the chroma to the
respective pitch spelling attribute. In
Western music notation, the 12 pitch
attributes are given by the set
{C, C♯, D, . . . , B} [8].
Enumerating the chroma values, we identify this set with [0 : 11], where c = 0 refers to chroma C, c = 1 to C♯, and so on. A pitch class is defined as the set of all pitches that share the same chroma. For example, the pitch class that corresponds to the chroma c = 0 (C) consists of the set {0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120} (which are the musical notes C−1, C0, C1, . . . , C9 in scientific pitch notation).
The main idea of chroma features is to aggregate all spectral information that relates to a given pitch class into a single coefficient. Given a pitch-based log-frequency spectrogram YLF : Z × [0 : 127] → R≥0, a chroma representation or chromagram C : Z × [0 : 11] → R≥0 can be derived by summing up all pitch coefficients that belong to the same chroma:

C(m, c) := Σ_{p ∈ [0 : 127] : p mod 12 = c} YLF(m, p).
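Continuing the sketch from the previous section, this chroma folding amounts to a 12-way summation over MIDI pitches:

import numpy as np

def chromagram(Y_LF):
    # Y_LF: log-frequency spectrogram of shape (128, number of frames)
    C = np.zeros((12, Y_LF.shape[1]))
    for p in range(128):
        C[p % 12] += Y_LF[p]   # all pitches sharing a chroma are summed into one band
    return C

The columns of the resulting chromagram can then be normalized as described in Section 2.1.1.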
4. Applications
By identifying pitches that differ by an octave, chroma features show a high degree of robustness to variations in timbre and correlate closely with the musical aspect of harmony. This is the
reason why chroma features are a
well-established tool for processing and
analyzing music data. For example,
basically every chord recognition
procedure relies on some kind of chroma
representation. Also, chroma features
have become the de facto standard for
tasks such as music alignment and
synchronization as well as audio
structure analysis. Finally, chroma
features have turned out to be a powerful
mid-level feature representation in
content-based audio retrieval such as
cover song identification or audio
matching [5].
Example
Chord recognition:
We carried out a machine learning project called Guitar Chord Recognition, which recognizes guitar chords. For chord recognition, chroma feature extraction was applied to a guitar chord dataset. In this work, we investigated Convolutional Neural Networks (CNNs) for learning chroma features in the context of chord recognition.
A large set of data was collected, and a metadata file, Chords.csv, was created containing information such as the class_id, classname, and file_name of all the audio files in the guitar chord dataset.
The librosa library was used to extract chromagrams from the audio files. A new dataset was built, consisting of the mel spectrograms of the chroma features of all the audio files in the guitar chord dataset as input and the corresponding class_id as output. Since the audio files in the dataset are of varying duration (up to 4 s), the size of the input was fixed to 2 seconds, i.e. X ∈ R^(128 × 87).
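The extraction step can be sketched as follows; this is an illustrative reconstruction rather than the project's exact code, and the file handling, padding, and feature choices are assumptions:

import numpy as np
import librosa

def chord_features(path, sr=22050, n_frames=87):
    # load at most 2 seconds of audio so that every example has the same length
    y, _ = librosa.load(path, sr=sr, duration=2.0)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # 12 x T chromagram
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # 128-band mel spectrogram
    # pad or trim along the time axis to a fixed number of frames
    chroma = librosa.util.fix_length(chroma, size=n_frames, axis=1)
    mel = librosa.util.fix_length(mel, size=n_frames, axis=1)
    return chroma, mel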
The new dataset was then shuffled, divided into training and test sets, further preprocessed, and fed to a convolutional neural network. The network was trained, and performance metrics were evaluated on the basis of the chord classes predicted from the chroma features of the test audio data. Hence, chroma feature extraction was useful for training on the audio dataset for guitar chord recognition, since the audio data cannot be used for training directly.
Fig 4.1: Mel Spectrogram Chroma features of
audio files am89.wav, em69.wav, g200.wav and
dm100.wav respectively from guitar chord
dataset
The plots above show the results obtained after extracting the chroma features of the audio files using librosa.
5. Acknowledgement
This paper was written to fulfill the course requirement of COMP-407 Digital Signal Processing (DSP) offered by the Department of Computer Science and Engineering (DOCSE). The authors would like to thank Mr. Satyendra Nath Lohani, Assistant Professor at the Department of Computer Science and Engineering, for providing this wonderful opportunity to explore and gain new ideas and experience in a new topic.
6. Conclusion
This paper presents the details of chroma feature extraction from audio files, and the different extraction methods for the chroma feature are explained. The short-time Fourier transform proved to be more suitable than the constant-Q transform for chroma feature extraction, mainly because the constant-Q transform lacks an inverse transform with which the original audio could be regained. This was the main reason for using the short-time Fourier transform in the chord recognition project. Experimental results using STFT-based chroma feature extraction are presented.
References:
[1] "Pitch (music)", En.wikipedia.org,
2019. [Online]. Available:
https://en.wikipedia.org/wiki/Pitch_(mus
ic). [Accessed: 23- Jan- 2019].
[2]"Chroma", En.wikipedia.org, 2019.
[Online]. Available:
https://en.wikipedia.org/wiki/Chroma.
[Accessed: 23- Jan- 2019].
[3]"chroma",
Musicinformationretrieval.com, 2019.
[Online]. Available:
https://musicinformationretrieval.com/ch
roma.html. [Accessed: 23- Jan- 2019].
[4]L. Revolvy, ""Harmonic pitch class
profiles" on Revolvy.com", Revolvy.com,
2019. [Online]. Available:
https://www.revolvy.com/page/Harmonic
-pitch-class-profilesAW. [Accessed: 23-
Jan- 2019].
[5]"Chroma feature", En.wikipedia.org,
2019. [Online]. Available:
https://en.wikipedia.org/wiki/Chroma_fe
ature. [Accessed: 18- Jan- 2019].
[6]Cs.uu.nl, 2019. [Online]. Available:
http://www.cs.uu.nl/docs/vakken/msmt/le
ctures/SMT_B_Lecture5_DSP_2017.pdf.
[Accessed: 19- Jan- 2019].
[7]Pdfs.semanticscholar.org, 2019.
[Online]. Available:
https://pdfs.semanticscholar.org/6432/19
014e8aa48dda060cecf4ff413dd3ee1e3a.
pdf. [Accessed: 23- Jan- 2019].
[8]Audiolabs-erlangen.de, 2019.
[Online]. Available:
https://www.audiolabs-erlangen.de/conte
nt/05-fau/professor/00-mueller/02-teachi
ng/2016s_apl/LabCourse_STFT.pdf.
[Accessed: 23- Jan- 2019].
[9]2019. [Online]. Available:
http://mac.citi.sinica.edu.tw/~yang/tigp/l
ecture02_stft_yhyang_mir_2018.pdf.
[Accessed: 18- Jan- 2019].
[10]Mirlab.org, 2019. [Online].
Available:
http://www.mirlab.org/conference_paper
s/international_conference/ICASSP%20
2014/papers/p5880-kovacs.pdf.
[Accessed: 19- Jan- 2019].
[11]Scholarship.claremont.edu, 2019. [Online]. Available: https://scholarship.claremont.edu/cgi/viewcontent.cgi?referer=https://www.google.com.np/&httpsredir=1&article=1575&context=cmc_theses. [Accessed: 18- Jan- 2019].
[12]Atkison.cs.ua.edu, 2019. [Online].
Available:
http://atkison.cs.ua.edu/papers/FT_as_F
E.pdf. [Accessed: 17- Jan- 2019].
[13]"Advances in Music Information
Retrieval", Google Books, 2019.
[Online]. Available:
https://books.google.com.np/books?id=g
y5qCQAAQBAJ&pg=PA340&lpg=PA34
0&dq=Chroma+Feature+Extraction+us
ing+Short+Time+Fourier+Transform(S
TFT)&source=bl&ots=JSxWIB9vm9&si
g=ACfU3U1oZHnD8Ohu-YGLaTycJXZ
LB895DA&hl=ne&sa=X&ved=2ahUKE
wj0uY-B6YDgAhVZaCsKHXBYAhcQ6A
EwCXoECAEQAQ#v=onepage&q=Chr
oma%20Feature%20Extraction%20usin
g%20Short%20Time%20Fourier%20Tra
nsform(STFT)&f=false. [Accessed: 17-
Jan- 2019].
[14]2019. [Online]. Available:
https://www.researchgate.net/figure/Feat
ure-extraction-process-using-short-time-
Fourier-transform-STFT-Spectrograms-
of-A_fig1_236264209. [Accessed: 22-
Jan- 2019].
[15]"The Short-Time FourierTransform |
Spectral Audio Signal Processing",
Dsprelated.com, 2019. [Online].
Available:
https://www.dsprelated.com/freebooks/sa
sp/Short_Time_Fourier_Transform.html.
[Accessed: 22- Jan- 2019].
[16]Kingma, D., & Ba, J. (2019). Adam:
A Method for Stochastic Optimization.
Retrieved from
https://arxiv.org/abs/1412.6980v8
[17]Arxiv.org, 2019. [Online].
Available:
https://arxiv.org/pdf/1811.01222.pdf.
[Accessed: 22- Jan- 2019].
[18]2019. [Online]. Available:
https://www.researchgate.net/publication
/290632086_Evaluation_and_compariso
n_of_audio_chroma_feature_extraction_
methods. [Accessed: 22- Jan- 2019].
[19]Iem.kug.ac.at, 2019. [Online].
Available:
https://iem.kug.ac.at/fileadmin/media/ie
m/projects/2010/smc10_schoerkhuber.pd
f. [Accessed: 22- Jan- 2019].
[20]Doc.ml.tu-berlin.de, 2019. [Online].
Available:
http://doc.ml.tu-berlin.de/bbci/material/
publications/Bla_constQ.pdf. [Accessed:
22- Jan- 2019].
