FUNDAMENTAL FREQUENCY CONTOUR CLASSIFICATION: A COMPARISON BETWEEN
HAND-CRAFTED AND CNN-BASED FEATURES
Jakob Abeßer¹, Meinard Müller²
¹ Semantic Music Technologies Group, Fraunhofer IDMT, Ilmenau, Germany
² International Audio Laboratories Erlangen, Germany
ABSTRACT
In this paper, we evaluate hand-crafted features as well as features learnt from data using a convolutional neural network (CNN) for different fundamental frequency classification tasks. We compare classification based on full (variable-length) contours and classification based on fixed-size subcontours in combination with a fusion strategy. Our results show that hand-crafted and learnt features lead to comparable results for both classification scenarios. Aggregating contour-level to file-level classification results generally improves the results. In comparison to the hand-crafted features, our examination indicates that the CNN-based features show a higher degree of redundancy across feature dimensions, where multiple filters (convolution kernels) specialize in similar contour shapes.
Index Terms: fundamental frequency contours, feature learning, convolutional neural networks, activation maximization
1. INTRODUCTION
In recent years, data-driven feature learning algorithms based on deep neural networks have often outperformed traditional, hand-crafted analysis methods that exploit domain expert knowledge in analysis and classification scenarios. As a main disadvantage, however, learnt feature representations often lack a clear interpretation and give little insight into the problem at hand.
In the field of Music Information Retrieval (MIR), fundamental frequency (f0) contours, i.e., variable-length time-series representations of the pitch curve of musical notes, are a rich mid-level representation as they provide cues for both music performance analysis and music content analysis [1]. For example, frequency contours have been successfully applied to MIR tasks such as playing and singing style analysis, as well as genre and musical instrument classification. In general, the reliable extraction of f0 contours from polyphonic audio mixtures remains challenging to this day. One open issue is how best to map variable-length f0 contours to fixed-size feature representations for music classification applications.
As the main contribution of this paper, we systematically evaluate different contour feature representations for a wide range of MIR classification tasks. In particular, we compare hand-crafted features (knowledge-driven approach) with features learnt from data (data-driven approach). To capture dependencies over time, various sequence modeling techniques such as recurrent neural networks (RNN) or auto-regressive models exist. In this paper, we focus on CNN-based methods for two reasons: first, shift-invariance with respect to time is a useful property in our context and, second, convolution kernels allow for better interpretability (in terms of filters). As a further contribution, we discuss a fusion approach based on fixed-size segments (subcontours), which leads to better classification results than approaches based on variable-length contours.
2. RELATED WORK
One prominent application scenario for frequency contour analysis in MIR is to classify instrument playing techniques as part of automatic transcription algorithms. For example, Barbancho et al. [7], Abeßer et al. [8], and Kehling et al. [3] showed for isolated violin, bass guitar, and electric guitar recordings, respectively, that typical frequency modulation techniques such as vibrato, bending, and slides can be classified with accuracies above 90 % on the note level. For ensemble recordings, the classification problem becomes much harder. For example, Abeßer et al. reported in [9] accuracy values between 48 % (fall-off) and 92 % (vibrato) for common modulation techniques in trumpet and saxophone jazz solos. The authors proposed a set of contour features that measure modulation, fluctuation, and the average gradient of f0 contours (see PYMUS feature set, Section 4.1). In a follow-up publication, these features were used to investigate how the pitch modulation range and the intonation depend on the musical context (within a solo) and on the artist [10].
In [11], Salamon et al. used the Melodia melody detection algorithm [12] to extract f0 contours from polyphonic music recordings. Based on these f0 contours, the authors described a set of low-dimensional features including contour duration, pitch range, as well as vibrato rate, extent, and coverage. These contour features outperformed low-level timbre features for genre classification. Panteli et al. proposed a set of contour features (see BITTELI feature set, Section 4.1) for singing style analysis [13]. First, f0 contours were classified according to vocal/non-vocal categories. Then, a dictionary-learning approach based on spherical k-means clustering was used to derive fixed-size activation histograms for vocal contours. Finally, these histograms were used as features to analyze different singing styles. Using the same feature set, Bittner et al. reported in [1] accuracy values around 0.72 for related tasks like vocal/non-vocal, bass/non-bass, melody/non-melody, and singer’s gender (male, female) classification.
3. DATASETS & CLASSIFICATION SCENARIOS
In this paper, we use four datasets, which cover various music analysis tasks and different levels of timbre complexity. Table 1 provides a general overview of all datasets. We apply a post-processing step (resampling) so that all contours across all four datasets have the same time resolution of 5.8 ms. As will be detailed in Section 4.2, we only consider contours from 197.2 ms to 1995.2 ms duration.
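The resampling and duration filter described above could be sketched roughly as follows. This is only an illustration under stated assumptions: linear interpolation, strictly increasing input time stamps, and the function names `resample_contour` and `has_valid_duration` are hypothetical and not taken from the paper. The frame limits follow from the stated durations (34 × 5.8 ms = 197.2 ms and 344 × 5.8 ms = 1995.2 ms).

```python
FRAME_HOP_MS = 5.8   # target time resolution stated in the text
MIN_FRAMES = 34      # 34 * 5.8 ms = 197.2 ms
MAX_FRAMES = 344     # 344 * 5.8 ms = 1995.2 ms (derived from the stated maximum)


def resample_contour(times_ms, f0_hz, hop_ms=FRAME_HOP_MS):
    """Linearly interpolate an f0 contour onto a uniform time grid.

    Assumes strictly increasing time stamps (an assumption of this sketch).
    """
    if len(times_ms) != len(f0_hz) or len(times_ms) < 2:
        raise ValueError("need at least two (time, f0) pairs of equal length")
    start, end = times_ms[0], times_ms[-1]
    n_frames = int((end - start) / hop_ms + 1e-9) + 1
    resampled = []
    j = 0  # index of the source segment [times_ms[j], times_ms[j + 1]]
    for i in range(n_frames):
        t = start + i * hop_ms
        # Advance to the source segment that contains the target time t.
        while j < len(times_ms) - 2 and times_ms[j + 1] < t:
            j += 1
        t0, t1 = times_ms[j], times_ms[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(f0_hz[j] + weight * (f0_hz[j + 1] - f0_hz[j]))
    return resampled


def has_valid_duration(contour, min_frames=MIN_FRAMES, max_frames=MAX_FRAMES):
    """Keep only contours whose length lies in the duration range used here."""
    return min_frames <= len(contour) <= max_frames
```

A contour originally sampled every 10 ms would thus be mapped onto a 5.8 ms grid before the length check is applied.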
Table 1: Dataset overview. Bold print in column “Classes” indicates corresponding class abbreviations used in Figure 1. The final three columns indicate the number of classes, contours (C), subcontours (SC), and files in each dataset.
Label | Task | Classes | Dataset | Complexity | Contour Estimator | Classes | C (SC) | Files
GENRE | Music Genre | flamenco, instrumental jazz, opera, pop, vocal jazz | In-house dataset [2] | multitimbral | Melodia | 5 | 12531 (487386) | 499
GUITAR | Playing Style (guitar) | bending (BE), normal (NO), slide (SL), vibrato (VI) | IDMT-SMT-GUITAR [3] | monotimbral | Score + pYin | 4 | 2240 (67728) | 191
INST | Instrument | clarinet, flute, saxophone, singing voice (female), singing voice (male), trumpet, violin | IDMT-MONOTIMBRAL [4] | multitimbral | Melody Transcription + Peak Tracking | 8 | 10214 (179743) | 180
WJD | Playing Style (saxophone, trumpet etc.) | (pitch) bend, fall, slide, vibrato | Weimar Jazz Database (WJD) [5] | monotimbral after source sep. | Score-Informed Separation + pYin [6] | 4 | 4964 (126808) | 360
The GENRE database was used in [11] and contains 12531 contours from 500 30-second excerpts equally distributed among the five music genres opera, pop, flamenco, vocal jazz, and instrumental jazz. The contours were extracted using the Melodia algorithm [12], covering a frequency range of five octaves between 55 Hz and 1760 Hz.
The GUITAR dataset includes 2240 tones extracted from monotimbral electric guitar recordings in the IDMT-SMT-GUITAR dataset [3]. The notes are annotated with the four playing style classes bending, normal (stable pitch), slide, and vibrato. The pYin pitch tracker [6] was used for note-wise contour extraction.
The INST dataset includes 10214 contours extracted from the IDMT-MONOTIMBRAL dataset [4]. Here, only the monophonic instrument classes violin, flute, trumpet, saxophone, clarinet, as well as female and male singing voice were considered. Contours were extracted by first running the automatic melody transcription algorithm by Dressler [14], followed by partial tracking based on linear interpolation as part of the solo/accompaniment source separation [15].
The WJD dataset includes a subset of 4964 tones taken from the Weimar Jazz Database (WJD) [5], which are annotated with one of the four playing style classes drop-off, slide, pitch-bend, and vibrato. The WJD includes jazz ensemble recordings with predominant solo instruments such as trumpet, trombone, and tenor, alto, and soprano saxophone. Using the same procedure as described in [5], we first applied score-informed solo/accompaniment source separation [15] to extract the solo instrument, and then applied the pYin pitch tracking algorithm [6] to extract frequency contours for all notes.
Figure 1 shows seven randomly chosen example contours for each dataset and each class. For instance, characteristic contour shapes such as the periodic frequency modulation of vibrato tones can be recognized for the playing style classification tasks (GUITAR and WJD). For high-level classification tasks such as genre classification (GENRE) and instrument classification (INST), the classes tend to be less homogeneous in the sense that there are often several different contour shapes associated with a single class. Furthermore, some contours cover multiple playing techniques such as an initial pitch slide followed by a vibrato (see WJD examples).
4. FEATURE REPRESENTATIONS
4.1. Hand-Crafted Audio Features
In our experiments, we use two hand-crafted feature sets. The first one is called BITTELI¹ and was introduced in [13]. From this feature set, we use 18 features including 6 features that capture the shape and coverage of vibrato, 8 features derived from a polynomial approximation of frequency contours, as well as 4 features derived from global statistics.
¹ https://github.com/rabitt/motif
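To give a flavour of the simplest feature category, the global statistics, a rough sketch is shown below. It is not the implementation from the BITTELI repository: the choice of mean, standard deviation, and range as the three pitch statistics is an assumption made here for illustration, as is the function name `global_statistics`.

```python
import statistics

FRAME_HOP_MS = 5.8  # time resolution used for all contours in this paper


def global_statistics(f0_hz):
    """Return (duration in ms, mean, standard deviation, range) of a contour.

    The three pitch statistics are assumed for illustration; the paper only
    states that one duration feature and three pitch features are used.
    """
    if len(f0_hz) < 2:
        raise ValueError("contour needs at least two frames")
    duration_ms = len(f0_hz) * FRAME_HOP_MS
    mean_f0 = statistics.fmean(f0_hz)
    std_f0 = statistics.pstdev(f0_hz)
    range_f0 = max(f0_hz) - min(f0_hz)
    return duration_ms, mean_f0, std_f0, range_f0
```

Such features map a contour of any length to a fixed number of values, which is what allows the hand-crafted sets to process variable-length contours directly.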
Fig. 1: Randomly selected contours from each class of the four datasets introduced in Table 1. Panels: (a) GENRE, (b) GUITAR, (c) INST, (d) WJD.
Fig. 2: Neural network architecture used for automatic contour feature learning (see Section 4.2).
The second hand-crafted feature set is the PYMUS² set, which consists of 17 features including 3 features describing vibrato characteristics, 10 features measuring different contour shape properties related to fluctuation and gradient, as well as 4 features derived from a temporal contour segmentation. The two feature sets contain different types of features and overlap only with regard to vibrato rate descriptors.
² https://github.com/jazzomat/python/tree/master/pymus
Table 2: Features included in the hand-crafted feature sets BITTELI and PYMUS, sorted by three feature categories. Salience-based features are discarded.
Category | BITTELI | PYMUS
Vibrato features | Rate (1); Extent (1); Coverage features (4) | Modulation frequency (1); Modulation dominance (1); Number of modulation periods (1)
Contour shape features | Polynomial fit on frequency contour (8); Polynomial fit on salience contour (7) | Measures for intonation & fluctuation (6); Frequency gradient descriptors (4); Contour segmentation features (4)
Global statistics | Duration (1); Pitch (3); Salience (3) |
Total | 18 | 17
Fig. 3: Summary of the contour classification strategies. Variable-length contours can be processed solely by hand-crafted features (BITTELI, PYMUS). Fixed-size subcontours can be processed by all feature extractors. (Diagram stages: Input, Feature Extractors, Feature Vectors, Class Probabilities, Aggregation, Prediction.)
4.2. Feature Learning
Next, we introduce feature sets that are automatically derived using a data-driven approach based on CNNs. In particular, as shown in Figure 2, we compare two neural network architectures with one (CNN-1) or two (CNN-2) processing blocks followed by two fully-connected (FC) layers. Each processing block includes a one-dimensional convolutional layer (CONV) followed by batch normalization (BN) [16], a rectified linear unit (ReLU), and a dropout (DO) [17] layer. In each convolutional layer, we empirically selected 30 filters, each with a size of 10 frames (corresponding to a duration of 58 ms). We used the Adam optimizer, a learning rate of 10^-3, and a batch size of 256. Optimizing the hyperparameters is not within the scope of this paper.
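To make the core of one processing block more tangible, the following pure-Python sketch applies a single one-dimensional filter of size 10 followed by a ReLU to a toy subcontour. Several things are assumptions of this sketch and not stated in the paper: the filter weights (a hypothetical rising-pitch detector), the absence of padding, and the toy input. Batch normalization and dropout are omitted.

```python
KERNEL_SIZE = 10  # filter length from the text: 10 frames * 5.8 ms = 58 ms


def conv1d_valid(signal, kernel, bias=0.0):
    """Cross-correlate signal with kernel without padding (assumed here)."""
    n_out = len(signal) - len(kernel) + 1
    if n_out < 1:
        raise ValueError("signal is shorter than the kernel")
    return [
        sum(k * signal[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(n_out)
    ]


def relu(values):
    """Rectified linear unit: negative responses are set to zero."""
    return [max(0.0, v) for v in values]


# Hypothetical filter that responds to rising pitch: it compares the sum of
# the last five frames of its window with the sum of the first five frames.
rising_filter = [-1.0] * (KERNEL_SIZE // 2) + [1.0] * (KERNEL_SIZE // 2)

# Toy subcontour of 34 frames: flat for 17 frames, then a linear upward slide.
subcontour = [0.0] * 17 + [float(i) for i in range(1, 18)]

activation = relu(conv1d_valid(subcontour, rising_filter))
```

Because the same filter is slid over all positions, an upward slide produces the same response wherever it occurs in the subcontour, which is the shift-invariance mentioned in Section 1.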
Each contour is assigned to one class (compare Table 1). These classes are used as targets for the final softmax layer to train the models in a supervised fashion. After training, the activations of the penultimate fully-connected layer are used as features. For training the networks, we extract fixed-size subcontours from the variable-length contours as input data for the CNN model using a window size W and a hop size of 50 %. The class label of each contour is transferred to its subcontours. Since a common range for the vibrato rate is between 5 and 12 Hz [5], we use W = 34 frames (corresponding to 197.2 ms) as window size to capture at least one full vibrato period. In our classification experiment described in Section 5, we include all original contours that have a minimum length of 34 frames (197.2 ms). As shown in Figure 3, the predicted class probabilities on the subcontour level are aggregated by averaging to obtain contour-level predictions.
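The subcontour extraction and the averaging-based fusion can be sketched as follows. This is a minimal illustration, assuming that trailing frames which do not fill a complete window are dropped (the paper does not say how they are handled); the function names are hypothetical.

```python
def extract_subcontours(contour, window=34, hop_fraction=0.5):
    """Cut a variable-length contour into fixed-size, overlapping subcontours.

    With window = 34 and a hop size of 50 %, consecutive windows start
    17 frames apart. Incomplete trailing windows are dropped (assumption).
    """
    hop = max(1, int(window * hop_fraction))
    last_start = len(contour) - window
    return [contour[s:s + window] for s in range(0, last_start + 1, hop)]


def average_probabilities(prob_rows):
    """Average class-probability vectors, e.g. of all subcontours of a contour."""
    if not prob_rows:
        raise ValueError("need at least one probability vector")
    n = len(prob_rows)
    return [sum(column) / n for column in zip(*prob_rows)]


def predict_class(prob_rows):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_probabilities(prob_rows)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

The same averaging step can be reused one level up, where contour-level probabilities are averaged into a file-level prediction.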
5. EVALUATION
The four feature sets have comparable numbers of feature dimensions: PYMUS (17), BITTELI (18), CNN-1 (17), and CNN-2 (17). We perform a three-fold cross-validation. In each fold, we repeat the following procedure for each of the classification tasks. The current dataset is randomly split into a training and a test set (80 % : 20 %) based on unique file assignments, so that f0 contours from the same recording do not end up in both sets. Using the subcontours extracted from the training set as input and their class labels as targets, we train the CNN-1 and CNN-2 models. Afterwards, we extract feature vectors from the training set using all four feature extractors, normalize them to zero mean and unit variance, and train four independent random forest classifier models [18] with 50 trees each. Finally, we evaluate the classification performance on the test set by applying feature scaling using the mean and standard deviation values derived from the training set and computing the F1 score of the model predictions. The random forest classifier was chosen as it easily allows us to further analyze the importance of different feature dimensions for the trained models (compare Section 6.2). As illustrated in Figure 3, we compare both subcontour-level and contour-level classification for the two hand-crafted feature sets BITTELI and PYMUS. In addition, for the GENRE and INST datasets, we investigate a file-level aggregation strategy by averaging over all contour-level class probabilities.
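Two steps of this protocol are easy to get wrong and are sketched below: splitting by file instead of by contour, and scaling the test features with statistics taken from the training set only. The code is an illustrative sketch with hypothetical function names; the random seed and the rule for constant features (standard deviation replaced by 1) are assumptions of this sketch.

```python
import random
import statistics


def split_by_file(file_ids, test_fraction=0.2, seed=0):
    """Split contour indices so that no file contributes to both sets."""
    unique_files = sorted(set(file_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_files)
    n_test = max(1, round(test_fraction * len(unique_files)))
    test_files = set(unique_files[:n_test])
    train_idx = [i for i, f in enumerate(file_ids) if f not in test_files]
    test_idx = [i for i, f in enumerate(file_ids) if f in test_files]
    return train_idx, test_idx


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation on the training set."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Constant features get a divisor of 1.0 to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Scale feature vectors with previously fitted training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

The scaler is fitted once on the training features and then applied unchanged to the test features, so no information from the test set leaks into the normalization.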
Table 3 shows the mean F1 scores obtained from the cross-validation folds for all combinations of classification level, dataset, and aggregation strategy. As general findings, we observe very similar scores for hand-crafted and learnt features as well as for subcontour and contour classification. Across all feature extractors, aggregating the contour-level classification results into file-level results clearly boosts the F1 scores by up to 0.24.
Table 3: Mean F1 scores from 3-fold cross-validation. Results are shown for different feature extractors, datasets, classification levels (C = contour, SC = subcontour), and result aggregation levels (contour-level, file-level). Best results for each dataset are highlighted in bold print.
Extractor | Level | GENRE (contour-level) | GUITAR (contour-level) | INST (contour-level) | WJD (contour-level) | GENRE (file-level) | INST (file-level)
BITTELI | C | 0.51 | 0.97 | 0.38 | 0.87 | 0.73 | 0.56
BITTELI | SC | 0.54 | 0.96 | 0.43 | 0.82 | 0.76 | 0.54
PYMUS | C | 0.53 | 0.98 | 0.35 | 0.87 | 0.79 | 0.52
PYMUS | SC | 0.55 | 0.97 | 0.31 | 0.83 | 0.85 | 0.45
CNN-1 | SC | 0.54 | 0.95 | 0.34 | 0.83 | 0.85 | 0.49
CNN-2 | SC | 0.63 | 0.96 | 0.43 | 0.84 | 0.94 | 0.67
The highest scores are achieved for the two playing-style classification datasets GUITAR and WJD. While all four models perform comparably well on the easier-to-classify GUITAR dataset, the hand-crafted features perform better on the WJD dataset. For the more difficult classification tasks based on the INST and GENRE datasets, the two-layer CNN model (CNN-2) clearly outperforms its simpler counterpart (CNN-1) and the two hand-crafted feature sets. Presumably, the CNN-2 model can learn to recognize more complex contour shapes.
6. CNN MODEL INSPECTION
6.1. Prototype Contours
In the following, we aim to get a better insight into which contour shapes the CNN-based features capture. As an example, we investigate the two-layer CNN-2 model. We use the activation maximization algorithm [19]³ to generate frequency contours that maximize the activations of each of the 17 neurons in the penultimate dense layer, i.e., the individual feature dimension values. Figure 4 shows such contours for each of the four datasets. Despite the different underlying classification tasks, the networks learn to recognize similar contour shapes: increasing and decreasing frequency contours (pitch slides), alternating sequences of increasing and decreasing contour parts (pitch bends), as well as modulating (vibrato-like) shapes.
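The idea behind activation maximization can be illustrated with a deliberately simplified toy example. Instead of a trained network, the sketch below uses a single linear unit with hypothetical weights, and instead of the regularizers of the actual implementation it uses a plain L2 penalty; both are assumptions made only to keep the example self-contained. The input is updated by gradient ascent so that the unit's activation becomes as large as the penalty allows.

```python
def maximize_activation(weights, l2_weight=0.5, learning_rate=0.1, n_steps=200):
    """Gradient ascent on the input x for the objective w.x - l2_weight * |x|^2.

    For this toy objective the optimum is known in closed form:
    x = w / (2 * l2_weight).
    """
    x = [0.0] * len(weights)
    for _ in range(n_steps):
        # Partial derivative of the objective with respect to x_i:
        # w_i - 2 * l2_weight * x_i
        x = [
            xi + learning_rate * (wi - 2.0 * l2_weight * xi)
            for xi, wi in zip(x, weights)
        ]
    return x


# Hypothetical unit that prefers a rising contour (low first, high later).
prototype = maximize_activation([-1.0, -1.0, 1.0, 1.0])
```

For a deep network the gradient is obtained by backpropagation through all layers, but the principle is the same: the resulting input is a prototype of the shape the unit responds to most strongly.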
6.2. Feature Set Redundancy
The contours shown in Figure 4 indicate a certain redundancy, as similar shapes can be found across neurons. In order to measure the information redundancy within different feature sets, we first compute all pair-wise correlation coefficients between features of the same set. Here, we only focus on features extracted from subcontours. We observe significantly higher mean correlation values of 0.451 for the learnt feature sets than for the hand-crafted feature sets (0.189).
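A rough sketch of this redundancy measure is given below. The paper does not state whether signed or absolute correlation coefficients are averaged; the sketch assumes absolute Pearson correlations, and it treats constant features as uncorrelated. Function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # constant feature: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def mean_abs_correlation(feature_matrix):
    """Average |r| over all pairs of feature dimensions (columns)."""
    columns = list(zip(*feature_matrix))
    pairs = list(combinations(columns, 2))
    if not pairs:
        raise ValueError("need at least two feature dimensions")
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
```

Here `feature_matrix` holds one row per subcontour and one column per feature dimension, so a value close to 1 means that most dimensions carry nearly the same information.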
³ We used the keras-vis implementation of the activation maximization algorithm [20] with hand-tuned parameter values tv_weight = 0.01 and lp_norm_weight = 0.01.
Fig. 4: Normalized frequency contours that maximize neuron activations in the penultimate dense layer, which are used as feature vectors. Panels: (a) GENRE, (b) GUITAR, (c) INST, (d) WJD.
Additionally, we analyze the feature importances, which measure the effect of each feature in the random forest models. Low information redundancy leads to only a few features having high importance values, whereas high redundancy would result in an almost equal distribution. We compute the entropy H to measure the uniformity of the distribution over the feature importance values. We observe higher entropy values for the feature learning configurations (0.960) than for the configurations using hand-crafted features (0.925). Both results indicate that discriminative information is more concentrated in the hand-crafted features, where only a subset of the features has a strong effect in the classifier models. In contrast, the learnt features show a higher information redundancy across feature dimensions.
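The entropy measure can be sketched as follows. Since the reported values lie below 1, the sketch assumes that H is normalized by its maximum, the logarithm of the number of features; this normalization is an assumption and is not stated in the paper.

```python
import math


def normalized_entropy(importances):
    """Entropy of the importance distribution, scaled to the range [0, 1].

    A value of 1 means all features are equally important; a value of 0
    means a single feature carries all the importance.
    """
    total = sum(importances)
    if len(importances) < 2 or total <= 0.0:
        raise ValueError("need at least two features with positive total importance")
    probs = [v / total for v in importances]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(importances))
```

Applied to the feature importance values of a trained random forest, a higher value indicates that the importance is spread more evenly across the feature dimensions.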
7. CONCLUSION
This paper compares hand-crafted features and automatically learnt features for different f0 contour classification tasks. Our findings show that embedding features from a simple, non-optimized neural network architecture with two convolutional layers can outperform hand-crafted features based on expert knowledge. Multiple convolutional layers allow for learning more complex contour shapes, which is beneficial especially for higher-level tasks such as genre and instrument recognition.
The evaluation results show that using fixed-length subcontours in combination with an aggregation strategy leads to classification accuracies comparable to those of an approach using full contours. As an advantage, classifying subcontours allows for a more complex (time-dependent) description of playing techniques on the note level (e.g., an initial pitch bend followed by a vibrato). A close investigation of the CNN-based features revealed that the learnt feature sets have a higher redundancy across feature dimensions than the hand-crafted features. Reducing the number of filters and the amount of redundancy while maintaining the classification performance could be a guideline for model optimization.
8. ACKNOWLEDGEMENTS
This work has been supported by the German Research Foundation (AB 675/2-1, MU 2686/11-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institut für Integrierte Schaltungen IIS. The authors would like to thank Justin Salamon, Rachel Bittner, and Maria Panteli for valuable discussions and for sharing their contour datasets.
9. REFERENCES
[1] Rachel M. Bittner, Justin Salamon, Juan J. Bosch, and Juan P. Bello, “Pitch contours as a mid-level representation for music informatics,” in AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
[2] Justin Salamon, Geoffroy Peeters, and Axel Röbel, “Statistical characterisation of melodic pitch contours and its application for melody extraction,” in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, 2012, pp. 187–192.
[3] Christian Kehling, Jakob Abeßer, Christian Dittmar, and Gerald Schuller, “Automatic Tablature Transcription of Electric Guitar Recordings by Estimation of Score- and Instrument-related Parameters,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Erlangen, Germany, September 2014.
[4] Juan Gómez, Jakob Abeßer, and Estefanía Cano, “Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018.
[5] Martin Pfleiderer, Klaus Frieler, Jakob Abeßer, Wolf-Georg Zaddach, and Benjamin Burkhart, Eds., Inside the Jazzomat - New Perspectives for Jazz Research, Schott Campus, 2017.
[6] Matthias Mauch and Simon Dixon, “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 659–663.
[7] Isabel Barbancho, Christina de la Bandera, Ana M. Barbancho, and Lorenzo J. Tardon, “Transcription and expressiveness detection system for violin music,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 189–192.
[8] Jakob Abeßer, Hanna Lukashevich, and Gerald Schuller, “Feature-based Extraction of Plucking and Expression Styles of the Electric Bass Guitar,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, USA, 2010, pp. 2290–2293.
[9] Jakob Abeßer, Estefanía Cano, Klaus Frieler, Martin Pfleiderer, and Wolf-Georg Zaddach, “Score-informed analysis of intonation and pitch modulation in jazz solos,” in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 2015, pp. 823–829.
[10] Jakob Abeßer, Klaus Frieler, Estefanía Cano, Martin Pfleiderer, and Wolf-Georg Zaddach, “Score-informed analysis of tuning, intonation, pitch modulation, and dynamics in jazz solos,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 168–177, Jan 2017.
[11] Justin Salamon, Bruno Rocha, and Emilia Gómez, “Musical genre classification using melody features extracted from polyphonic music signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012.
[12] Justin Salamon and Emilia Gómez, “Melody extraction from polyphonic music signals using pitch contour characteristics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012.
[13] Maria Panteli, Rachel M. Bittner, Juan Pablo Bello, and Simon Dixon, “Towards the characterization of singing styles in world music,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 636–640.
[14] Karin Dressler, Automatic transcription of the melody from polyphonic music, Ph.D. thesis, TU Ilmenau, Germany, Jul 2017.
[15] Estefanía Cano, Gerald Schuller, and Christian Dittmar, “Pitch-informed solo and accompaniment separation: towards its use in music education applications,” EURASIP Journal on Advances in Signal Processing, pp. 1–19, 2014.
[16] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[18] Leo Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct 2001.
[19] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Visualizing higher-layer features of a deep network,” Tech. Rep. 1341, University of Montreal, June 2009.
[20] Raghavendra Kotikalapudi and contributors, “keras-vis,” https://github.com/raghakot/keras-vis, 2017.
... This development begs the question, "How does a performer implicitly embed musical intent, personal characteristics, and their surrounding environment directly into a performance?" Challenges raised by this question are especially relevant to vocal music as 1 recordings of both amateur and professional singers become available more rapidly than controlled, curated and annotated datasets. One answer to this question invokes data-driven designs that seek to capture complex or abstract musical meaning without the assistance of external annotation. ...
... The F 0 contour -the continuous trajectory of fundamental frequency in time -is a commonly used feature for numerous MIR problems including instrument recognition and classification in monophonic or polyphonic recordings [3,4], melody extraction in polyphonic music [32,33], genre and style labeling [1,3,25], classification of affect in the singing voice [34], and vocal synthesis [36]. The importance of these features in the success of these MIR tasks has led to significant research on contour classification [1,3,4,25,32,33] and intonation patterns in the singing voice [7,34,35]. ...
... The F 0 contour -the continuous trajectory of fundamental frequency in time -is a commonly used feature for numerous MIR problems including instrument recognition and classification in monophonic or polyphonic recordings [3,4], melody extraction in polyphonic music [32,33], genre and style labeling [1,3,25], classification of affect in the singing voice [34], and vocal synthesis [36]. The importance of these features in the success of these MIR tasks has led to significant research on contour classification [1,3,4,25,32,33] and intonation patterns in the singing voice [7,34,35]. The use of pitch contours as robust features has also prompted the creation of accurate short-time pitch analysis and extraction methods such as probabilistic-Yin [19] and CREPE [16]. ...
Preprint
Full-text available
In this paper we aim to learn meaningful representations of sung intonation patterns derived from surrounding data without supervision. We focus on two facets of context in which a vocal line is produced: 1) within the short-time context of contiguous vocalizations, and 2) within the larger context of a recording. We propose two unsupervised deep learning methods, pseudo-task learning and slot filling, to produce latent encodings of these con-textual representations. To evaluate the quality of these representations and their usefulness as meaningful feature space, we conduct classification tasks on recordings sung by both professional and amateur singers. Initial results indicate that the learned representations enhance the performance of downstream classification tasks by several points, as compared to learning directly from the intonation contours alone. Larger increases in performance on classification of technique and vocal phrase patterns suggest that the representations encode short-time temporal context learned directly from the original recordings. Additionally, their ability to improve singer and gender identification suggest the learning of more broad contextual pat-terns. The growing availability of large unlabeled datasets makes this idea of contextual representation learning additionally promising, with larger amounts of meaningful samples often yielding better performance
... One of the main problems in this task is extracting features from highly dynamic time-frequency textures of singing techniques. Convolutional neural networks (CNNs) have been recently used as effective methods to capture audio features for singing technique classification [2,3,4] as well as similar objectives such as musical playing technique recognition [5]. Although square-shaped kernels, e.g., 3×3 and 5×5, are commonly used in CNNs, it has been shown that customizing the kernel shape improves the classification performance. ...
Preprint
Full-text available
Singing techniques are used for expressive vocal performances by employing temporal fluctuations of the timbre, the pitch, and other components of the voice. Their classification is a challenging task, because of mainly two factors: 1) the fluctuations in singing techniques have a wide variety and are affected by many factors and 2) existing datasets are imbalanced. To deal with these problems, we developed a novel audio feature learning method based on deformable convolution with decoupled training of the feature extractor and the classifier using a class-weighted loss function. The experimental results show the following: 1) the deformable convolution improves the classification results, particularly when it is applied to the last two convolutional layers, and 2) both re-training the classifier and weighting the cross-entropy loss function by a smoothed inverse frequency enhance the classification performance.
... Indeed, most of the proposed methods deal with image type data. However, several DNN-based solutions have been proposed for the shape classification (Droby and El-Sana, 2020), (Lu et al., 2021) (Abeßer and Müller, 2019). Therefore, it turns out to be useful to study the robustness of deep contour classifiers. ...
Conference Paper
Full-text available
DNN certification using abstract interpretation often deals with image-type data, and subsequently evaluates the robustness of the deep classifiers against disturbances on the images such as geometric transformations, occlusion and convolutional noises by modeling them as an abstract domain. In this paper, we propose Con-tourVerifier, a new system for the evaluation of contour classifiers as we have formulated the abstract domains generated by rigid displacements on contours. This formulation allowed us to estimate the robustness of deep classifiers with different architectures and on different databases. This work will serve as a fundamental building block for the certification of deep models developed for shape recognition.
Conference Paper
Content-based Music Informatics includes tasks that involve estimating the pitched content of music, such as the main melody or the bass line. To date, the field lacks a good machine representation that models the human perception of pitch, with each task using specific, tailored representations. This paper proposes factoring pitch estimation problems into two stages, where the output of the first stage for all tasks is a multipitch contour representation. Further, we propose the adoption of pitch contours as a unit of pitch organization. We give a review of the existing work on contour extraction and characterization and present experiments that demonstrate the discriminability of pitch contours.
Conference Paper
The paper presents new approaches for analyzing the characteristics of intonation and pitch modulation of woodwind and brass solos in jazz recordings. To this end, we use score-informed analysis techniques for source separation and fundamental frequency tracking. After splitting the audio into a solo and a backing track, a reference tuning frequency is estimated from the backing track. Next, we compute the fundamental frequency contour for each tone in the solo and a set of features describing its temporal shape. Based on this data, we first investigate whether the tuning frequencies of jazz recordings changed over the decades of the last century. Second, we analyze whether the intonation is artist-specific. Finally, we examine how the modulation frequency of vibrato tones depends on contextual parameters such as pitch, duration, and tempo as well as the performing artist.
Conference Paper
Predominant instrument recognition in ensemble recordings remains a challenging task, particularly if closely related instruments such as alto and tenor saxophone need to be distinguished. In this paper, we build upon a recently proposed instrument recognition algorithm based on a hybrid deep neural network: a combination of convolutional and fully connected layers for learning characteristic spectral-temporal patterns. We systematically evaluate harmonic/percussive and solo/accompaniment source separation algorithms as pre-processing steps to reduce the overlap among multiple instruments prior to the instrument recognition step. For the particular use case of solo instrument recognition in jazz ensemble recordings, we further apply transfer learning techniques to fine-tune a previously trained instrument recognition model for classifying six jazz solo instruments. Our results indicate that both source separation as a pre-processing step and transfer learning clearly improve recognition performance, especially for smaller subsets of highly similar instruments.
Article
Both the collection and analysis of large music repertoires constitute major challenges within musicological disciplines such as jazz research. Automatic methods of music analysis based on audio signal processing have the potential to assist researchers and to accelerate the transcription and analysis of music recordings significantly. In this paper, we propose a framework for analyzing improvised monophonic solos in multi-instrumental jazz recordings with special focus on reed and brass instruments. The analysis algorithms rely on prior score-information, which is taken from high quality manual solo transcriptions. Following an initial solo and accompaniment source separation, we propose algorithms for tone-wise extraction of fundamental frequency and intensity contours. Based on this fine-grained representation of recorded jazz solos, we perform several exploratory experiments motivated by questions relating to jazz research in order to analyze the use of expressive stylistic devices such as intonation, pitch modulation, and dynamics in jazz solos. The results show that a score-informed audio analysis of jazz recordings can provide valuable insights into the individual stylistic characteristics of jazz musicians.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
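The train/test asymmetry described in this abstract can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names and the retain probability `keep_prob` are chosen for this example. Scaling the activations at test time stands in for the "smaller weights" of the unthinned network, since multiplying a unit's output by `keep_prob` has the same effect as multiplying its outgoing weights by `keep_prob`.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training pass: keep each unit with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test pass: keep every unit, but scale it by keep_prob."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 10000
thinned = dropout_train(activations, keep_prob=0.8, rng=rng)

# The average activation seen during training and at test time should agree.
train_mean = sum(thinned) / len(thinned)
test_mean = sum(dropout_test(activations, keep_prob=0.8)) / len(activations)
```

Many current frameworks use the equivalent "inverted" variant instead, which divides by `keep_prob` during training and leaves the test pass unscaled.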
Article
This publication introduces a software toolbox that encapsulates different algorithmic solutions directed towards the automatic extraction of symbolic note information from digitized music excerpts. This process, often referred to as automatic music transcription, is still confronted with many issues such as mimicking the human perception or making a decision between ambiguous note candidates for symbolic representation. Therefore, the current publication describes algorithmic procedures dedicated to the detection and classification of drum notes, bass notes, main melody notes, and chord structure. The focus on four different domains of automatic transcription allows utilization of specialized analysis procedures for almost every aspect of music. This paper provides insight into the single transcription methods and their performance. Additionally, various application scenarios for the transcription-based interaction with music and audio are sketched with regard to the required technologies.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
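The per-mini-batch normalization mentioned in this abstract can be sketched as follows. This is a minimal training-time illustration in plain Python, assuming a batch stored as a list of feature rows; the names `gamma`, `beta`, and `eps` (learned scale, learned shift, and a small stabilising constant) follow common usage and are assumptions of this example rather than details given in the abstract.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift it."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (population) variance over the mini-batch.
        var = sum((x - mean) ** 2 for x in column) / n
        for i, x in enumerate(column):
            x_hat = (x - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out


# Two features on very different scales end up on a comparable scale.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalised = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

The sketch covers only the training-time forward pass; at inference time, running estimates of the mean and variance are typically used in place of the mini-batch statistics.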
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Conference Paper
We propose the Probabilistic YIN (PYIN) algorithm, a modification of the well-known YIN algorithm for fundamental frequency (F0) estimation. Conventional YIN is a simple yet effective method for frame-wise monophonic F0 estimation and remains one of the most popular methods in this domain. In order to eliminate short-term errors, outputs of frequency estimators are usually post-processed, resulting in a smoother pitch track. One shortcoming of YIN is that such post-processing cannot fall back on alternative interpretations of the signal because the method outputs precisely one estimate per frame. To address this problem, we modify YIN to output multiple pitch candidates with associated probabilities (PYIN Stage 1). These probabilities arise naturally from a prior distribution on the YIN threshold parameter. We use these probabilities as observations in a hidden Markov model, which is Viterbi-decoded to produce an improved pitch track (PYIN Stage 2). We demonstrate that the combination of Stages 1 and 2 raises recall and precision substantially. The additional computational complexity of PYIN over YIN is low. We make the method freely available online as an open source C++ library for Vamp hosts.