In this paper, we evaluate hand-crafted features as well as features learned from data using a convolutional neural network (CNN) for different fundamental frequency classification tasks. We compare classification based on full (variable-length) contours and classification based on fixed-sized subcontours in combination with a fusion strategy. Our results show that hand-crafted and learned features lead to comparable results for both classification scenarios. Aggregating contour-level to file-level classification results generally improves the results. In comparison to the hand-crafted features, our examination indicates that the CNN-based features show a higher degree of redundancy across feature dimensions, where multiple filters (convolution kernels) specialize on similar contour shapes.
Jakob Abeßer1, Meinard Müller2
1Semantic Music Technologies Group, Fraunhofer IDMT, Ilmenau, Germany
2International Audio Laboratories Erlangen, Germany
In recent years, data-driven algorithms for feature learning based
on deep neural networks have often outperformed traditional analysis
methods that exploit domain expert knowledge. Compared to hand-crafted
feature design, data-driven approaches often show superior
performance within analysis and classification scenarios. However,
as a main disadvantage, learnt feature representations often lack a
clear interpretation and give only little insight into the problem at
hand.
In the field of Music Information Retrieval (MIR), fundamental
frequency (f0) contours, i. e., variable-length time-series representa-
tions of the pitch curve of musical notes, are a rich mid-level repre-
sentation as they provide cues for both music performance analysis
and music content analysis [1]. For example, frequency contours
have been successfully applied for MIR tasks such as playing and
singing style analysis, as well as genre and music instrument
classification. In general, a reliable extraction of f0 contours from
polyphonic audio mixtures remains challenging to this day. One open
issue is how to best map variable-length f0 contours to fixed-size
feature representations for music classification applications.
As the main contribution of this paper, we systematically eval-
uate different contour feature representations for a wide range of
MIR classification tasks. In particular, we compare hand-crafted
features (knowledge-driven approach) with features learnt from data
(data-driven approach). To capture dependencies over time, vari-
ous sequence modeling techniques such as recurrent neural networks
(RNN) or auto-regressive models exist. In this paper, we will focus
on CNN-based methods for two reasons: first, shift-invariance with
respect to time is a useful property in our context and, second, convo-
lution kernels allow for a better interpretability (in terms of filters).
As a further contribution, we discuss a fusion approach based on
fixed-size segments (subcontours), which lead to better classification
results than approaches based on variable-length contours.
One prominent application scenario for frequency contour analysis
in MIR is to classify instrument playing techniques as part of auto-
matic transcription algorithms. For example, Barbancho et al. [7],
Abeßer et al. [8], and Kehling et al. [3] showed for isolated vi-
olin, bass guitar, and electric guitar recordings, respectively, that
typical frequency modulation techniques such as vibrato, bending,
and slides can be classified with high accuracy above 90 % on a
note-level. For ensemble recordings, the classification problem
becomes much harder. For example, Abeßer et al. reported in [9]
accuracy values between 48 % (fall-off) and 92 % (vibrato) for common
modulation techniques in trumpet and saxophone jazz solos. The
authors proposed a set of contour features that measures modulation,
fluctuation, and the average gradient of f0 contours (see PYMUS
feature set, Section 4.1). In a follow-up publication, these features were
used to investigate how the pitch modulation range and the intona-
tion depend on the musical context (within a solo) and on the artist
In [11], Salamon et al. used the Melodia melody detection
algorithm [12] to extract f0 contours from polyphonic music
recordings. Based on these f0 contours, the authors described a set of
low-dimensional features including contour duration, pitch range,
as well as vibrato rate, extent, and coverage. These contour
features outperformed low-level timbre features for genre
classification. Pantelli and Bittner proposed a set of contour features (see
BITTELI feature set, Section 4.1) for singing style analysis [13].
First, f0 contours were classified according to vocal/non-vocal
categories. Then, a dictionary-learning approach based on spherical
k-means clustering was used to derive fixed-size activation histograms
for vocal contours. Finally, these histograms were used as features
to analyze different singing styles. Using the same feature set,
Bittner et al. reported in [1] accuracy values around 0.72 for related
tasks such as vocal/non-vocal, bass/non-bass, melody/non-melody, and
singer's gender (male, female) classification.
In this paper, we use four datasets, which cover various music
analysis tasks and different levels of timbre complexity. Table 1 provides
a general overview of all datasets. We apply a post-processing step
(resampling) to obtain the same time resolution of 5.8 ms for all contours
across all four datasets. As will be detailed in Section 4.2, we only
consider contours with a duration between 197.2 ms and 1995.2 ms.
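The preprocessing described above can be sketched as follows. This is a minimal illustration only: the text does not state how the resampling is done, so linear interpolation is an assumption, as are the function names and the convention that a contour's duration equals its number of frames times 5.8 ms (which gives 34 frames for 197.2 ms and 344 frames for 1995.2 ms).

```python
HOP_MS = 5.8                       # common time resolution stated in the text
MIN_FRAMES, MAX_FRAMES = 34, 344   # 34 * 5.8 ms = 197.2 ms, 344 * 5.8 ms = 1995.2 ms


def resample_contour(times_ms, f0_hz, hop_ms=HOP_MS):
    """Linearly interpolate (times_ms, f0_hz) onto a regular grid.

    Linear interpolation is an assumed method; times_ms must be strictly increasing.
    """
    if len(times_ms) != len(f0_hz) or len(times_ms) < 2:
        raise ValueError("need at least two (time, f0) pairs of equal length")
    start, end = times_ms[0], times_ms[-1]
    n_frames = int((end - start) / hop_ms + 1e-9) + 1
    out, j = [], 0
    for i in range(n_frames):
        t = start + i * hop_ms
        # Advance to the input segment [times_ms[j], times_ms[j + 1]] that contains t.
        while j < len(times_ms) - 2 and times_ms[j + 1] < t:
            j += 1
        t0, t1 = times_ms[j], times_ms[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(f0_hz[j] + w * (f0_hz[j + 1] - f0_hz[j]))
    return out


def keep_contour(f0_frames):
    """True if the contour length lies within the range used in the experiments."""
    return MIN_FRAMES <= len(f0_frames) <= MAX_FRAMES


# Synthetic contour sampled every 10 ms over 400 ms (illustrative data only).
times = [10.0 * i for i in range(41)]
f0 = [200.0 + t for t in times]
resampled = resample_contour(times, f0)
```

With this convention, the synthetic 400 ms contour yields 69 frames and passes the duration filter.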
The GENRE database was used in [11] and contains 12531 contours
from 500 30-second excerpts equally distributed across the five
music genres opera, pop, flamenco, vocal jazz, and instrumental
jazz. The contours were extracted using the Melodia algorithm [12]
covering a frequency range of five octaves between 55 Hz and 1760 Hz.
Table 1: Dataset overview. Bold prints in column “Classes” indicate corresponding class abbreviations used in Figure 1. Final three columns
indicate number of classes, contours (C), subcontours (SC), and files in each dataset.
Label | Task | Classes | Dataset | Complexity | Contour Estimator | Number of Classes | C (SC) | Files
GENRE | Music Genre | flamenco, instrumental jazz, opera, pop, vocal jazz | In-house dataset [2] | multitimbral | Melodia | 5 | 12531 |
GUITAR | Playing Style | bending (BE), normal (NO), slide (SL), vibrato (VI) | IDMT-SMT-GUITAR [3] | monotimbral | Score + pYin | 4 | 2240 |
INST | Instrument | clarinet, flute, saxophone, singing voice (female), singing voice (male), trumpet, violin | IDMT-MONOTIMBRAL [4] | multitimbral | Melody Transcription + Peak Tracking | 8 | 10214 |
WJD | Playing Style | (pitch) bend, fall, slide, vibrato | Weimar Jazz Database (WJD) [5] | [...] trumpet etc.) [...] after source [...] | Score-Informed Separation + pYin [6] | 4 | 4964 |
The GUITAR dataset includes 2240 tones extracted from
monotimbral electric guitar recordings in the IDMT-SMT-GUITAR
dataset [3]. The notes are annotated with the four playing style
classes bending, normal (stable pitch), slide, and vibrato. The
pYin pitch tracker was used for note-wise contour extraction.
The INST dataset includes 10214 contours extracted from the
IDMT-MONOTIMBRAL dataset [4]. Here, only the monophonic
instrument classes violin, flute, trumpet, saxophone, clarinet, as well
as female and male singing voice were considered. Contours were
extracted by first running the automatic melody transcription
algorithm by Dressler [14], followed by a partial tracking based on linear
interpolation as part of the solo/accompaniment source separation
The WJD dataset includes a subset of 4964 tones taken from the
Weimar Jazz Database (WJD) [5], which are annotated with one of
the four playing style classes drop-off, slide, pitch-bend, and vibrato.
The WJD includes jazz ensemble recordings with predominant solo
instruments such as trumpet, tenor, alto, soprano saxophone, and
trombone. Using the same procedure as described in [5], we first ap-
plied score-informed solo/accompaniment source separation [15] to
extract the solo instrument, and then applied the pYin pitch tracking
algorithm [6] to extract frequency contours for all notes.
Figure 1 shows seven randomly chosen example contours for
each dataset and each class. For instance, characteristic con-
tour shapes such as the periodic frequency modulation of vibrato
tones can be recognized for the playing style classification tasks
(GUITAR and WJD). For high-level classification tasks such as
genre classification (GENRE) and instrument classification (INST),
the classes tend to be less homogeneous in the sense that there are
often several different contour shapes associated with a single class.
Furthermore, some contours cover multiple playing techniques such
as an initial pitch slide followed by a vibrato (see WJD examples).
4.1. Hand-Crafted Audio Features
In our experiments, we use two hand-crafted feature sets. The first
one is called BITTELI and was introduced in [13]. From this feature
set, we use 18 features including 6 features that capture the shape and
coverage of vibrato, 8 features derived from a polynomial
approximation of frequency contours, as well as 4 features derived from
global statistics.
Fig. 1: Randomly selected contours from each class of the four
datasets introduced in Table 1.
Fig. 2: Neural network architecture used for automatic contour
feature learning (see Section 4.2).
The second hand-crafted feature set is the PYMUS set, which
consists of 17 features including 3 features describing vibrato
characteristics, 10 features measuring different contour shape properties
related to fluctuation and gradient, as well as 4 features derived from
a temporal contour segmentation. The two feature sets contain
different types of features and overlap only with regard to vibrato rate
4.2. Feature Learning
Fig. 3: Summary of the contour classification strategies. Variable-length contours can be processed solely by hand-crafted features (BITTELI,
PYMUS). Fixed-size subcontours can be processed by all feature extractors.
Table 2: Features included in the hand-crafted feature sets BITTELI
and PYMUS, sorted by three feature categories. Salience-based
features are discarded.
Category | BITTELI | PYMUS
Vibrato features | Rate (1); Extent (1); Coverage features (4) | Modulation frequency (1); Modulation dominance (1); Number of modulation periods (1)
Contour shape features | Polynomial-fit on frequency contour (8); Polynomial-fit on salience contour (7) | Measures for intonation & fluctuation (6); Frequency gradient descriptors (4); Contour segmentation features (4)
Global statistics | Duration (1); Pitch (3); Salience (3) | -
Total | 18 | 17
Next, we introduce feature sets that are automatically derived
using a data-driven approach based on CNNs. In particular, as
shown in Figure 2, we compare two neural network architectures
with one (CNN-1) or two processing blocks (CNN-2) followed by
two fully-connected (FC) layers. Each processing block includes
a one-dimensional convolutional layer (CONV) followed by batch
normalization (BN) [16], a rectified linear unit (ReLU), and a
dropout (DO) [17] layer. In each convolutional layer, we empirically
selected 30 filters, each filter having a size of 10 (corresponding to
a duration of 58 ms). We used the Adam optimizer, a learning rate
of 10^-3, and a batch size of 256. Optimizing the hyperparameters
is not within the scope of this paper.
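To make the role of a single convolution filter more concrete, the following sketch applies one filter of size 10 followed by a ReLU to a 34-frame contour. It is an illustration under stated assumptions, not the trained model: stride 1 and no padding are assumed because the text does not specify them, batch normalization and dropout are omitted, and the kernel is a hypothetical hand-set pattern rather than a learnt one.

```python
FRAME_MS = 5.8      # time resolution of the contours
KERNEL_SIZE = 10    # filter size given in the text


def conv1d_relu(signal, kernel, bias=0.0):
    """Stride-1 cross-correlation without padding, followed by ReLU (assumed settings)."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        acc = bias
        for offset in range(k):
            acc += signal[start + offset] * kernel[offset]
        out.append(max(0.0, acc))
    return out


# Hypothetical hand-set filter that responds to rising pitch (not a learnt kernel).
rising_kernel = [-1.0] * 5 + [1.0] * 5

rising = [float(i) for i in range(34)]     # steadily increasing contour
falling = list(reversed(rising))           # steadily decreasing contour

resp_rising = conv1d_relu(rising, rising_kernel)
resp_falling = conv1d_relu(falling, rising_kernel)
filter_duration_ms = KERNEL_SIZE * FRAME_MS   # about 58 ms, as stated in the text
```

Under these assumptions, the filter produces a constant positive response on the rising contour and zero on the falling one, which is the kind of shape selectivity examined in Section 6.1.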
Each contour is assigned to one class (compare Table 1). These
classes are used as targets for the final softmax layer to train the
models in a supervised fashion. After training, the activations of the
penultimate fully-connected layer are used as features. For training
the networks, we extract fixed-size subcontours from the
variable-length contours as input data for the CNN model using a window
size W and a hop size of 50 %. The class label of each contour is
transferred to its subcontours. Since a common range for the vibrato
rate is between 5 and 12 Hz [5], we use W = 34 (corresponding to
197.2 ms) as window size to capture at least one full vibrato period.
In our classification experiment described in Section 5, we include
all original contours that have a minimum length of 34 frames (197.2
ms). As shown in Figure 3, the predicted class probabilities on the
subcontour level are aggregated by averaging to obtain contour-level
predictions.
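The subcontour extraction and the averaging-based fusion described above can be sketched as follows. The function names and the toy probabilities are hypothetical, and dropping trailing frames that do not fill a complete window is an assumption, since the text does not say how the end of a contour is handled.

```python
WINDOW = 34          # W, in frames (197.2 ms at 5.8 ms per frame)
HOP = WINDOW // 2    # 50 % hop size, i.e. 17 frames


def extract_subcontours(contour, window=WINDOW, hop=HOP):
    """Slice a variable-length contour into fixed-size, overlapping subcontours.

    Trailing frames that do not fill a whole window are dropped (an assumption).
    """
    return [contour[s:s + window]
            for s in range(0, len(contour) - window + 1, hop)]


def aggregate_probabilities(subcontour_probs):
    """Average per-subcontour class probabilities into one contour-level vector."""
    n = len(subcontour_probs)
    n_classes = len(subcontour_probs[0])
    return [sum(p[c] for p in subcontour_probs) / n for c in range(n_classes)]


def predict_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


contour = [440.0 + 0.1 * i for i in range(100)]   # synthetic 100-frame contour
subs = extract_subcontours(contour)                # windows start at frames 0, 17, 34, 51

# Hypothetical classifier outputs for the four subcontours (three classes).
probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1], [0.5, 0.3, 0.2]]
contour_probs = aggregate_probabilities(probs)
contour_label = predict_class(contour_probs)
```

The same averaging step can be reused one level up, with contour-level probabilities as input, to obtain a file-level decision.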
The four feature sets have comparable numbers of feature dimensions:
PYMUS (17), BITTELI (18), CNN-1 (17), and CNN-2 (17).
We perform a three-fold cross-validation. In each fold, we repeat the
following procedure for each of the classification tasks: The current
dataset is randomly split into a training and a test set (80 % : 20 %) based
on unique file assignments, so that f0 contours from the same
recording do not end up in both sets. Using the subcontours extracted from
the training set as input and their class labels as targets, we train the
CNN-1 and CNN-2 models. Afterwards, we extract feature vectors
from the training set using all four feature extractors, normalize them
to zero mean and unit variance, and train four independent random
forest classifier models [18] with 50 trees each. Finally, we evaluate
the classification performance on the test set by applying feature
scaling using the mean and standard deviation values derived from the
training set and computing the F1 score of the model predictions.
The random forest classifier was chosen as it easily allows us to further
analyze the importance of different feature dimensions for the
trained models (compare Section 6.2). As illustrated in Figure 3, we
compare both subcontour-level and contour-level classification for
the two hand-crafted feature sets BITTELI and PYMUS. In addition,
for the GENRE and INST datasets, we investigate a file-level
aggregation strategy by averaging over all contour-level class
probabilities.
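Two details of this procedure are easy to get wrong: splitting by file rather than by contour, and estimating the scaling parameters on the training set only. The sketch below illustrates both with toy data. The function names, the fixed seed, and the rounding of the test share are assumptions, and the random forest training itself is left out.

```python
import random
from statistics import mean, pstdev


def split_by_file(file_ids, test_fraction=0.2, seed=0):
    """Assign whole files to train or test so that no recording appears in both."""
    files = sorted(set(file_ids))
    random.Random(seed).shuffle(files)
    n_test = max(1, round(test_fraction * len(files)))
    test_files = set(files[:n_test])
    train_idx = [i for i, f in enumerate(file_ids) if f not in test_files]
    test_idx = [i for i, f in enumerate(file_ids) if f in test_files]
    return train_idx, test_idx


def fit_scaler(rows):
    """Per-dimension mean and standard deviation, estimated on training data only."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]   # guard against constant features
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data: ten contours from five files, two feature dimensions each.
file_ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
features = [[float(i), 2.0 * i + 1.0] for i in range(10)]

train_idx, test_idx = split_by_file(file_ids)
means, stds = fit_scaler([features[i] for i in train_idx])
train_scaled = apply_scaler([features[i] for i in train_idx], means, stds)
test_scaled = apply_scaler([features[i] for i in test_idx], means, stds)
```

Reusing the training statistics for the test set keeps information from the held-out files out of the feature scaling.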
Table 3 shows the mean F1 scores obtained from the
cross-validation folds for all combinations of classification level, dataset,
and aggregation strategy. As general findings, we observe very
similar scores for both hand-crafted features and learnt features as well
as for subcontour and contour classification. Throughout all
feature extractors, aggregating the contour-level classification results into
file-level results clearly boosts the F1 scores by up to 0.24.
Table 3: Mean F1 scores from 3-fold cross-validation. Results are
shown for different feature extractors, datasets, classification
levels (C = contour, SC = subcontour), and result aggregation level
(contour-level, file-level). Best results for each dataset are
highlighted in bold print.
Features | Level | Contour-Level: GENRE | GUITAR | INST | WJD | File-Level: GENRE | INST
BITTELI | C | 0.51 | 0.97 | 0.38 | 0.87 | 0.73 | 0.56
BITTELI | SC | 0.54 | 0.96 | 0.43 | 0.82 | 0.76 | 0.54
PYMUS | C | 0.53 | 0.98 | 0.35 | 0.87 | 0.79 | 0.52
PYMUS | SC | 0.55 | 0.97 | 0.31 | 0.83 | 0.85 | 0.45
CNN-1 | SC | 0.54 | 0.95 | 0.34 | 0.83 | 0.85 | 0.49
CNN-2 | SC | 0.63 | 0.96 | 0.43 | 0.84 | 0.94 | 0.67
The highest scores are achieved for the two playing-style
classification datasets GUITAR and WJD. While all four models
perform comparably well on the easier-to-classify GUITAR dataset,
the hand-crafted features perform better on the WJD dataset. For the
more difficult classification tasks based on the INST and GENRE
datasets, the two-level CNN model (CNN-2) clearly outperforms
its simpler counterpart (CNN-1) and the two hand-crafted features.
Presumably, the CNN-2 model can learn to recognize more complex
contour shapes.
6.1. Prototype Contours
In the following, we aim to gain better insight into which contour
shapes the CNN-based features capture. As an example, we
investigate the two-layer CNN-2 model. We use the activation
maximization algorithm [19] (see footnote 3) to generate frequency contours that
maximize the activations of each of the 17 neurons in the penultimate
dense layer, i.e., the individual feature dimension values. Figure 4
shows such contours for each of the four datasets. Despite the
different underlying classification tasks, the networks learn to recognize
similar contour shapes: increasing and decreasing frequency
contours (pitch slides), alternating sequences of increasing and
decreasing contour parts (pitch bends), as well as modulating (vibrato-like)
contours.
6.2. Feature Set Redundancy
The contours shown in Figure 4 indicate a certain redundancy as sim-
ilar shapes can be found across neurons. In order to measure the in-
formation redundancy within different feature sets, we first compute
all pair-wise correlation coefficients between features of the same
set. Here we only focus on features extracted from subcontours. We
observe significantly higher mean correlation values of 0.451 for the
learnt feature sets than for the hand-crafted feature sets (0.189).
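The redundancy measure described above can be sketched as the mean over all pair-wise Pearson correlation coefficients between feature dimensions. Averaging absolute values is an assumption made here so that positive and negative correlations do not cancel; the text does not state whether signed or absolute coefficients were averaged. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_pairwise_correlation(feature_matrix):
    """Mean absolute Pearson correlation over all pairs of feature dimensions."""
    columns = list(zip(*feature_matrix))
    pairs = list(combinations(range(len(columns)), 2))
    return sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)


# Toy feature matrix: rows are subcontours, columns are feature dimensions.
# The second column is an exact multiple of the first, i.e. fully redundant.
toy_features = [[1.0, 2.0, 5.0],
                [2.0, 4.0, 1.0],
                [3.0, 6.0, 4.0],
                [4.0, 8.0, 2.0]]
redundancy = mean_pairwise_correlation(toy_features)
```

A value close to 1 indicates that many dimensions carry nearly the same information, whereas a value close to 0 indicates largely uncorrelated dimensions.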
3 We used the keras-vis implementation of the activation maximization
algorithm [20] with hand-tuned parameter values tv_weight = 0.01 and
lp_norm_weight = 0.01.
Fig. 4: Normalized frequency contours that maximize neuron
activations in the penultimate dense layer, which are used as feature vectors.
Panels: (a) GENRE, (b) GUITAR, (c) INST, (d) WJD.
Additionally, we analyze the feature importances, which
measure the effect of each feature in the random forest models. Low information
redundancy leads to only a few features having high importance values,
whereas high redundancy would result in an almost uniform
distribution. We compute the entropy H to measure the uniformity of the
distribution over the feature importance values. We observe higher
entropy values for the feature learning configurations (0.960) than
for the configurations using hand-crafted features (0.925). Both
results indicate that discriminative information is more concentrated
in the hand-crafted features, where only a subset of the features has
a high effect in the classifier models. In contrast, the learnt features
show a higher information redundancy across feature dimensions.
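The entropy-based uniformity measure can be sketched as below. Dividing by log(N) so that the result lies between 0 and 1 is an assumption: the reported values (0.960 and 0.925) are consistent with such a normalization, but the text does not state it. The function name and the example importance vectors are hypothetical.

```python
from math import log


def normalized_entropy(importances):
    """Shannon entropy of the normalized importances, divided by log(N).

    Returns a value near 1 for a uniform distribution and 0 if a single feature
    carries all importance. The normalization by log(N) is an assumption.
    """
    total = sum(importances)
    probs = [v / total for v in importances if v > 0]
    h = -sum(p * log(p) for p in probs)
    return h / log(len(importances))


# Illustrative importance vectors for 17 feature dimensions.
uniform = [1.0 / 17] * 17                  # every feature equally important
concentrated = [0.9] + [0.1 / 16] * 16     # one dominant feature

h_uniform = normalized_entropy(uniform)
h_concentrated = normalized_entropy(concentrated)
```

In this reading, a higher entropy means that importance is spread more evenly over the feature dimensions, which the text interprets as higher redundancy.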
This paper compares hand-crafted features and automatically learnt
features for different f0 contour classification tasks. Our findings
show that embedding features from a simple non-optimized neural
network architecture with two convolutional layers can outperform
hand-crafted features based on expert knowledge. Multiple
convolutional layers allow for learning more complex contour shapes, which
is beneficial especially for higher-level tasks such as genre and
instrument recognition.
The evaluation results show that using fixed-length subcontours
in combination with an aggregation strategy leads to classification
accuracies comparable to those of an approach using full
contours. As an advantage, classifying subcontours allows for a more
complex (time-dependent) description of playing techniques on a
note level (e.g., initial pitch bend followed by a vibrato). A close
investigation of the CNN-based features revealed that the learnt
feature sets have a higher redundancy across feature dimensions than
the hand-crafted features. Reducing the number of filters and the
amount of redundancy while maintaining the classification
performance could be a guideline for model optimization.
This work has been supported by the German Research Foundation
(AB 675/2-1, MU 2686/11-1). The International Audio Laboratories
Erlangen are a joint institution of the Friedrich-Alexander-Universität
Erlangen-Nürnberg (FAU) and Fraunhofer Institut für
Integrierte Schaltungen IIS. The authors would like to thank Justin
Salamon, Rachel Bittner, and Maria Pantelli for valuable discussions
and for sharing their contour datasets.