On the use of U-Net for dominant melody
estimation in polyphonic music
Guillaume Doras
Sacem, Ircam
UMR STMS 9912, CNRS
Paris, France
guillaume.doras@sacem.fr
Philippe Esling
Ircam,
UMR STMS 9912, CNRS
Paris, France
philippe.esling@ircam.fr
Geoffroy Peeters
LTCI, Telecom ParisTech
University Paris-Saclay
Paris, France
geoffroy.peeters@telecom-paristech.fr
Abstract—Estimation of the dominant melody in polyphonic music remains a difficult task, even though promising breakthroughs have been made recently with the introduction of the Harmonic CQT and the use of fully convolutional networks. In this paper, we build upon this idea and describe how U-Net – a neural network originally designed for medical image segmentation – can be used to estimate the dominant melody in polyphonic audio. We propose in particular an original layer-by-layer sequential training method, and show that this method, used along with careful training data conditioning, improves the results compared to plain convolutional networks.
Index Terms—dominant melody estimation, pitch estimation,
HCQT, U-Net
I. INTRODUCTION
Dominant melody or multi-pitch estimation in polyphonic music has long been seen as a difficult problem in Music Information Retrieval (MIR), both because of the inherent harmonic complexity of real-life music and because of the lack of annotated data available for training and evaluation.
Most of the successful approaches proposed so far for
this task start by deriving a pitch salience from a spectral
representation, and then apply some heuristics to it to esti-
mate the dominant melody and/or the multiple pitches. Such
heuristics are as varied as harmonic partials summation [1],
pitch contour tracking [2], spectral smoothness enforcement
[3], [4] or source-filter modeling [5], [6].
Recently, deep neural networks have been proposed to
compute this pitch salience representation, using Recurrent
Neural Networks (RNN) in [7], Convolutional Neural Net-
works (CNN) in [8], or a combination of both in [9]. The audio
representation usually provided as input to the network is the
Short Time Fourier Transform (STFT), but some authors have
also used the raw waveform [10] or the Harmonic Constant-Q
Transform (HCQT) [8].
In this paper, we propose the use of a U-Net architecture
to estimate the dominant melody in polyphonic music. We
propose a sequential method to train the U-Net using ground
truth data at increasing resolutions, and show that this method
improves performances compared to the usual training. We
also compare the performances of the U-Net to those of
the full CNN proposed in [8], and show that the U-Net
architecture brings slight improvements over this previously
proposed approach.
II. RELATED WORK
In this work, we build upon three main existing concepts:
the HCQT data representation, the U-Net architecture and the
curriculum learning paradigm.
A. Harmonic Constant-Q Transform (HCQT)
The HCQT, introduced in [8], is an elegant and astute repre-
sentation of the audio signal in 3 dimensions (time, frequency,
harmonic). It stacks along the third dimension several standard
CQTs sharing the same frequency resolution and frequency
range, but starting at different minimal frequencies h·fmin, where fmin is the minimal frequency of interest and h is the harmonic index of each CQT. The harmonic components of
the audio signal will thus be represented along the third axis of
the HCQT and localized in the time-frequency domain across
its first and second dimensions.
The alignment of harmonic series along the third dimension
makes this representation particularly suitable for melody
tracking, as it can be directly processed by convolutional networks, whose 3-D filters can be trained to localize, in the time-frequency plane, the harmonic components of the
melody of the input signal.
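As an illustration, such an HCQT can be built by stacking magnitude CQTs computed with librosa, as in the minimal sketch below. This is not the code of [8]; the function name, sampling rate and hop length are our own assumptions, chosen so that the settings roughly match the configuration described in Section IV.

import numpy as np
import librosa

def compute_hcqt(y, sr=22050, fmin=32.7, harmonics=(0.5, 1, 2, 3, 4, 5),
                 n_octaves=6, bins_per_octave=60, hop_length=256):
    """Stack CQTs whose minimal frequencies are harmonic multiples of fmin.

    Returns an array of shape (n_harmonics, n_bins, n_frames): bin (h, f, t)
    holds the magnitude at h times the frequency of bin (1, f, t).
    """
    cqts = [np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                               fmin=h * fmin,
                               n_bins=n_octaves * bins_per_octave,
                               bins_per_octave=bins_per_octave))
            for h in harmonics]
    return np.stack(cqts, axis=0)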
B. U-Net
U-Net was originally introduced in the context of image
segmentation [11] for identifying and localizing high resolu-
tion details in medical images. It can be seen as a downsam-
pling/upsampling model, where the downsampling part (the
descending branch of the U) is learning representations of the
input image at coarser resolutions by means of convolution
and pooling layers, while the upsampling part (the ascending
branch of the U) is learning to recreate representations at finer
resolutions by means of convolution and transposed convo-
lution layers [12]. The main difference with a convolutional
Auto-Encoder is the introduction of skip-connections from the
encoding levels to their counterpart decoding levels. These
skip connections can be seen as a manner of providing an
information context to the next reconstruction level, and have
proven to help localization of features of interest.
The U-Net model has already been used in the context
of MIR for sources separation [13]. We apply it here for
dominant melody estimation, as we think that an analogy can
be drawn between this problem and the image segmentation problem. Indeed, considering the HCQT as an image with h channels, contrasting and extracting the melody line from the background noise can be seen as a task similar to contrasting and extracting object boundaries out of the rest of a natural image.

Fig. 1. U-Net model for dominant melody estimation (input HCQT of shape 360 × 258 × 6; four resolution levels of 64, 128, 256 and 512 feature maps; 360 × 258 sigmoid output trained with a cross-entropy loss).
C. Curriculum learning
Curriculum learning was introduced as a continuation
method, i.e. a strategy to minimize non-convex criteria [14],
based on the intuition that a model – similarly to humans
– could learn more efficiently if trained with successive
training objectives of increasing difficulty, starting first with
smooth objectives and gradually increasing the level of their
complexity. This can also be seen as a sort of pre-training,
which has proven to be beneficial [15].
The nature of the dominant melody estimation problem and
the architecture of the U-Net are well suited for a curriculum
learning approach: instead of training the model to deal with
high resolution information directly, it is possible to prune
parts of the network and train it repeatedly level by level.
Successive trainings will start with coarse resolution informa-
tion at the lowest level of the U, and continue with increasing
resolutions while adding higher levels to the upsampling
branch. This will be described more in details in section III.
III. METHOD
A. Model
The U-Net model used here is directly inspired by the
seminal U-Net of [11], and is depicted in Fig. 1 with four
levels.
On the down-sampling branch, each level consists of a batch normalization layer followed by two convolution layers with 3 × 3 kernels. Contrary to the original U-Net, padding is applied before convolutions ('same' convolutions), so that the time and frequency dimensions are maintained. The convolution layers are then followed by a max-pooling layer with a 2 × 2 kernel and a stride of 2. The first level starts with 64 kernels, and the number of kernels is doubled at each level (i.e. the deepest level handles 512-channel tensors).
On the up-sampling branch, each level consists of a batch normalization layer followed by a transposed convolution layer with 2 × 2 kernels and a stride of 2, followed by two convolution layers with 3 × 3 kernels. The number of kernels is halved at each level.
At each resolution level, the output of the down-sampling branch is concatenated with the output of the up-sampling branch. When a dimension is odd on the down-sampling branch, the corresponding up-sampled dimension is even. In this case, the supernumerary row or column is simply removed, so that the data in the two branches have the same shape and can be concatenated via the corresponding skip connection.
Finally, the output tensor is processed with a 1 × 1 convolution layer with sigmoid activation, such that each time/frequency bin models a probability. The model is then trained to minimize the cross-entropy between the output probabilities and the target ground truth normalized activations.
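For concreteness, a minimal PyTorch sketch of such a four-level U-Net is given below. It is an illustration only, not the authors' implementation: the framework, the class and function names (MelodyUNet, double_conv), the ReLU activations and the exact batch-normalization placement are our assumptions; the kernel sizes, channel counts and skip-connection cropping follow the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def double_conv(in_ch, out_ch):
    # Batch norm followed by two 'same' 3x3 convolutions (ReLUs are our assumption).
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU())

class MelodyUNet(nn.Module):
    def __init__(self, in_ch=6, base=64, levels=4):
        super().__init__()
        chans = [in_ch] + [base * 2 ** i for i in range(levels)]  # 6, 64, 128, 256, 512
        self.downs = nn.ModuleList(double_conv(chans[i], chans[i + 1])
                                   for i in range(levels))
        self.upconvs = nn.ModuleList(nn.ConvTranspose2d(chans[i + 1], chans[i], 2, stride=2)
                                     for i in range(levels - 1, 0, -1))
        self.ups = nn.ModuleList(double_conv(2 * chans[i], chans[i])
                                 for i in range(levels - 1, 0, -1))
        self.head = nn.Conv2d(base, 1, kernel_size=1)   # final 1x1 layer

    def forward(self, x):                               # x: (batch, 6, 360, 258)
        skips = []
        for i, block in enumerate(self.downs):
            x = block(x)
            if i < len(self.downs) - 1:
                skips.append(x)
                x = F.max_pool2d(x, 2, ceil_mode=True)  # 2x2 pooling, stride 2
        for upconv, block, skip in zip(self.upconvs, self.ups, reversed(skips)):
            x = upconv(x)
            # Remove the supernumerary row/column when the skip tensor had an odd size.
            x = x[..., :skip.shape[-2], :skip.shape[-1]]
            x = block(torch.cat([skip, x], dim=1))
        return torch.sigmoid(self.head(x))              # per-bin probabilities

# Training objective, as described above:
# loss = F.binary_cross_entropy(MelodyUNet()(hcqt), target)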
B. Data chunking
In order to process the full duration of songs and their
corresponding dominant melody ground truth annotations, the
data is split into chunks. Different durations of chunks have
been tried, and preliminary experiments showed that a duration
of 3 seconds is a good trade-off.
As padded convolutions are used, extra data might be added
to the borders of the chunks on the frequency and the time axis.
Additionally, removing supernumerary rows or columns at the
borders when up-sampling might remove relevant information
propagated from one layer to another.
We considered that zero padding on the frequency axis is
acceptable, as lowest and highest frequencies of the HCQT are
very unlikely to be part of the melody. However, side effects on the time axis might not be negligible. To mitigate these effects, we overlap the beginning and the end of each chunk. The full-duration melody estimation is reconstructed by trimming each chunk's overlapping parts and concatenating the remaining parts along the time axis. In practice, an overlap of 0.3 seconds at the beginning and at the end of each chunk appears adequate for 3-second chunks.
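A minimal sketch of this chunking scheme is given below (our own illustration; the frame counts are derived from the 11 ms frame duration of Section IV, and the handling of the very first and last borders is left out for brevity).

import numpy as np

CHUNK, OVERLAP = 258, 26   # ~3 s and ~0.3 s at ~11 ms per frame (assumed values)

def split_chunks(x, chunk=CHUNK, overlap=OVERLAP):
    """Split x of shape (..., time) into chunks whose trimmed centres tile the signal."""
    hop = chunk - 2 * overlap
    starts = range(0, max(1, x.shape[-1] - 2 * overlap), hop)
    return [x[..., s:s + chunk] for s in starts]

def merge_chunks(outputs, overlap=OVERLAP):
    """Trim each chunk's overlapping borders and concatenate the remaining centres."""
    centres = [o[..., overlap:o.shape[-1] - overlap] for o in outputs]
    return np.concatenate(centres, axis=-1)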
C. Curriculum training
We have investigated two different training methods: a clas-
sical end-to-end method, and a level-by-level method inspired
by the curriculum learning paradigm.
In the level-by-level method, the up-sampling branch of the
U-Net is initially pruned, except its lowest level. The ground
truth target is downsampled with three pooling layers (that
have no trainable parameters) to match the dimensionality of
the lowest resolution level, as illustrated in Fig. 2. Only the
down-sampling branch and the lowest level are then trained to
minimize the cross entropy loss between the network output
and the ground truth target at this coarsest resolution.
Fig. 2. Training of the lowest level of our Dominant Melody U-Net model (the ground truth target is downsampled with three successive mean-pooling layers before the cross-entropy loss is computed).
The next level layers and skip connections are then added to
the partially trained network. The resulting network is trained
to minimize the loss re-defined as the cross entropy between
the new level output and the ground truth downsampled to the
corresponding dimensionality.
Each subsequent level is then added in turn, and the entire resulting network is trained, reusing the weights of the lower
levels. These successive partial trainings are repeated until the
highest and finest resolution level is trained.
The main goal behind this strategy is to provide information
about the ground truth target to the up-sampling branch as
early as possible. Our assumption is that providing ground
truth information at coarse resolution should help reconstruc-
tion at higher resolution levels.
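The target downsampling can be implemented with parameter-free mean pooling, as in the sketch below. This is only an outline of the procedure, not the authors' code; grow_unet is a hypothetical helper standing for the addition of the next up-sampling level and skip connection to the partially trained network.

import torch.nn.functional as F

def downsample_target(target, n_pool):
    """Mean-pool the (batch, 1, freq, time) ground-truth activations n_pool times
    so that they match the resolution of the currently trained U-Net level."""
    for _ in range(n_pool):
        target = F.avg_pool2d(target, kernel_size=2, ceil_mode=True)
    return target

# Level-by-level schedule (illustrative): the lowest level is trained against a
# target pooled three times, then one level is added per stage and the weights
# of the lower levels are reused.
# for n_pool in (3, 2, 1, 0):
#     model = grow_unet(model)            # hypothetical helper, see lead-in
#     for hcqt, target in train_loader:
#         loss = F.binary_cross_entropy(model(hcqt),
#                                       downsample_target(target, n_pool))
#         loss.backward(); optimizer.step(); optimizer.zero_grad()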
IV. DOMINANT MELODY EXTRACTION EXPERIMENT
A. Dataset
To train our networks, we have used the first release of the MedleyDB dataset [16], which provides dominant melody and multi-pitch annotations for 108 songs of varied musical styles (a newer and more accurate version has since been released [17], but was not yet available during our experiments). We used the "melody2" annotations (see [17] for details) as the dominant melody target for our networks.
Train/validation/test sets. Preliminary experiments have
shown that different randomized train/validation/test sets splits
could lead to very different results from one split to another. In
order to obtain more robust results, we conducted a 10-fold cross-validation experiment. The 108 songs were divided into 10 folds of 10 to 11 songs each, using artist filtering (songs by the same artist must belong to the same fold). Each of the ten folds was used in turn as the test set. Another fold was randomly picked among the nine remaining ones to be used as the validation set, while the remaining eight folds were used together as the train set.
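Such an artist-filtered split can be obtained, for instance, with scikit-learn's GroupKFold, as sketched below (an illustration under our own assumptions, not the exact split used in the experiments).

from sklearn.model_selection import GroupKFold

def artist_filtered_folds(songs, artists, n_folds=10):
    """Return one array of song indices per fold; all songs of an artist share a fold."""
    gkf = GroupKFold(n_splits=n_folds)
    return [test_idx for _, test_idx in gkf.split(songs, groups=artists)]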
Baseline comparison. Because of this approach, we cannot
directly compare our results to the ones published in [8].
In the following, we therefore consider as the baseline our
own re-implementation of [8] applied to each of the 10-folds
train/validation/test sets.
B. Configuration
For all experiments, we compute the HCQT as described in [8] with fmin = 32.7 Hz and 6 harmonics, h ∈ {0.5, 1, 2, 3, 4, 5}. Each CQT spans 6 octaves with a resolution of 60 bins per octave (5 bins per semi-tone), and has a frame duration of 11 ms. The implementation of the CQT was done with the Librosa library [18].
Training parameters. For training, we shuffled the chunks
of the training set and then used a batch size of 16 chunks.
We optimized the parameters using Adam [19] with a learning rate starting at 10⁻⁴ and a decay factor of 0.94 per epoch. We applied early stopping if the loss on the validation set had not decreased after 1000 training steps.
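A hedged sketch of this optimization setup is shown below (PyTorch is our own choice, since the training framework is not stated in the paper; model, train_loader and the helper validation_loss are hypothetical inputs).

import torch
import torch.nn.functional as F

def train(model, train_loader, validation_loss, max_epochs=1000):
    """Sketch of the optimization setup; validation_loss is a hypothetical helper
    returning the current loss on the validation set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.94)
    best_val, steps_since_best = float("inf"), 0
    for epoch in range(max_epochs):
        for hcqt, target in train_loader:            # batches of 16 shuffled chunks
            loss = F.binary_cross_entropy(model(hcqt), target)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            val = validation_loss(model)
            if val < best_val:
                best_val, steps_since_best = val, 0
            else:
                steps_since_best += 1
            if steps_since_best >= 1000:             # early stopping criterion
                return model
        scheduler.step()                             # 0.94 learning-rate decay per epoch
    return model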
From pitch saliency to dominant melody. The output of
the networks (either the full CNN or U-Net) is a pitch salience
representation. As in [8], we obtain the dominant melody by simply keeping, at each time frame, the frequency with the maximum salience value. For the voicing/unvoicing decision
at each time frame, we use a threshold whose value is chosen
to optimize the Overall Accuracy score on the validation set.
This threshold is then fixed and used on the test set before
scores described below are computed.
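This decoding step can be sketched as follows (our own illustration; freqs is assumed to map CQT bins to frequencies in Hz, and threshold is the value tuned on the validation set).

import numpy as np

def salience_to_melody(salience, freqs, threshold):
    """Decode a (freq, time) salience map into a frame-wise f0 sequence in Hz."""
    f0 = freqs[salience.argmax(axis=0)]          # frame-wise maximum salience bin
    voiced = salience.max(axis=0) >= threshold   # voicing/unvoicing decision
    return np.where(voiced, f0, 0.0)             # 0 Hz marks unvoiced frames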
C. Performance measures
To measure the performances of our system, we computed
the melody Overall Accuracy (OA) along with the Raw
Chroma Accuracy (RCA), Raw Pitch Accuracy (RPA) as well
as the Voicing Recall (VR) and the Voicing False Alarm (VFA)
scores as provided by the mir_eval toolbox [20].
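These scores can be obtained directly from mir_eval, as in the following sketch (the frame times are reconstructed here from the 11 ms frame duration, which is an assumption about how the annotations are sampled).

import numpy as np
import mir_eval

def melody_scores(ref_freq, est_freq, frame_duration=0.011):
    """ref_freq / est_freq: frame-wise f0 in Hz, with 0 marking unvoiced frames."""
    times = np.arange(len(ref_freq)) * frame_duration
    scores = mir_eval.melody.evaluate(times, ref_freq, times, est_freq)
    return (scores["Overall Accuracy"], scores["Raw Pitch Accuracy"],
            scores["Raw Chroma Accuracy"], scores["Voicing Recall"],
            scores["Voicing False Alarm"])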
V. RESULTS
We compare here three different systems: 1) our re-implementation of the fully convolutional baseline proposed in [8], 2) the U-Net trained end-to-end using chunks with temporal overlap, and 3) the U-Net trained using curriculum training and chunks with temporal overlap.
We show in Fig. 3 the distributions, over the 10 folds, of the mean scores obtained by the three models.
Fig. 3. 10-fold mean score distributions obtained with mir_eval (OA = Overall Accuracy, RPA = Raw Pitch Accuracy, RCA = Raw Chroma Accuracy, VR = Voicing Recall, VFA = Voicing False Alarm).
We see on Fig. 3 that the proposed U-Net provides some
improvement for all scores compared to the baseline CNN.
Level-by-level training improvements. Now comparing
the types of training used for U-Net, it appears that our
proposed level-by-level training also provides further improve-
ments on all scores, except for the Voicing False Alarm metric.
Interestingly, the variance for these scores across the ten folds
seems to be lower compared to the other models. This suggests
that U-Net’s generalization ability benefits from curriculum
training, and that isolating and training lower levels first with
coarse resolution data helps training of higher levels dealing
with finer resolution data. The effect seems however less
obvious on the Voicing False Alarm.
Voicing False Alarm. Despite the improved accuracies of
U-Net, the Voicing False Alarm remains fairly high (around
20%). This is illustrated for a specific song in Fig. 4 where
false voicing/unvoicing decisions are indeed often made: high
values of pitch salience are present where no dominant melody
is annotated. These voicing errors could be related to a discrep-
ancy between the validation set (for which the voicing decision
threshold has been optimized) and the test set. However, a
visualization of the network outputs corresponding to empty
chunks (i.e. chunks where no dominant melody is present)
indicates that U-Net generally produces a non-empty output
even when it should not. This suggests that conditioning the
output with extra voicing/unvoicing information could be
beneficial, for instance with a dual loss [21].
Post hoc statistical significance analysis. All in all, the
improvements of the mean scores observed in Fig. 3 between
the different models remain fairly small. We have therefore
conducted a Tukey’s Honestly Significant Difference test
(HSD) between each pair of models to assess the statistical
significance of the observed improvements.
Fig. 4. [Top] Pitch salience output of the U-Net trained level-by-level on overlapping chunks for the test-set song "Don't Hear A Thing" by Brandon Webster. [Bottom] Corresponding MedleyDB ground truth annotation.
The HSD test shows that the small differences between the mean scores of each fold are not large enough to reject the null hypothesis, i.e. the improvements observed on this dataset are not statistically significant enough to draw a definitive conclusion.
VI. CONCLUSION
In this paper, we have proposed to use the U-Net model
for dominant melody estimation, and compared it with one
of the current state-of-the-art models for this task, a fully
convolutional network.
We have proposed to improve the performances of the U-Net model in two ways. Firstly, by overlapping training data to mitigate the side effects introduced by the padding and un-padding of its convolution and transposed convolution layers.
Secondly, by training U-Net with a curriculum training ap-
proach, starting with lower levels in isolation with coarse res-
olution data, and successively training higher levels with finer
resolution data. We have shown that under these conditions,
the U-Net provides a slight improvement over the full CNN.
This improvement does however not appear to be statistically
significant.
We however believe that the observed trend could become significant given a larger amount of training examples. To improve the performances of the proposed model, we therefore plan to use larger annotated datasets, such as the DALI [22] or Lakh [23] datasets. We also plan to condition the network with voicing/unvoicing information using a dual loss.
Finally, we also want to continue exploring the idea that
U-Net’s higher levels can benefit from the knowledge of
lower levels, and plan to introduce an attention mechanism
between low and high resolution layers.
REFERENCES
[1] A. Klapuri, “Multiple fundamental frequency estimation by summing
harmonic amplitudes.” in ISMIR, 2006, pp. 216–221.
[2] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012.
[3] E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral
decomposition for multiple pitch estimation,” IEEE Transactions on
Audio, Speech and Language Processing, vol. 18, no. 3, pp. 528–537,
2010.
[4] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano
sounds using a new probabilistic spectral smoothness principle,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6,
pp. 1643–1654, 2010.
[5] J.-L. Durrieu, G. Richard, B. David, and C. Févotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 564–575, 2010.
[6] D. Basaran, S. Essid, and G. Peeters, “Main melody extraction with
source-filter nmf and crnn,” in Proc. ISMIR, 2018.
[7] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in ICASSP, 2012, pp. 121–124.
[8] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, “Deep
salience representations for f0 estimation in polyphonic music,” in
Proceedings of the 18th International Society for Music Information
Retrieval Conference, Suzhou, China, 2017, pp. 23–27.
[9] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network
for polyphonic piano music transcription,” IEEE/ACM Transactions on
Audio, Speech and Language Processing (TASLP), vol. 24, no. 5, pp.
927–939, 2016.
[10] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “Crepe: A convolutional
representation for pitch estimation,” arXiv preprint arXiv:1802.06182,
2018.
[11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in International Conference on
Medical image computing and computer-assisted intervention. Springer,
2015, pp. 234–241.
[12] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2015, pp. 3431–3440.
[13] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and
T. Weyde, “Singing voice separation with deep u-net convolutional
networks,” 2017.
[14] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum
learning,” in Proceedings of the 26th annual international conference
on machine learning. ACM, 2009, pp. 41–48.
[15] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A
review and new perspectives,” IEEE transactions on pattern analysis
and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[16] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and
J. P. Bello, “Medleydb: A multitrack dataset for annotation-intensive
mir research.” in ISMIR, vol. 14, 2014, pp. 155–160.
[17] J. Salamon, R. M. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello, "An analysis/synthesis framework for automatic f0 annotation of multitrack datasets," in Proceedings of the 18th ISMIR Conference, 2017.
[18] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg,
and O. Nieto, “librosa: Audio and music signal analysis in python,” in
Proceedings of the 14th python in science conference, 2015, pp. 18–25.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[20] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, 2014.
[21] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel,
J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano
transcription,” arXiv preprint arXiv:1710.11153, 2017.
[22] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "DALI: a large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm," in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, 2018.
[23] C. Raffel, Learning-based methods for comparing sequences, with
applications to audio-to-midi alignment and matching. Columbia
University, 2016.