
Re-thinking sound separation: prior information and additivity constraint in separation algorithms


Proc. of the 16th Int. Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2-6, 2013
Estefanía Cano & Christian Dittmar
Semantic Music Technologies
Fraunhofer IDMT
Ilmenau, Germany
Gerald Schuller
Institute for Media Technology
Ilmenau University of Technology
Ilmenau, Germany
ABSTRACT

In this paper, we study the effect of prior information on the quality of informed source separation algorithms. We present results with our system for solo and accompaniment separation and contrast our findings with two other state-of-the-art approaches. Results suggest that it is the separation techniques themselves, rather than the extraction of the prior information, that currently limit performance. Furthermore, we present an alternative view of the separation process in which the additivity constraint of the algorithm is removed in an attempt to maximize the obtained quality. Plausible future directions in sound separation research are discussed.
1. INTRODUCTION

Sound source separation deals with the extraction of independent sound sources from an audio mix. To address this problem, many approaches have been proposed in the literature: filtering and masking techniques, statistical approaches, perceptually motivated systems, time-frequency representations, and signal models are some of the techniques used. However, even today, sound separation is still considered an unsolved problem. The separation quality of state-of-the-art systems is very limited and depends on the type of signals used. Currently, there is still no clear direction towards a general solution to this problem.
2. MOTIVATION

After many years of research, results in the field suggest that separation performance can be improved when prior information about the sources is available. The inclusion of known information about the sources in the separation scheme is referred to as Informed Sound Source Separation (ISS) and comprises, among others, the use of MIDI-like musical scores, the use of pitch tracks of one or several sources, oracle sound separation where the original sources are available, and the extraction of model parameters from training data of a particular sound source.
In Sec. 3 we present a general overview of the state of the art in sound source separation. In an attempt to further understand the potential of current algorithms for informed sound separation, we ask ourselves the following questions: How far can we get by using prior information such as pitch or musical scores? What quality improvement could be expected from separation algorithms if we could provide more accurate prior information of this kind? To address these questions, we discuss in Sec. 5.1 results from three state-of-the-art systems for informed source separation when ground truth (or very accurate) prior information is available. In Sec. 5.2 we use the insights obtained in the previous analyses to propose an alternative to sound source separation. We also address the fundamental goal of sound separation in an attempt to gain some insight for future research directions. Concluding remarks are presented in Sec. 6.
3. STATE OF THE ART

In general, source separation approaches can be classified according to the processing technique used. Three main categories exist: statistical approaches, classical signal processing approaches, and computational auditory scene analysis (CASA) approaches. Statistical techniques for sound separation generally assume certain statistical properties of the sound sources. Systems based on Independent Subspace Analysis (ISA) [1], [2], Non-Negative Matrix Factorization (NMF) [3], [4], [5], tensor factorization [6], [7], and sparse coding [8], [9], [10] have been proposed. Signal processing approaches for sound separation use different forms of masking and filtering techniques to extract the desired sources [11], [12]. Computational auditory scene analysis (CASA) techniques have also been proposed [13], [14].
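To give a concrete flavor of the statistical family, the following minimal NMF sketch factors a toy spectrogram into spectral bases and activations using multiplicative updates for the squared-error cost. This is illustrative only and not the implementation of any of the cited systems; the toy templates and iteration count are assumptions.

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Minimal NMF via multiplicative updates for the squared-error cost:
    V (non-negative spectrogram) ~= W (spectral bases) @ H (activations)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # update bases
    return W, H

# toy spectrogram: two spectral templates active in different frames
bases = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
acts = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
V = bases @ acts
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H)   # small for this exactly rank-2 input
```

In separation systems of this kind, the learned bases are grouped per source and the corresponding partial reconstructions are used to mask the mixture.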
Many systems for sound source separation have attempted to use pitch as prior information. These systems are based on the assumption that every sound source follows a defined pitch sequence over time. The system described in [15] proposes an invertible mid-level representation of the audio signal which gives access to semantically rich salience functions for pitch and timbre content analysis. The system uses an instantaneous mixture model (IMM) which represents the audio signal as the sum of a signal of interest, i.e., the lead instrument, and a residual, i.e., the accompaniment. A source-filter model is used to represent the signal of interest: information from the source is related to the pitch of the lead instrument, and information from the filter is related to the timbre of the instrument. The residual is modeled using non-negative matrix factorization (NMF). The mid-level representation is used to separate the lead instrument from the accompaniment in conjunction with a Wiener masking approach. In [16], an approach for singing voice extraction in stereo recordings is presented that uses panning information and a probabilistic pitch tracking approach. Other approaches perform supervised pitch extraction with a subsequent separation scheme [17], [18], [19], [20].
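The common core of these pitch-informed systems can be sketched as follows: given a per-frame pitch track, build a time-frequency mask that keeps the bins around the harmonics of the solo instrument. This is a hypothetical minimal version; the bin tolerance, harmonic count, and toy pitch track are assumptions, not values from any cited system.

```python
import numpy as np

def harmonic_binary_mask(f0_track, sr, n_fft, n_harmonics=10, tol_hz=40.0):
    """Binary time-frequency mask keeping bins near the harmonics of a
    per-frame pitch track (0 Hz marks unvoiced frames)."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft          # bin center frequencies
    mask = np.zeros((n_bins, len(f0_track)))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                                  # unvoiced: keep nothing
            continue
        for h in range(1, n_harmonics + 1):
            mask[np.abs(freqs - h * f0) < tol_hz, t] = 1.0
    return mask

# toy pitch track: 100 frames at a constant 440 Hz
sr, n_fft = 44100, 2048
M = harmonic_binary_mask(np.full(100, 440.0), sr, n_fft)
# solo estimate: M * mix_stft; accompaniment estimate: (1 - M) * mix_stft
```

Because the solo and accompaniment masks sum to one in every bin, estimates obtained this way trivially reconstruct the mixture.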
Score-informed source separation extracts audio sources from
the mix by using a MIDI-like score representation of the desired
source(s). A score-informed source separation algorithm is pre-
sented in [21]. This system attempts to separate solo instruments
from their accompaniment using MIDI-like scores of the lead in-
strument as prior information. The approach uses chroma-based
dynamic time warping (DTW) to address global misalignments be-
tween the score and the audio signal. Furthermore, a MIDI confi-
dence measure is proposed to deal with small-scale misalignments.
In [22] a score-informed separation algorithm is described, which
is based on Probabilistic Latent Component Analysis (PLCA) and
the use of synthesized versions of the score as prior distributions
in the PLCA decomposition of the original mix.
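Score-to-audio alignment of this kind is commonly computed with dynamic time warping (DTW) over chroma sequences. The following self-contained sketch uses toy one-hot chroma vectors and a plain squared-cost DTW; it is not the feature pipeline or alignment variant of the cited systems.

```python
import numpy as np

def dtw_path(cost):
    """Dynamic time warping over a pairwise cost matrix.
    Returns the optimal alignment path as (row, col) index pairs."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                acc[i - 1, j], acc[i, j - 1])
    path, i, j = [], n, m                 # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy chroma: a C-E-G-C "score" against a time-stretched "performance"
score = np.eye(12)[[0, 4, 7, 0]]
audio = np.eye(12)[[0, 0, 4, 4, 7, 0]]
cost = 1.0 - score @ audio.T              # 0 where the chroma vectors match
path = dtw_path(cost)                     # maps score frames to audio frames
```

The resulting path associates each score event with the audio frames it spans, which is the alignment the separation stage then consumes.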
4. EXPERIMENTAL SETUP

In this section, we describe the two algorithms used for the experiments presented in Sections 5.1 and 5.2. The results from the different experiments are used to address the questions posed in the motivation in Sec. 2.
In [23], we propose a pitch-informed method to separate solo
instruments from accompaniment. Pitch information from the solo
instrument is extracted with an approach described in [24]. The
rough pitch estimates are refined using a linear interpolation ap-
proach where the energy of the fundamental frequency and its har-
monic components is calculated for each interpolation step. The
maximum energy is taken as an indicator of the new fundamen-
tal frequency. A harmonic component refinement stage iteratively
constructs a harmonic series for each fundamental frequency us-
ing known acoustical characteristics of musical instruments. Ini-
tial binary masks are created based on the iterative estimation and
a post-processing stage is used to take care of attack frames and re-
duce interference from percussive sources. After post-processing,
masks are no longer binary. Solo and accompaniment sources are
re-synthesized using the obtained masks. For the remainder of this
paper, this algorithm will be referred to as Cano1.
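The refinement idea can be sketched as follows. This is a hypothetical re-implementation of the general principle, not the code of [23]; the candidate range, harmonic count, and synthetic test frame are assumptions.

```python
import numpy as np

def refine_f0(spec_frame, f0_rough, sr, n_fft, n_harm=8, cents=50, steps=21):
    """Refine a rough f0 estimate: test interpolated candidates around it and
    keep the one whose harmonic series captures the most spectral energy."""
    candidates = f0_rough * 2.0 ** (np.linspace(-cents, cents, steps) / 1200.0)
    best_f0, best_energy = f0_rough, -1.0
    for f0 in candidates:
        bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / sr).astype(int)
        bins = bins[bins < len(spec_frame)]
        energy = np.sum(spec_frame[bins] ** 2)
        if energy > best_energy:
            best_f0, best_energy = f0, energy
    return best_f0

# toy magnitude frame: harmonics of 443 Hz, rough estimate 440 Hz
sr, n_fft = 44100, 4096
spec = np.zeros(n_fft // 2 + 1)
for h in range(1, 9):
    spec[round(h * 443.0 * n_fft / sr)] = 1.0
f0_refined = refine_f0(spec, 440.0, sr, n_fft)   # moves towards 443 Hz
```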
Furthermore, we present a basic modification that makes our
algorithm more suited for vocal extraction. This algorithm will be
referred to as Cano2 and simply modifies the estimation stage by
including a noise spectrum in the Harmonic Refinement stage to
capture characteristic noise-like sounds in vocal signals. Similar
approaches have also been used in [17].
For the experiments conducted in this paper, a dataset of 10 multitrack recordings was used. These recordings are part of the PEASS and BSS datasets and are freely available for download under CC license. All the signals in the dataset are vocal tracks (male or female) with accompaniment. For all signals, the multitrack recordings were mixed to obtain accompaniment tracks, solo tracks, and a final monaural mix. The signals are described in Table 1.
Recognizing the importance of including perceptual aspects in the evaluation of sound separation results, we use the PEASS Toolkit (Perceptual Evaluation Methods for Audio Source Separation) [25] to measure the quality of the separated signals. The PEASS Toolkit provides a set of four objective perceptual measures for separation quality assessment, i.e., the Overall Perceptual Score (OPS), Target Perceptual Score (TPS), Interference Perceptual Score (IPS), and Artifact Perceptual Score (APS). For reference purposes, we also present common objective scores based on energy ratios, i.e., the Signal to Distortion Ratio (SDR), Image to Spatial Distortion Ratio (ISR), Signal to Interference Ratio (SIR), and Signal to Artifact Ratio (SAR).
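For orientation, the energy-ratio family can be illustrated with a deliberately simplified SDR that treats the entire estimation error as distortion. The actual BSS Eval and PEASS measures decompose the error into target, interference, and artifact components before forming ratios; the signal below is a toy example.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Signal-to-distortion ratio in dB, treating the whole error as distortion."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(t.size)
sdr = simple_sdr(clean, noisy)   # roughly 17 dB for this noise level
```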
5. EXPERIMENTS

5.1. Prior Information in Separation Algorithms
In this section, we evaluate the effects of prior information on the quality of the proposed approach and contrast our findings with two other state-of-the-art algorithms.
We refer to our previous work on pitch-informed sound sepa-
ration (Cano1) to separate audio recordings into solo and accom-
paniment tracks. It is to be noted that both the pitch extraction and
separation stages in Cano1 are completely automatic. To create
ground truth information, the Songs2See Editor [19] was used to
manually correct and refine the pitch extraction of the solo instru-
ment. We used the corrected pitch sequences to feed our separation
algorithm and obtain solo and accompaniment tracks by bypassing
the automatic pitch detection stage. The goal of this experiment is
to assess the potential of the separation algorithm when accurate
prior information is available. The algorithm that uses ground truth
pitch information will be referred to as CanoU.
For this study, signals 1, 8, and 9 from our dataset were selected to obtain ground truth pitch information and perform separation with the two algorithms (Cano1, CanoU). Results are presented in Table 2. The perceptual measures tend to show a quality improvement when accurate pitch information is provided to the algorithm. However, the more interesting observation is that, even though there is a quality improvement, results are far from reaching maximum quality scores. The maximum Overall Perceptual Score (OPS) obtained was 32.02 for accomp9, and high Interference Perceptual Scores (IPS) are obtained in general, reaching the highest score of 74.8 in solo1. The highest scores obtained for the four perceptual measures are written in bold font in Table 2.
Given that these results only represent the particular case of
our algorithm and cannot be generalized, we revisit the results pre-
sented by [15] and [21] to get a wider view of the performance of
separation algorithms. The details of these algorithms were briefly
described in Sec. 3. For the remainder of this paper, these al-
gorithms will be referred to as Durrieu and Bosch respectively.
In both cases, a system for solo/accompaniment separation is pre-
sented. Similarly, the authors present results comparing the perfor-
mance of the fully automatic algorithm with the one obtained when
ground truth information is available. In [17], Durrieu presents a
user-assisted algorithm where the pitch track of the solo instru-
ment can be refined using a specially designed user interface. This
algorithm will be referred to as DurrieuU. In the case of [21], the
authors manually align the scores to the audio tracks and use them
as ground truth information. The user-assisted version of this al-
gorithm will be referred to as BoschU.
In Table 3 we show some of the results presented by the au-
thors related to these four algorithms. In the case of Durrieu and
DurrieuU, we present the average scores for solo extraction ob-
tained in the 2011 Signal Separation Evaluation Campaign (SiSEC11),
which can be accessed on the campaign’s website. In this case,
both the PEASS measures and the energy ratios are available. In
the case of Bosch and BoschU, the results for solo and accompa-
niment extraction for the dataset S1-D3 using a Wall-N mask are
Table 1: Data set used: In the text, signal numbers are used to refer to each of the signals.
Signal Num. Name Segment [sec]
1 Bearlin Roads 85 - 99
2 Tamy que pena tanto faz 6 - 19
3 Another Dreamer 69 - 94
4 Ultimate nz tour 43 - 61
5 Dreams 0 - 35
6 Life as a disturbed infobeing 0 - 57
7 Mix Tape 7 - 53
8 The ones we love 32 - 48
9 We weren’t there 0 - 32
10 Wreck 15 - 34
Table 2: Perceptual scores and energy ratios for solo and accompaniment separation with automatic pitch (Cano1) and ground-truth pitch (CanoU) on signals 1, 8, and 9.

Signal  Source  Version  OPS    TPS    IPS    APS    SDR    ISR    SIR    SAR
1       solo    Cano1    20.01   2.41  72.25   6.52  -5.19  -4.15   5.26   9.85
1       solo    CanoU    22.09   3.17  74.80   9.21  -6.71  -5.99   8.38  11.58
1       accomp  Cano1    24.99  34.38  57.30  36.97  -3.37  -3.25  12.66  15.55
1       accomp  CanoU    27.08  39.63  54.31  40.64  -3.54  -3.42  13.02  17.13
8       solo    Cano1    19.31   2.60  51.16   9.36  -4.91  -3.70   3.17  11.37
8       solo    CanoU    23.23   3.20  72.73   8.49  -4.43  -3.72   6.47  12.67
8       accomp  Cano1    26.60  43.46  62.41  42.53  -3.54  -3.28  10.73  13.46
8       accomp  CanoU    30.73  56.13  55.33  48.41  -3.62  -3.41  12.65  14.66
9       solo    Cano1    25.66   0.96  67.70   2.23  -3.64  -3.11   8.47  10.23
9       solo    CanoU    18.05   3.80  61.31  10.83  -3.72  -3.01   5.81  11.95
9       accomp  Cano1    30.84  29.03  59.74  33.41  -3.44  -3.15   9.87  14.48
9       accomp  CanoU    32.02  30.44  53.96  33.69  -2.81  -2.55   9.59  15.12
presented. In this case, only the Signal to Distortion Ratio (SDR) is available. We refer the reader to [21] for details. It is important to bear in mind that the three algorithms presented in this section (Cano, Durrieu, Bosch) make use of different datasets and their results cannot be used for direct comparison. The goal of presenting these results is to describe a similar phenomenon occurring in different algorithms, not to perform a direct comparison between them.
A similar behavior is observed in both of the comparison al-
gorithms. The use of ground truth prior information tends to result
in higher quality measures. However, two important observations
can be made: (1) Scores with ground truth information are still
far from reaching maximum levels, and (2) Quality differences be-
tween ground truth and automatically extracted pitch are marginal.
In general, results reveal that there is still much room for improvement when it comes to informed sound separation algorithms. Including prior information obviously benefits performance, but it is hard to envision a generalized and robust solution given current results. This leads us to two possible paths: (1) On the one hand, we could consider the possibility that the type of prior information we are currently using does not carry enough signal detail to allow robust separation, and consider possibilities to enhance the information used. Taking into account the great diversity not only of musical signals, but also of playing styles, genres, and recording conditions which a separation algorithm can encounter, the expectation that information as general as pitch could suffice to guide the separation scheme falls short. We could then
consider including, besides pitch information, other types of infor-
mation that allow better characterization of sound sources. This
would naturally lead to the development of target-designed algo-
rithms optimized for the extraction of a particular class of signals.
The use of instrument-specific information and instrument models
within the separation schemes could be an option as for example
in [26]. (2) On the other hand, we could consider the possibility of further improving our current sound analysis and synthesis methods so that more accurate information can be extracted. Work on this topic includes developments of reassignment and derivative methods, among others [27]. In [28] and [29], the authors already recognize the limitations of current analysis techniques and propose an informed-analysis front-end to improve sound source separation. In these approaches, prior information is included directly in the analysis in the form of watermarks and bits of ground truth information, respectively. In both cases, separation results show an improvement. However, these approaches have a limitation in the sense that the original signals need to be available to extract the prior information inserted in the analysis. Having the original,
Table 3: Comparison between two automatic algorithms and their corresponding user-assisted versions.

Algorithm [source]     OPS   TPS   IPS   APS   SDR         ISR   SIR   SAR
Durrieu [solo]         22.4  28.8  59.0  30.8  3.8         6.2   Inf   3.1
DurrieuU [solo]        26.0  28.4  61.1  29.9  5.4         8.3   Inf   5.4
Bosch [accomp/solo]    -     -     -     -     10.35/6.20  -     -     -
BoschU [accomp/solo]   -     -     -     -     10.46/6.31  -     -     -
unmixed signals is not always possible.
There is, however, an alternative possibility to be explored. Results have shown that not only is the available information about the sources of critical importance for separation performance, but also the mechanisms used to include such information in the separation scheme. Including prior information in the time-frequency domain (like pitch or MIDI scores) has proven to contribute to the quality of sound separation. Including information in the analysis stage (like watermarks and bits of information) has also proven to improve separation, but this option comes with the difficulty of requiring the original sources to extract the prior information. The third option is then, naturally, to include prior information about the sources in the synthesis stage. Here again, separation algorithms would be
optimized to deal with a particular class of signals. Instrument synthesis models, developed and trained off-line, could be used to recreate the original signals as closely as possible. This option leaves open the possibility to include information in the time-frequency transform from domains such as pitch, timbre, scores, etc. Furthermore, having the original signals would not be required. Another important characteristic of such an approach is that a strict constraint to exactly reconstruct the original mix from the extracted sources could not be set. In the remainder of this paper, the hard constraint imposed on most separation algorithms to exactly reconstruct the mix from the extracted sources is referred to as the additivity constraint. With the third option, the analysis and synthesis stages of separation algorithms would most likely use different signal processing techniques, and such a constraint would be difficult to impose.
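The point about the additivity constraint can be made concrete with magnitude masks. In this toy numerical sketch (not any system from this paper), complementary Wiener masks reconstruct the mix exactly, while a mask tuned independently for one source breaks that guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)
S = np.abs(rng.standard_normal((64, 32)))    # stand-in solo magnitude model
A = np.abs(rng.standard_normal((64, 32)))    # stand-in accompaniment model
mix = S + A                                   # magnitude-additive toy mix

# Complementary Wiener masks sum to one, so the estimates add up to the mix:
m_solo = S ** 2 / (S ** 2 + A ** 2)
m_acc = A ** 2 / (S ** 2 + A ** 2)
additive_err = np.max(np.abs(m_solo * mix + m_acc * mix - mix))     # ~ 0

# A mask optimized independently for one source need not be complementary:
m_solo_alt = m_solo ** 0.5                    # e.g. a softer mask for the solo only
residual = np.max(np.abs(m_solo_alt * mix + m_acc * mix - mix))     # clearly > 0
```

Dropping the additivity constraint amounts to accepting a non-zero residual of this kind in exchange for per-source quality.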
5.2. Redefining Sound Separation
After the discussion presented in Sec. 5.1, there is still an open
question that we wish to address: Is there any possibility to obtain
a performance gain with our current separation approaches without
fundamentally changing them?
To address this question, we created different versions of our separation algorithm, which are described in Table 4. For each version, the pitch detection and spectral component estimation are kept unchanged, but different spectral masking techniques are used to obtain the resulting signals. For the cases where Wiener filtering is used, p denotes the power to which each individual element of the spectrograms is raised. In versions 1 and 6, the post-processing stage of the algorithm is bypassed to allow binary and Wiener masking, respectively. We refer the reader to [23] for algorithm details. As can be seen, the modifications performed to the algorithm are rather basic and do not fundamentally change the original system. However, each one of them has clear effects on its performance.
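The Wiener variants follow the generalized form sketched below, where the magnitude estimates are raised to a power p before forming the ratio. This is a schematic illustration; the toy matrices are assumptions, not data from the experiments.

```python
import numpy as np

def wiener_mask(solo_mag, accomp_mag, p=2.0):
    """Generalized Wiener mask: magnitudes raised to the power p before the
    ratio. p=2 is the classical power-spectrogram Wiener filter; small
    exponents such as p=0.3 compress the dynamics of the mask."""
    sp, ap = solo_mag ** p, accomp_mag ** p
    return sp / (sp + ap + 1e-12)             # epsilon avoids division by zero

solo_model = np.array([[1.0, 0.1], [0.5, 2.0]])
accomp_model = np.array([[0.5, 1.0], [0.5, 0.1]])
m2 = wiener_mask(solo_model, accomp_model, p=2.0)     # m2[0, 0] = 0.8
m03 = wiener_mask(solo_model, accomp_model, p=0.3)    # masks pulled towards 0.5
```

Larger p pushes the mask towards a binary decision; smaller p yields softer masks that trade interference suppression for fewer artifacts.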
For this experiment we use the ten tracks from the dataset
described in Table 1 and for each one of them, solo and accom-
paniment tracks are extracted using each of the seven algorithm
versions. As in Sec. 5.1, the PEASS Toolkit is used for qual-
ity evaluation and measures based on energy ratios are presented
for reference. Separation results for the solo and accompaniment
signals using the PEASS Toolkit are presented in Figures 1 and
2 respectively. Similarly, results for the solo and accompaniment
signals using the energy ratios are presented in Figures 3 and 4.
In all figures, the mean performance over the ten signals is presented for the seven algorithm versions. The whiskers indicate the standard deviation of each version, and the highest score among all versions is shown with an inner (blue) dot in the marker.
Figure 1: Mean perceptual scores for the solo signals : Overall
Perceptual Score (OPS), Target Perceptual Score (TPS), Interfer-
ence Perceptual Score (IPS), Artifact Perceptual Score (APS).
As can be seen in Fig. 1, for three of the four perceptual scores (OPS, TPS, APS) for the solo tracks, the highest mean performance is obtained with the Cano2 algorithm (version 5). This is, however, an expected result, as the algorithm was specifically modified to better handle vocal signals. On the other hand, results for the accompaniment tracks differ. As shown in Fig. 2, the highest scores for three of the measures (OPS, TPS, APS) are obtained with different versions of the Cano1 algorithm. For the Interference Perceptual Score (IPS), the highest mean performance is obtained with the Cano2 algorithm. This result further evidences that
Table 4: Algorithm versions obtained with the different masking and post-processing approaches
Version Description
1 Binary
2 Cano1
3 Cano1 + Wiener [p=2]
4 Cano2 + Wiener [p=2]
5 Cano2
6 Wiener [p=0.3]
7 Cano1 + Wiener [p=0.3]
Figure 2: Mean perceptual scores for the accompaniment sig-
nals: Overall Perceptual Score (OPS), Target Perceptual Score
(TPS), Interference Perceptual Score (IPS), Artifact Perceptual
Score (APS).
better solo extraction is obtained with the Cano2 algorithm since, for the backing track, the vocal signal is in this case the only source of interference.
Results suggest that, for the particular task of solo and accompaniment separation, the highest perceptual scores can be obtained differently for each of the desired sources. Algorithm modifications that might benefit solo extraction can potentially have a negative effect on the performance of accompaniment extraction. In this line of thought, we conducted informal tests using the Cano2 algorithm to extract solo tracks from different musical instruments (clarinet, trumpet, and saxophone). In all cases, results suggested a decrease in solo extraction performance. This further confirms the idea that, for our current approach, performance might be maximized if different versions of the algorithm are used for the
solo and accompaniment tracks. This brings us back to the concept of the additivity constraint presented in Sec. 5.1. Bearing in mind the possibilities and limitations of our current sound analysis techniques, and the fact that theoretical bounds exist for them, allowing separation algorithms to extract sources with the goal of maximizing perceptual quality, instead of reconstructing the original mix, might bring us better final results.

Figure 3: Mean energy ratios for the solo signals: Signal to Distortion Ratio (SDR), Image to Spatial Distortion Ratio (ISR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR).
Following this line of thought, the idea of moving from separation to understanding, presented in a keynote by Smaragdis, becomes relevant. In most cases, source separation is not the final goal but, most likely, an intermediate step towards further types of processing: more accurate music transcription, re-mixing of audio tracks, audio classification, etc. In that sense, extracting the exact original source might not even be necessary for the final application. Different applications might impose different quality requirements: music transcription, for example, would most likely require high Interference Perceptual Scores (IPS) for robust performance, as pitch tracks of the independent sources are the final goal. On the other hand, IPS requirements might not be so strict when it comes to re-mixing audio tracks. There, minimizing artifacts and preserving the sources are probably more relevant, and consequently higher APS and TPS might be required. Thus, considering the final processing goal and its quality requirements, instead of focusing on the separation task only, might bring better overall results and open possibilities for further analyses.

Figure 4: Mean energy ratios for the accompaniment signals: Signal to Distortion Ratio (SDR), Image to Spatial Distortion Ratio (ISR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR).
6. CONCLUSIONS

We have addressed two defining topics in informed sound separation research: (1) The effects of pitch and score information on the performance of separation algorithms were studied, showing that the attainable quality, even when accurate prior information is available, still fails to reach maximal scores. We reviewed three possibilities for including prior information in separation approaches, namely in the analysis, time-frequency transform, and synthesis stages. Due to the processing possibilities and flexibility that it provides to the separation scheme, we see great potential in including information directly in the synthesis stage. Future work will be conducted in this direction. (2) We propose removing the additivity constraint to improve separation quality, and as a future direction where algorithms could have completely independent analysis and synthesis approaches. In this sense, we redefine the goal of separation research from extracting the original sources from the mix to obtaining sources that meet the perceptual quality requirements imposed by different applications, allowing separation schemes to be used as intermediate processing steps.
7. REFERENCES

[1] Derry FitzGerald, R. Lawlor, and Eugene Coyle, “Sub-band independent subspace analysis for drum transcription,” in 5th International Conference on Digital Audio Effects (DAFx-02), Hamburg, Germany, 2002, pp. 1–5.
[2] Christian Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from polyphonic music using independent subspace analysis,” in 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan, 2003, pp. 843–848.
[3] Romain Hennequin, Bertrand David, and Roland Badeau, “Score informed audio source separation using a parametric model of non-negative spectrogram,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), Prague, Czech Republic, 2011, pp. 45–48.
[4] S. Kirbiz and Paris Smaragdis, “Adaptive time-frequency resolution for single channel sound source separation based on non-negative matrix factorization,” in IEEE 19th Signal Processing and Communications Applications Conference (SIU 2011), Antalya, Turkey, 2011, pp. 964–967.
[5] A. Lefevre, Francis Bach, and C. Févotte, “Itakura-Saito non-negative matrix factorization with group sparsity,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), Prague, Czech Republic, 2011, pp. 21–24.
[6] Derry Fitzgerald, Matt Cranitch, and Eugene Coyle, “Us-
ing tensor factorisation models to separate drums from poly-
phonic music,” in 12th International Conference on Digital
Audio Effects (DAFx-09), Como, Italy, 2009, pp. 1–5.
[7] U. Simsekli and A. Cemgil, “Score guided musical source separation using generalized coupled tensor factorization,” in 20th European Signal Processing Conference (EUSIPCO 2012), Bucharest, Romania, 2012.
[8] Mark D. Plumbley, Thomas Blumensath, Laurent Daudet,
Rémi Gribonval, and Mike Davies, “Sparse representations
in audio and music: from coding to source separation,” Pro-
ceedings of the IEEE, vol. 98, no. 6, pp. 995–1005, 2010.
[9] Samer A. Abdallah and Mark D. Plumbley, “Unsupervised analysis of polyphonic music by sparse coding,” IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 179–196, Jan. 2006.
[10] Rémi Gribonval and Sylvain Lesage, “A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges,” in European Symposium on Artificial Neural Networks (ESANN 2006), Bruges, Belgium, 2006.
[11] Derry FitzGerald, “Harmonic/percussive separation using median filtering,” in 13th International Conference on Digital Audio Effects (DAFx-10), Graz, Austria, 2010, pp. 10–13.
[12] David Gunawan and Deep Sen, “Separation of Harmonic
Musical Instrument Notes Using Spectro-Temporal Mod-
eling of Harmonic Magnitudes and Spectrogram Inversion
with Phase Optimization,” Journal of the Audio Engineering
Society (AES), vol. 60, no. 12, 2012.
[13] L. Drake, J. Rutledge, J. Zhang, and A. Katsaggelos, “A Computational Auditory Scene Analysis-Enhanced Beamforming Approach for Sound Source Separation,” EURASIP Journal on Advances in Signal Processing, vol. 2009, Article ID 403681, 2009.
[14] Yipeng Li, John Woodruff, and D. Wang, “Monaural musical sound separation based on pitch and common amplitude modulation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1361–1371, 2009.
[15] Jean-Louis Durrieu, Bertrand David, and Gaël Richard, “A
musically motivated mid-level representation for pitch esti-
mation and musical audio source separation,” IEEE Journal
of Selected Topics in Signal Processing, vol. 5, no. 6, pp.
1180–1191, 2011.
[16] Ricard Marxer, Jordi Janer, and Jordi Bonada, “Low-Latency
instrument separation in polyphonic audio using timbre mod-
els,” Latent Variable Analysis and Signal Separation, vol.
7191, pp. 314–321, 2012.
[17] Jean-Louis Durrieu and Jean-Philippe Thiran, “Musical audio source separation based on user-selected F0 track,” Latent Variable Analysis and Signal Separation, pp. 1–8, 2012.
[18] Derry FitzGerald, “User assisted separation using tensor
factorisation,” in 20th European Signal Processing Con-
ference (EUSIPCO 2012), Bucharest, Romania, 2012.
[19] Estefanía Cano, Christian Dittmar, and Sascha Grollmisch,
“Songs2See: Learn to Play by Playing,” in 12th International
Society for Music Information Retrieval Conference (ISMIR
2011), Miami, USA, 2011.
[20] Benoit Fuentes, Roland Badeau, and Gaël Richard, “Blind
Harmonic Adaptive Decomposition Applied to Supervised
Source Separation,” in 20th European Signal Processing
Conference (EUSIPCO 2012), Bucharest, Romania, 2012,
pp. 2654–2658.
[21] Juan J. Bosch, K. Kondo, R. Marxer, and J. Janer, “Score-
informed and timbre independent lead instrument separation
in real-world scenarios,” in 20th European Signal Processing
Conference (EUSIPCO 2012), Bucharest, Romania, 2012,
pp. 2417–2421.
[22] Joachim Ganseman, Paul Scheunders, Gautham J. Mysore,
and Jonathan S. Abel, “Evaluation of a score-informed
source separation system,” in 11th International Society
for Music Information Retrieval Conference (ISMIR 2010),
Utrecht, Netherlands, 2010.
[23] Estefanía Cano, Christian Dittmar, and Gerald Schuller, “Ef-
ficient Implementation of a System for Solo and Accompa-
niment Separation in Polyphonic Music,” in 20th European
Signal Processing Conference (EUSIPCO 2012), Bucharest,
Romania, 2012, pp. 285–289.
[24] Karin Dressler, “Pitch Estimation by pair-wise Evaluation of
Spectral Peaks,” in Proceedings of the AES 42nd Conference
on Semantic Audio, Ilmenau, Germany, 2011, pp. 278–290.
[25] Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and
Volker Hohmann, “Subjective and Objective Quality Assess-
ment of Audio Source Separation,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 19, no. 7, pp.
2046–2057, Sept. 2011.
[26] Mathieu Coïc and Juan José Burred, “Bayesian Non-negative
Matrix Factorization with Learned Temporal Smoothness
Priors,” in International Conference on Latent Variable
Analysis and Signal Separation (LVA/ICA), Tel-Aviv, Israel,
2012, pp. 280–287.
[27] Sylvain Marchand and Philippe Depalle, “Generalization of
the derivative analysis method to non-stationary sinusoidal
modeling,” in 11th International Conference on Digital Au-
dio Effects (DAFx-08), Espoo, Finland, 2008, pp. 1–8.
[28] Mathieu Parvaix, Laurent Girin, and Jean-Marc Brossier, “A
watermarking-based method for single-channel audio source
separation,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2009), Taipei, Tai-
wan, 2009, pp. 101–104.
[29] Sylvain Marchand and Dominique Fourer, “Breaking the
bounds: Introducing informed spectral analysis,” in 13th In-
ternational Conference on Digital Audio Effects (DAFx-10),
Graz, Austria, 2010, pp. 1–8.
[30] M. A. Casey and A. Westner, “Separation of mixed audio
sources by independent subspace analysis,” in International
Computer Music Conference (ICMC 2000), Berlin, Ger-
many, 2000.
[31] Niall Cahill, Rory Cooney, and Robert Lawlor, “An En-
hanced Implementation of the ADRess (Azimuth Discrim-
ination and Resynthesis) Music Source Separation Algo-
rithm,” in 121st Convention of the Audio Engineering So-
ciety (AES), San Francisco, USA, 2006.