Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10), Graz, Austria, September 6-10, 2010
HARMONIC/PERCUSSIVE SEPARATION USING MEDIAN FILTERING
Derry FitzGerald
Audio Research Group
Dublin Institute of Technology
Kevin St., Dublin 2, Ireland
derry.fitzgerald@dit.ie

This work was supported by Science Foundation Ireland's Stokes Lecturer Program.
ABSTRACT
In this paper, we present a fast, simple and effective method to separate the harmonic and percussive parts of a monaural audio signal. The technique involves the use of median filtering on a spectrogram of the audio signal, with median filtering performed across successive frames to suppress percussive events and enhance harmonic components, while median filtering is also performed across frequency bins to enhance percussive events and suppress harmonic components. The two resulting median filtered spectrograms are then used to generate masks which are then applied to the original spectrogram to separate the harmonic and percussive parts of the signal. We illustrate the use of the algorithm in the context of remixing audio material from commercial recordings.
1. INTRODUCTION
The separation of harmonic and percussive sources from mixed audio signals has numerous applications, both as an audio effect for the purposes of remixing and DJing, and as a preprocessing stage for other purposes. These include the automatic transcription of pitched instruments, key signature detection and chord detection, where elimination of the effects of the percussion sources can help improve results. Similarly, the elimination of the effects of pitched instruments can help improve results for the automatic transcription of drum instruments, rhythm analysis and beat tracking.
Recently, the authors proposed a tensor factorisation based algorithm capable of obtaining good quality separation of harmonic and percussive sources [1]. This algorithm incorporated an additive synthesis based source-filter model for pitched instruments, as well as constraints to encourage temporal continuity on pitched sources. A principal advantage of this approach was that it required little or no pretraining in comparison to many other approaches [2, 3, 4]. Unfortunately, a considerable shortcoming of the tensor factorisation approach is that it is both processor and memory intensive, making it impractical for use when whole songs need to be processed, for example when remixing a song.
In an effort to overcome this, it was decided to investigate other approaches capable of separating harmonic and percussive components without pretraining, but which were also computationally less intensive. Of particular interest was the approach developed by Ono et al. [5]. This technique was based on the intuitive idea that stable harmonic or stationary components form horizontal ridges on the spectrogram, while percussive components form vertical ridges with a broadband frequency response. This can be seen in Figure 1, where the harmonic components are visible as horizontal lines, while the percussive events can be seen as vertical lines. Therefore, a process which emphasises the horizontal lines in the spectrogram while suppressing vertical lines should result in a spectrogram which contains mainly pitched sources, and vice-versa for the vertical lines to recover the percussion sources. To this end, a cost function which minimised the $L_2$ norm of the power spectrogram gradients was proposed.
Figure 1: Spectrogram of pitched and percussive mixture
Letting $W_{h,i}$ denote the element of the power spectrogram $W$ of a given signal at frequency bin $h$ and the $i$th time frame, and similarly defining $H_{h,i}$ as an element of the harmonic power spectrogram $H$, and $P_{h,i}$ as an element of the percussive power spectrogram $P$, the cost function can then be defined as:

$$J(H, P) = \frac{1}{2\sigma_H^2} \sum_{h,i} (H_{h,i-1} - H_{h,i})^2 + \frac{1}{2\sigma_P^2} \sum_{h,i} (P_{h-1,i} - P_{h,i})^2 \quad (1)$$

where $\sigma_H$ and $\sigma_P$ are parameters used to control the weights of the harmonic and percussive smoothness respectively. The cost function is further subject to the additional constraints that

$$H_{h,i} + P_{h,i} = W_{h,i} \quad (2)$$
$$H_{h,i} \ge 0, \quad P_{h,i} \ge 0 \quad (3)$$

In effect, this is equivalent to assuming that the spectrogram gradients $(H_{h,i-1} - H_{h,i})$ and $(P_{h-1,i} - P_{h,i})$ follow Gaussian distributions. This is not the case, and so a compressed version of the power spectrogram, $\tilde{W} = W^{\gamma}$ where $0 < \gamma \le 1$, is used instead to partially compensate for this. Iterative update equations to minimise $J(H, P)$ for $H$ and $P$ were then derived, and the recovered harmonic and percussive spectrograms used to generate masks which were then applied to the original spectrogram before inversion to the time domain.
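As a small illustration (this fragment is ours, not taken from [5]), the compression step is simply an elementwise power of the spectrogram; the value of $\gamma$ used here is an arbitrary example:

```python
import numpy as np

gamma = 0.5                    # illustrative choice; the method only requires 0 < gamma <= 1
W = np.random.rand(1025, 200)  # stand-in power spectrogram (frequency bins x time frames)
W_tilde = W ** gamma           # compressed spectrogram used in place of W
```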
In [6] an alternative cost function based on the generalised Kullback-Leibler divergence was proposed:

$$J_{KL}(H, P) = \sum_{h,i} \left\{ W_{h,i} \log \frac{W_{h,i}}{H_{h,i} + P_{h,i}} - W_{h,i} + H_{h,i} + P_{h,i} \right\} + \frac{1}{2\sigma_H^2} \sum_{h,i} \left( \sqrt{H_{h,i-1}} - \sqrt{H_{h,i}} \right)^2 + \frac{1}{2\sigma_P^2} \sum_{h,i} \left( \sqrt{P_{h-1,i}} - \sqrt{P_{h,i}} \right)^2 \quad (4)$$

and new update equations for $H$ and $P$ were derived from this cost function. A real-time implementation of the algorithm using a sliding block analysis, rather than processing the whole signal, was also implemented and described in [6].
The system was shown to give good separation performance at low computational cost, thereby making it suitable as a preprocessor for other applications. Further, the underlying principle of the algorithm represents a simple intuitive idea that can be used to derive alternate means of separating harmonic and percussive components, as will be seen in the next section.
2. MEDIAN FILTERING BASED SEPARATION
As was shown previously, regarding percussive events as vertical lines and harmonic events as horizontal lines in a spectrogram is a useful approximation when attempting to separate harmonic and percussive sources. Taking the percussive events as an example, the algorithms described above in effect smooth out the frequency spectrum in a given time frame by removing large "spikes" in the spectrum which correspond to harmonic events. Similarly, harmonic events in a given frequency bin are smoothed out by removing "spikes" related to percussive events. Another way of looking at this is to regard harmonic events as outliers in the frequency spectrum at a given time frame, and to regard percussive events as outliers across time in a given frequency bin. This brings us to the concept of using median filters individually in the horizontal and vertical directions to separate harmonic and percussive events.
Median filters have been used extensively in image processing for removing speckle noise and salt and pepper noise from images [7]. Median filters operate by replacing a given sample in a signal by the median of the signal values in a window around the sample. Given an input vector $x(n)$, let $y(n)$ be the output of a median filter of length $l$, where $l$ defines the number of samples over which median filtering takes place. Where $l$ is odd, the median filter can be defined as:

$$y(n) = \operatorname{median}\{x(n-k : n+k), \; k = (l-1)/2\} \quad (5)$$
In effect, the original sample is replaced with the middle value obtained from a sorted list of the samples in the neighbourhood of the original sample. In cases where $l$ is even, the median is obtained as the mean of the two values in the middle of the sorted list. As opposed to moving average filters, median filters are effective in removing impulse noise because they do not depend on values which are outliers from the typical values in the region around the original sample.
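As an illustration (our code, not part of the original paper), the filter of Equation (5) can be written directly in a few lines; zero-padding at the edges is our assumption, as edge handling is not specified above. scipy.signal.medfilt implements the same windowed-median operation:

```python
import numpy as np

def median_filter_1d(x, l):
    """Minimal sketch of Eq. (5): replace each sample by the median of
    an l-sample window centred on it (odd l, zero-padded edges)."""
    k = (l - 1) // 2
    padded = np.pad(x, k, mode="constant")  # zero-pad; edge handling is a choice
    return np.array([np.median(padded[n:n + l]) for n in range(len(x))])
```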
A number of examples are now presented to illustrate the effects of median filtering in suppressing harmonic and percussive events in audio spectrograms. Figure 2(a) shows the plot of a frequency spectrum containing a mixture of noise from a snare drum and notes played by a piano. The harmonics from the piano are clearly visible as large spikes in the spectrum. Figure 2(b) shows the same spectrum after median filtering with a filter length of 17. It can be seen that the spikes associated with the harmonics have been suppressed, leaving a spectrum where the drum noise now predominates. Similarly, Figure 3(a) shows the output of a frequency bin across time, again taken from a mixture of snare drum and piano. The onset of the snare is clearly visible as a large spike in energy in the frequency bin, while the harmonic energy is more constant over time. Figure 3(b) shows the output of the frequency bin after median filtering, and it can be appreciated that the spike associated with the onset is removed by median filtering, thereby suppressing the energy due to the percussion event.
Figure 2: Spectrogram frame containing mixture of piano and snare. a) Original spectrum, b) Spectrum after median filtering
Given an input magnitude spectrogram $S$, and denoting the $i$th time frame as $S_i$ and the $h$th frequency slice as $S_h$, a percussion-enhanced spectrogram frame $P_i$ can be generated by performing median filtering on $S_i$:

$$P_i = \mathcal{M}\{S_i, l_{perc}\} \quad (6)$$

where $\mathcal{M}$ denotes median filtering and $l_{perc}$ is the filter length of the percussion-enhancing median filter. The individual percussion-enhanced frames $P_i$ are then combined to yield a percussion-enhanced spectrogram $P$. Similarly, a harmonic-enhanced spectrogram frequency slice $H_h$ can be obtained by median filtering the frequency slice $S_h$:

$$H_h = \mathcal{M}\{S_h, l_{harm}\} \quad (7)$$

where $l_{harm}$ is the length of the harmonic median filter. The slices are then combined to give a harmonic-enhanced spectrogram $H$.

Figure 3: Spectrogram frequency slice containing mixture of piano and snare. a) Original slice, b) Slice after median filtering
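The slice-by-slice operations in Equations (6) and (7) can be carried out in one call each by median filtering the whole spectrogram along a single axis. The sketch below is our illustration (the function and variable names are ours, and SciPy is assumed to be available):

```python
import numpy as np
import scipy.ndimage

def enhance(S, l_harm=17, l_perc=17):
    """Median-filter a magnitude spectrogram S (frequency bins x time frames).

    H: each frequency slice filtered across time frames, Eq. (7).
    P: each time frame filtered across frequency bins, Eq. (6).
    """
    H = scipy.ndimage.median_filter(S, size=(1, l_harm))  # suppresses percussive spikes
    P = scipy.ndimage.median_filter(S, size=(l_perc, 1))  # suppresses harmonic spikes
    return H, P
```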
The resulting harmonic-enhanced and percussion-enhanced spectrograms can then be used to generate masks which can then be applied to the original spectrogram. Two families of masks were investigated for the separation of the sources. The first of these is a hard or binary mask, where it is assumed that each frequency bin in the spectrogram belongs either to the percussion or to the harmonic source. In this case, the masks are defined as:

$$M^H_{h,i} = \begin{cases} 1, & \text{if } H_{h,i} > P_{h,i} \\ 0, & \text{otherwise} \end{cases} \quad (8)$$

$$M^P_{h,i} = \begin{cases} 1, & \text{if } P_{h,i} > H_{h,i} \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

The second family of masks are soft masks based on Wiener filtering and are defined as:

$$M^H_{h,i} = \frac{H^p_{h,i}}{H^p_{h,i} + P^p_{h,i}} \quad (10)$$

$$M^P_{h,i} = \frac{P^p_{h,i}}{H^p_{h,i} + P^p_{h,i}} \quad (11)$$

where $p$ denotes the power to which each individual element of the spectrograms is raised. Typically $p$ is given a value of 1 or 2.
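Both mask families are simple elementwise operations on $H$ and $P$. A minimal sketch (our code; the small eps term guarding against division by zero is our addition and is not discussed in the text):

```python
import numpy as np

def masks(H, P, p=2.0, eps=1e-12):
    """Binary masks, Eqs. (8)-(9), and Wiener-style soft masks, Eqs. (10)-(11)."""
    mask_h_bin = (H > P).astype(float)    # Eq. (8)
    mask_p_bin = (P > H).astype(float)    # Eq. (9)
    Hp, Pp = H ** p, P ** p
    mask_h_soft = Hp / (Hp + Pp + eps)    # Eq. (10)
    mask_p_soft = Pp / (Hp + Pp + eps)    # Eq. (11)
    return mask_h_bin, mask_p_bin, mask_h_soft, mask_p_soft
```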
Complex spectrograms are then recovered for inversion from:

$$\hat{H} = \hat{S} \otimes M^H \quad (12)$$

and

$$\hat{P} = \hat{S} \otimes M^P \quad (13)$$

where $\otimes$ denotes elementwise multiplication and $\hat{S}$ denotes the original complex-valued spectrogram. These complex spectrograms are then inverted to the time domain to yield the separated harmonic and percussive waveforms respectively.
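Assembling Equations (6)-(13) gives the complete separation in one short function. The sketch below is our illustration rather than the author's implementation: it assumes SciPy's stft/istft with 75% overlap as in Section 3 and SciPy's default Hann window (the paper does not state the analysis window), and uses the soft masks:

```python
import numpy as np
import scipy.ndimage
from scipy.signal import stft, istft

def hpss_median(x, fs, n_fft=4096, hop=1024, l_filt=17, p=2.0):
    """Sketch: median-filtering harmonic/percussive separation of a mono signal x."""
    _, _, S_hat = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)  # complex spectrogram
    S = np.abs(S_hat)                                               # magnitude spectrogram
    H = scipy.ndimage.median_filter(S, size=(1, l_filt))            # Eq. (7)
    P = scipy.ndimage.median_filter(S, size=(l_filt, 1))            # Eq. (6)
    Hp, Pp = H ** p, P ** p
    mask_h = Hp / (Hp + Pp + 1e-12)                                 # Eq. (10)
    mask_p = Pp / (Hp + Pp + 1e-12)                                 # Eq. (11)
    _, x_harm = istft(S_hat * mask_h, fs, nperseg=n_fft, noverlap=n_fft - hop)  # Eq. (12)
    _, x_perc = istft(S_hat * mask_p, fs, nperseg=n_fft, noverlap=n_fft - hop)  # Eq. (13)
    return x_harm, x_perc
```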
In comparison to the iterative approach developed by Ono et al., which typically requires 30-50 iterations to converge, only two passes are required through the input spectrogram, one each for $H$ and $P$. This means that the median filter based algorithm is faster, which is of considerable benefit when used as preprocessing for other tasks. In tests, the proposed algorithm performs approximately twice as fast as that of Ono et al. with the number of iterations set to 30. This raises the possibility of performing real-time harmonic/percussive separation on stereo files, as the proposed algorithm can easily be extended to handle stereo signals.
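The stereo extension is not detailed here; one straightforward reading (our assumption) is to process each channel independently with the mono sketch above:

```python
import numpy as np

def hpss_stereo(x_stereo, fs, **kwargs):
    """Hypothetical stereo extension: run hpss_median() on each channel.

    x_stereo: array of shape (num_samples, num_channels).
    """
    results = [hpss_median(x_stereo[:, c], fs, **kwargs)
               for c in range(x_stereo.shape[1])]
    x_harm = np.stack([h for h, _ in results], axis=1)
    x_perc = np.stack([p for _, p in results], axis=1)
    return x_harm, x_perc
```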
3. SEPARATION AND REMIXING EXAMPLES
We now present examples of the use of the median filtering harmonic/percussive separation algorithm. Figure 4 shows an excerpt from "Animal" by Def Leppard, as well as the separated harmonic and percussive waveforms respectively, obtained using a median filter of length 17 for both harmonic and percussive filters, as well as using a soft mask with $p = 2$. It can be seen that the recovered harmonic waveform contains little or no evidence of percussion events, while the percussive waveform contains little or no evidence of the harmonic instruments. On listening to the waveforms, some traces of the drums can be heard in the harmonic waveform, though at a very reduced level, while the attack portion of some of the instruments such as guitar has been captured by the percussive waveform, as well as traces of some guitar parts where the pitch is changing constantly. This is to be expected, as the attacks of many instruments such as guitar and piano can be considered percussive in nature, and as the algorithm assumes that the pitched instruments are stationary in pitch. This also occurs in other algorithms for separating harmonic and percussive components.
Also shown in Figure 4 are remixed versions of the original signal; the first has the percussion components reduced by 6dB, while the second has the harmonic components reduced by 6dB. On listening to these waveforms, there are no noticeable artifacts in the resynthesis, while the reduction in amplitude of the respective sources can clearly be heard. This demonstrates that the algorithm is capable of generating audio which can be used for high-quality remixing of the separated harmonic and percussive sources.
Figure 5 shows an excerpt from "Billie Jean" by Michael Jackson, the separated harmonic and percussive waveforms, and remixed versions with the percussion reduced by 6dB and the harmonics reduced by 6dB respectively. Again, the algorithm can be seen to have separated the harmonic and percussive parts well. On listening, the attack of the bass has been captured by the percussive part, and a small amount of drum noise can be heard in the harmonic waveform. In the remixed versions, no artifacts can be heard.
Both of the above examples were carried out using an FFT size of 4096, with a hopsize of 1024 and a sampling frequency of 44.1 kHz. Testing showed that better separation quality was achieved at larger FFT lengths. The median filter length was set to 17 for both the harmonic and percussive filters, and testing showed that once the median filter lengths were above 15 and below 30, the separation quality did not vary dramatically, with good separation achieved in most cases. Further, informal listening tests suggest that the quality of separation is comparable to that achieved by the algorithms proposed by Ono et al. The use of soft masking was found to result in fewer artifacts in the resynthesis, though at the expense of a slight increase in the amount of interference between the percussive and harmonic sources. In general, it was observed that setting $p = 2$ gave considerably better separation results than using $p = 1$.
Figure 4: Excerpt from "Animal" by Def Leppard. a) Original waveform, b) Separated harmonic waveform, c) Separated percussive waveform, d) Remix, percussion reduced by 6dB, e) Remix, harmonic components reduced by 6dB
Audio examples are available for download at http://eleceng.dit.ie/derryfitzgerald/index.php?uid=489&menu_id=42
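As a usage illustration of the sketches above (the file name is hypothetical, and the soundfile package for audio I/O is our assumption), the 6dB remixes described in this section amount to scaling one separated component by $10^{-6/20} \approx 0.5$ before summing:

```python
import soundfile as sf  # assumed available for audio I/O

x, fs = sf.read("song_excerpt.wav")   # hypothetical mono excerpt at 44.1 kHz
x_harm, x_perc = hpss_median(x, fs, n_fft=4096, hop=1024, l_filt=17, p=2.0)

gain = 10.0 ** (-6.0 / 20.0)               # -6 dB, approximately 0.501
remix_perc_down = x_harm + gain * x_perc   # percussion reduced by 6dB
remix_harm_down = gain * x_harm + x_perc   # harmonic components reduced by 6dB
sf.write("remix_perc_down.wav", remix_perc_down, fs)
sf.write("remix_harm_down.wav", remix_harm_down, fs)
```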
4. CONCLUSIONS
Having described a fast, effective method of harmonic/percussive separation developed by Ono et al. [6], which is based on the idea that percussion events can be regarded as vertical lines, and harmonic or stationary events as horizontal lines in a spectrogram, we then took advantage of this idea to develop a simpler, faster and more effective harmonic/percussive separation algorithm. This was based on the idea that harmonics can be regarded as outliers in a spectrum containing a mixture of percussion and pitched instruments, while percussive onsets can be regarded as outliers in a frequency slice containing a stable harmonic or stationary event. To remove these outliers, we then used median filtering, as median filtering is effective at removing outliers for the purposes of image denoising. The resulting harmonic-enhanced and percussion-enhanced spectrograms were then used to generate masks which were then applied to the original spectrogram to separate the harmonic and percussive components. Real-world separation and remixing examples using the algorithm were then discussed.

Future work will concentrate on developing a real-time implementation of the algorithm and on investigating the use of the algorithm as a preprocessor for other tasks such as key signature detection and chord detection, where suppression of percussion events is helpful in improving results. Further, the use of rank-order filters, where a percentile other than the 50th percentile used in median filtering is employed, will be investigated as a means of potentially improving the separation performance of the algorithm.
Figure 5: Excerpt from "Billie Jean" by Michael Jackson. a) Original waveform, b) Separated harmonic waveform, c) Separated percussive waveform, d) Remix, percussion reduced by 6dB, e) Remix, harmonic components reduced by 6dB
5. REFERENCES
[1] D. FitzGerald, E. Coyle, and M. Cranitch, "Using tensor factorisation models to separate drums from polyphonic music," in Proc. Digital Audio Effects (DAFx-09), Como, Italy, 2009.

[2] K. Yoshii, M. Goto, and H. Okuno, "Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 333–345, 2007.

[3] O. Gillet and G. Richard, "Transcription and separation of drum signals from polyphonic music," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 3, pp. 529–540, 2008.

[4] M. Helen and T. Virtanen, "Separation of drums from polyphonic music using non-negative matrix factorisation and support vector machine," in Proc. European Signal Processing Conference, Antalya, Turkey, 2005.

[5] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama, "Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram," in Proc. EUSIPCO 2008 European Signal Processing Conference, Aug. 2008.

[6] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama, "A real-time equalizer of harmonic and percussive components in music signals," in Proc. Ninth International Conference on Music Information Retrieval (ISMIR08), 2008, pp. 139–144.

[7] R. Jain, R. Kasturi, and B. Schunck, Machine Vision, McGraw-Hill, 1995.