
Hierarchical modeling using automated sub-clustering for sound event recognition



The automatic recognition of sound events allows for novel applications in areas such as security, mobile and multimedia. In this work we present a hierarchical hidden Markov model for sound event detection that automatically clusters the inherent structure of the events into sub-events. We evaluate our approach on an IEEE audio challenge dataset consisting of office sound events and provide a systematic comparison of the various building blocks of our approach to demonstrate the effectiveness of incorporating certain dependencies in the model. The hierarchical hidden Markov model achieves an average frame-based F-measure recognition performance of 45.5% on a test dataset that was used to evaluate challenge submissions. We also show how the hierarchical model can be used as a meta-classifier, although in the particular application this did not lead to an increase in performance on the test dataset.
2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY
Maria E. Niessen, Tim L. M. Van Kasteren, Andreas Merentitis
AGT International
Hilpertstr. 35, 64295 Darmstadt, Germany
Index Terms: sound event detection, hierarchical models
Sound event recognition allows a broad range of applications in ar-
eas such as security (e.g. gunshot detection), mobile (e.g. con-
text awareness), and multimedia (e.g. search engines). Recogni-
tion of events in real-world environments is challenging because
different instances of the same type of event can have strong varia-
tion (e.g. speech) and one event label can be characterized by sev-
eral sounds (e.g. a printer loading paper and then printing it). We
present a hierarchical model and a meta-classifier to reliably detect
sound events making use of the temporal dependencies within and
between events.
Many studies in sound event recognition rely on techniques that
have proven successful in automatic speech recognition and multi-
media content analysis, mostly Mel-frequency cepstral coefficients
(MFCC) for signal description and hidden Markov models (HMM)
to classify events into a set of predefined categories. For example,
Mesaros et al. [1] use this combination to detect a large selection
of events in a variety of indoor and outdoor environments. To deal
with the variation of environmental sounds, additional features can
be used, either standard ones such as spectral moments or features
constructed through a more elaborate feature selection process,
resulting in a higher recognition performance [2]. The variation and
compositionality of events in real-world environments have previously
been accounted for with a cascaded hierarchical model, in which the
output of one HMM is fed as input to another HMM, thereby processing
the data at different levels of abstraction [3].
In our hierarchical model we extract a range of standard audio
features from the sound and use these features in a hierarchical
hidden Markov model (HHMM) which automatically learns to cluster
the intrinsic representation of each sound event from the data. The
output of the hierarchical model is combined with a discriminative
Random Forest method using a meta-classifier to complement the
strengths of both classifiers into a single model. We evaluate our
approach on an IEEE audio challenge dataset consisting of office
sound events [4] and provide a systematic comparison of the various
building blocks of our approach to demonstrate the effectiveness of
incorporating certain dependencies in the model.

Temporal   Spectral   Auto-correlation   Multi-dimensional
STE        Flux       Flux               13 MFCCs
ZCR        Roll-off   Roll-off           12 LPCs
           Flatness   Flatness
Table 1: Audio features
The rest of this paper is organized as follows. Section 2 de-
scribes the features extracted from the audio data. Section 3 ex-
plains the HHMM and section 4 introduces the meta-classifier. In
section 5 we present our experiments and results, section 6 discusses
these results, and in section 7 we state our conclusions.
Sound events in an office environment have highly variable audio
characteristics. In addition to features that are successful in
speech and music processing, such as MFCCs (footnote 1) and
zero-crossing rate (ZCR), we implemented features such as short-term
energy (STE) that better describe impact sounds such as door knocking
and complex sounds such as the printer. Table 1 shows the list of
features that were used in our experiments. We use the same settings
for all features: the audio data is discretized into frames using
a window size of 80 ms with an overlap of 50% and a rectangular
window. For the two roll-off features we applied a threshold of 85%
of the maximum energy. The linear prediction coefficients (LPCs,
footnote 2) are calculated with the covariance method. All features
combined, we obtain a 35-dimensional feature vector for each frame.
Footnote 1: K. Wojcicki, HTK MFCC
Footnote 2: VOICEBOX, voice-
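The framing settings described above (80 ms rectangular window, 50% overlap) and the two temporal features can be illustrated with a minimal sketch. The function names are hypothetical, and the exact definitions of STE (here: mean squared amplitude) and ZCR (here: fraction of adjacent sample pairs that change sign) are assumptions, since the text does not spell them out.

```python
def frame_signal(samples, sample_rate, window_s=0.080, overlap=0.5):
    """Split samples into frames with a rectangular window (no tapering)."""
    frame_len = int(round(window_s * sample_rate))
    hop = int(round(frame_len * (1.0 - overlap)))  # 50% overlap -> hop of 40 ms
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def short_term_energy(frame):
    """Assumed definition: mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Assumed definition: fraction of adjacent sample pairs with a sign change."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

With a hypothetical sample rate of 1000 Hz, each frame holds 80 samples and consecutive frames start 40 samples apart.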
[Figure 1 diagram: for each time slice t-1, t, t+1, a sound event
node y_t, a sub-event node z_t, a finished-state node f_t, and an
observation node x_t.]
Figure 1: The graphical representation of a two-layer HHMM.
Shaded nodes represent observable variables, the white nodes
represent hidden states.
Sound events are highly correlated in the temporal domain. Instead
of classifying each frame independently, a model that takes into
account the temporal dependencies within and between sound events
can improve recognition performance [5]. We therefore apply a
two-layer hierarchical hidden Markov model (HHMM) (Fig. 1),
originally developed for human activity recognition [6], to classify
the audio features. The top layer (y_t) of the hierarchical state
representation holds the sound events we wish to recognize (e.g.
printing and phone ringing) and the bottom layer (z_t) holds the
intrinsic sub-events that an event consists of. For example, the
sound event of printing can be segmented into a sub-event for the
sound the printer makes when loading paper from a tray, one for the
actual printing sound, and one for the sound the printer makes while
feeding the printed paper to the output tray. A finished state
variable (f_t) is
used as a binary indicator to indicate the top layer has finished its
Although it is possible to train such a model using data which
is annotated with labels of both sound events and the sub-events,
we train the model using only labels for the top layer sound events.
There are two advantages to this approach: (i) annotating the data
becomes significantly less involved when only the sound events
have to be annotated and (ii) we do not force any structure upon
the model with respect to the sub-events, but rather let the model
find this structure in the data automatically. The automatic alloca-
tion of structure can be considered as a clustering task. The clusters
found in the data do not necessarily have to be meaningful clusters
that correspond to actual sub-events that are intuitive to humans.
We therefore refer to them as ‘sub-event clusters’ to indicate the
sub-events were found through clustering.
The observations are modeled using a multidimensional Gaussian
distribution, in which each sub-event cluster is associated with
a single Gaussian distribution
$p(\vec{x}_t \mid y_t = k, z_t = l) = \mathcal{N}(\vec{x}_t \mid \mu_{kl}, \Sigma_k)$.
Note that the covariance matrix $\Sigma_k$ only has a subscript $k$:
each sound event has its own covariance matrix, which is shared
among all sub-event clusters of that sound event. The ideal number
of sub-event clusters to use is determined through experimentation.
The model parameters are learned using the Expectation-Maximization
(EM) algorithm and novel sequences are inferred using the Viterbi
algorithm.
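As an illustration of the observation model and of Viterbi decoding, the sketch below scores frames with one Gaussian per (sound event, sub-event cluster) pair and decodes the most likely state sequence. It is not the authors' implementation: it assumes diagonal covariances for brevity, flattens the two layers into a single state set, omits the finished state variable, and takes the start and transition log-probabilities as given rather than learning them with EM. All function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (simplifying assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def emission_log_probs(frames, means, variances):
    """means[(k, l)]: mean of sub-event cluster l of sound event k.
    variances[k]: variance vector shared by all clusters of sound event k."""
    states = sorted(means)
    table = [[log_gauss_diag(x, means[s], variances[s[0]]) for s in states]
             for x in frames]
    return states, table


def viterbi(log_start, log_trans, log_emit):
    """Return the most likely state index sequence for the given log-probabilities."""
    n_states = len(log_start)
    score = [log_start[j] + log_emit[0][j] for j in range(n_states)]
    back = []  # back[t-1][j]: best predecessor of state j at time t
    for t in range(1, len(log_emit)):
        prev, score, pointers = score, [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + log_emit[t][j])
        back.append(pointers)
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Indexing the variances by sound event only mirrors the shared covariance $\Sigma_k$ in the text; the means are indexed by both event and cluster.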
In the context of supervised learning, algorithms are typically
searching through a hypothesis space to find a hypothesis that can be
used to make good predictions for the particular problem at hand.
Ensemble learning is a technique for combining numerous weak
learners in an attempt to produce a strong learner. An ensemble
also represents a single hypothesis in the solution space. However,
this hypothesis is not necessarily contained within the space of the
models used to construct the ensemble. Therefore ensembles typi-
cally have more flexibility in the functions they can represent, which
can result in a reduction of model bias [7]. Considering the typical
bias-variance decomposition and the bias-variance trade-off, an in-
crease in model complexity is often associated with an increase in
variance, since the more complex model is potentially more prone
to overfitting the training data. To minimize these effects we use
Bagging, which is short for Bootstrap Aggregating.
Bagging involves training each model in the ensemble using
a randomly selected subset of the training set (to promote model
diversity) and combining predictions with an equal weight voting
scheme (to avoid overfitting). A good basis for ensemble meth-
ods is a set of models that individually provide good predictions
while their diversity (i.e. disagreement) is high [8]. After some
experimentation with the provided datasets we have selected deci-
sion trees as the basic models of our ensemble classifier. The use
of Bagging in combination with decision trees and subset feature
selection gives an ensemble method known as Random Forest [9].
Random Forests do not use any temporal information in modeling
the data, which can limit their effectiveness on problems with strong
temporal dependencies such as sound event detection. Therefore, a
temporal probabilistic model such as the HHMM (as described in
the previous section) is combined with the Random Forest.
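The two ingredients of Bagging described above, bootstrap resampling of the training set and equal-weight voting, can be sketched as follows. This is a toy illustration only: a nearest-class-mean rule on a single scalar feature stands in for the decision trees, and the random feature subsets of a Random Forest are not shown. All names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_nearest_mean(sample):
    """Stand-in weak learner: classify a scalar by the nearest class mean."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def fit_bagged(data, n_models, seed=0):
    """Train each model on its own bootstrap sample (promotes diversity)."""
    rng = random.Random(seed)
    return [fit_nearest_mean(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_bagged(models, x):
    """Combine the individual predictions with an equal-weight majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each model sees a different resampling of the data, so individual models can disagree while the vote remains stable.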
The final decision is taken by a meta-classifier that uses the
predictions of the Random Forest and the HHMM as input. Since
audio data have a strong temporal character a model that is designed
to exploit this property is a good candidate for the meta-classifier.
In our case a second instance of the HHMM is used as a meta-
classifier. The output of this model (from here on referred to as
meta-HHMM) is used as the final classification result.
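One way to build the input of such a meta-classifier is to concatenate, per frame, the outputs of the two base models into a single observation vector. The encoding below (Random Forest class probabilities followed by a one-hot vector of the HHMM label) is an assumption made for illustration; the text does not specify how the predictions are represented. The function names are hypothetical.

```python
def one_hot(label, classes):
    """Encode a class label as a 0/1 vector over the given class list."""
    return [1.0 if label == c else 0.0 for c in classes]


def stack_predictions(rf_probs, hhmm_labels, classes):
    """Build one meta-observation per frame from both base models.

    rf_probs: per-frame class probabilities from the Random Forest.
    hhmm_labels: per-frame sound event labels decoded by the HHMM.
    """
    if len(rf_probs) != len(hhmm_labels):
        raise ValueError("both models must cover the same frames")
    return [list(probs) + one_hot(label, classes)
            for probs, label in zip(rf_probs, hhmm_labels)]
```

The resulting sequence of vectors would then be modeled by the second, temporal model in place of the raw audio features.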
We use the dataset that was created for the IEEE challenge
“Detection and classification of acoustic scenes and events” [4].
For the event detection challenge the organizers provided around 20
training examples of 16 different classes that can occur in an office
environment, such as printer, speech, and phone. A development set
was provided for evaluation prior to submission and a separate test
dataset was used by the organizers to evaluate submissions. The
test set is not available to us at the time of writing this paper, but we
did receive the performance measures on the test set of the models
we submitted to the challenge (see Table 2). Both the test set and
the development set consist of multiple recordings, each roughly
1 min long, recorded in various office environments (‘Office Live’)
and annotated by two people (‘bdm’ and ‘sid’) (footnote 3). The
results for the HHMM were obtained using a single sub-event cluster.
In the rest of this section we discuss the results we received on the
test dataset and explain the performance of our models in further
detail by presenting additional experiments on the development set.
Footnote 3: The development set consists of three recordings,
referred to as ‘script01’, ‘script02’ and ‘script03’.
Table 2 shows the results of the meta-HHMM classifier and the
HHMM on the test dataset. The meta-HHMM classifier makes as-
sumptions about the event distribution, based on the distribution of
events in the development set. Our hypothesis was that the distribu-
tion of events in the development set would be predictive of the dis-
tribution in the test set and hence result in an improved performance
of the meta-classifier over the HHMM. However, the HHMM out-
performs the meta-HHMM classifier in terms of F-measure on all
three evaluation methods (i.e. event-based, class-wise event-based,
and frame-based), because the distribution of events in the test data
set differs significantly from the distribution in the development
data set.
             F-Measure Performance
Model        Event-based   Class-wise   Frame-based
HHMM         34.5          33.5         45.5
Meta-HHMM    32.6          29.4         40.9
Table 2: F-measure performance of the meta-HHMM classifier and
the HHMM on the test dataset. The results were provided by the
organizers of the contest.
The recognition performance of the HHMM on the development set
is shown in Table 3. The first four event-based measures in the
table are calculated using the onset only: an event is considered
correctly detected if its onset falls within a 100 ms tolerance
window of the ground-truth onset. The offset measures in the table
are calculated using both onset and offset: an event is correctly
detected if its onset is within a 100 ms tolerance window and its
offset is within a 50% range of the ground-truth event's offset with
respect to the duration of the event [4]. We see a slight drop in
performance for the offset measures due to the additional
constraint. Precision is higher than recall for all three
evaluation methods. In other words, our model is more likely to
miss (part of) a sound event than to report a sound event when
no event was taking place. This can be seen in Figure 2, where the
results of the HHMM are visualized together with the ground truth
for two of the development files. Further inspection of the results
reveals that the ‘switch’ event is never recognized and that the
‘pen drop’ and ‘knock’ events are difficult classes to detect. Other
misclassifications are more ‘acceptable’, because the confused
classes are very similar, such as speech, laughter, cough, and
clear throat.
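The onset-only and onset-plus-offset criteria can be sketched as follows. This is a simplified illustration, not the official challenge scoring code: events are assumed to be (label, onset, offset) tuples in seconds, the 50% offset rule is interpreted as a tolerance of half the reference event's duration, and predicted events are matched greedily to unused reference events. All names are hypothetical.

```python
def precision_recall_f(n_correct, n_predicted, n_reference):
    """Precision, recall and F-measure from raw counts."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def onset_match(pred, ref, tolerance=0.100):
    """Same label and onset within the tolerance window (100 ms)."""
    return pred[0] == ref[0] and abs(pred[1] - ref[1]) <= tolerance


def onset_offset_match(pred, ref, tolerance=0.100, offset_fraction=0.5):
    """Onset criterion plus offset within a fraction of the reference duration."""
    duration = ref[2] - ref[1]
    return (onset_match(pred, ref, tolerance)
            and abs(pred[2] - ref[2]) <= offset_fraction * duration)


def count_matches(predicted, reference, match):
    """Greedy one-to-one matching (an assumption) of predictions to references."""
    unused = list(reference)
    correct = 0
    for pred in predicted:
        for ref in unused:
            if match(pred, ref):
                unused.remove(ref)
                correct += 1
                break
    return correct
```

Because the reported figures are averaged over files, an F-measure computed from averaged precision and recall will generally differ slightly from the average of per-file F-measures.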
                 Evaluation method
Metrics          Event-based   Class-wise   Frame-based
Recall (R)       40.3          40.3         43.2
Precision (P)    48.2          44.5         67.1
F-Meas.          43.5          39.1         52.1
AEER             1.43          1.34         0.96
Off. R           34.7          34.7
Off. P           41.6          39.0
Off. F-Meas.     37.5          33.7
Off. AEER        1.60          1.51
Table 3: Results of the HHMM with one sub-event cluster per sound event
Our HHMM can be considered as a combination of building
blocks that correspond to commonly used machine learning models.
Table 4 shows the recognition performance of these individual
building blocks in comparison with the HHMM. The observation
model corresponds closely to a Gaussian Mixture Model (GMM). It
performs the worst because it does not take into account any
temporal information and treats the data as independent and
identically distributed (IID). By adding temporal dependencies to
the model we obtain a hidden Markov model (HMM), which results in a
significant increase in performance. Finally, we obtain our proposed
HHMM by adding the intermediate layer and the finished state
variable, resulting in a further significant increase in
performance. The table also lists the performance of the meta-HHMM
on the development set and an upper bound, which is the maximum
annotator agreement between the two annotations that were provided.
The meta-HHMM outperforms the HHMM in this case, but data from the
development set was used to estimate the prior distribution of
events, making this a biased result for the meta-HHMM.
              F-Measure Performance
Model         Event-based   Class-wise   Frame-based
GMM           8.5           20.3         26.8
HMM           24.4          29.4         43.7
HHMM          43.5          39.1         52.1
Meta-HHMM*    47.5          41.3         54.4
Upper bound   87.1          88.2         92.1
Table 4: F-measure performance of the GMM, HMM, HHMM and
the meta-HHMM on the development set (averaged over the three
files). The upper bound is the maximum annotator agreement. (*:
the meta-HHMM uses a prior distribution of the events, which was
estimated on the development set, making this a biased result).
Standard feature ranking methods indicate that the lower MFCCs,
lower LPCs, and spectral moments are the most predictive features,
either because they are very predictive for one class in particular
(e.g. first MFCC for ‘cough’) or because they have relatively small
within-class variation for some classes (e.g. spectral roll-off). The
combination of all features gives a substantially higher performance
than MFCCs alone. In future work we plan to investigate additional
time-frequency domain features because they have been shown to
improve performance compared to only using frequency domain
features such as MFCCs [2]. Moreover, we will adopt a feature
selection process to select the most predictive features.
The results are obtained using a single sub-event cluster per
sound event. We experimented with various numbers of sub-event
clusters, but obtained the best performance when using a single
sub-event cluster. This is surprising since with a single sub-event
cluster the model does not benefit from modeling transitions between
sub-events and the resulting model corresponds closely to a standard
hidden Markov model (HMM). The difference between the hierarchical
model using a single sub-event cluster and the HMM is that
the hierarchical model explicitly models the finishing of a state.
A limitation of our current HHMM implementation is that the
same number of sub-event clusters is used for all sound events.
[Figure 2 plot: result and annotation bars over time for panels
(a) script01 and (b) script03; horizontal axis: Time (s).]
Figure 2: The top bar (result) shows the classification result of the HHMM for the first 80 seconds of (a) script01 with relatively high
recognition performance and (b) script03 with lower recognition performance. The bottom bar (ann) is the annotation of ‘sid’.
Close inspection of the data reveals that there are a number of
relatively simple sound events (e.g. cough) that do not benefit from
a richer hierarchical structure. Using additional sub-event clusters
results in an increase in the number of parameters that need to be
estimated. The overall recognition performance decreases because
the increased modeling complexity does not benefit the majority of
the classes, while the increase in the number of parameters results
in a poorer estimation of the parameter values. Ideally, we would
use a different number of sub-event clusters for each sound event,
to be able to use the optimal number of clusters for each event.
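To give a rough sense of how the number of parameters grows with the number of sub-event clusters, the sketch below counts only the parameters of the Gaussian observation model, using the 16 classes and 35 features reported earlier. It assumes one mean vector per sub-event cluster and one full symmetric covariance matrix per sound event; whether the covariances are full or diagonal is not stated in the text, and transition and finished-state parameters are left out. The function name is hypothetical.

```python
def emission_parameter_count(n_events, n_clusters, n_features, full_covariance=True):
    """Count Gaussian observation parameters under the stated assumptions."""
    # One mean vector per (sound event, sub-event cluster) pair.
    means = n_events * n_clusters * n_features
    # One covariance per sound event, shared by all of its clusters.
    if full_covariance:
        per_event_cov = n_features * (n_features + 1) // 2  # symmetric matrix
    else:
        per_event_cov = n_features  # diagonal only
    return means + n_events * per_event_cov
```

Under these assumptions each additional sub-event cluster adds 16 * 35 = 560 mean parameters, on top of whatever extra transition parameters the hierarchy requires.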
Meta-classifiers can be very effective in combining the
strengths of multiple classifiers and often improve on a simple
voting scheme. We saw that the meta-HHMM outperformed the
HHMM on the development set, but underperformed on the test set.
Because the meta-HHMM used data from the development set in
its training phase (to estimate the prior probabilities of events), the
result on the development set cannot be regarded as a proper
evaluation. The results on the test set show that the meta-HHMM
underperforms the HHMM, indicating that the model does not
generalize well. However, in applications where the distribution of
events can be expected to be relatively fixed, the model is expected
to outperform the HHMM. Further experiments on additional datasets
are needed to verify this expectation.
We presented a two-layer hierarchical hidden Markov model for
recognizing sound events in which one of the layers corresponds to
sound events and one layer corresponds to sub-event clusters. The
results of the hierarchical model were used in a meta-classifier in
combination with a random forest in an attempt to boost the classi-
fication results further. Our experimental results show that our as-
sumptions about the data distribution were too strong and therefore
resulted in the meta-classifier overfitting on the data, rather than an
increase in performance. Instead, the stand-alone use of the hierar-
chical model using a single sub-event cluster per sound event gave
the best performance. With a systematic comparison we show how
modeling temporal dependencies and hierarchical structure leads to
a significant increase in recognition performance. Future work will
focus on creating a hierarchical model in which a different number
of sub-event clusters per sound event can be used.
We are grateful to Christian Debes, Roel Heremans, and James Rex
for their contribution.
[1] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic
event detection in real life recordings,” in 18th European Signal
Processing Conference (EUSIPCO-2010), 2010, pp. 1267–
[2] S. Chu, S. Narayanan, and C.-C. Kuo, “Environmental sound
recognition with time-frequency audio features,” IEEE Transactions
on Audio, Speech, and Language Processing, vol. 17,
no. 6, pp. 1142–1158, 2009.
[3] N. Oliver, E. Horvitz, and A. Garg, “Layered representations
for human activity recognition,” in Fourth IEEE International
Conference on Multimodal Interfaces (ICMI), 2002, pp. 3–8.
[4] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol,
M. Lagrange, and M. Plumbley, “Detection and classification of
acoustic scenes and events,” School of Electronic Engineering
and Computer Science, Queen Mary University of London,
Tech. Rep. EECSRR-13-01, 2013.
[5] J. D. Krijnders, M. E. Niessen, and T. C. Andringa, “Sound
event recognition through expectancy-based evaluation of
signal-driven hypotheses,” Pattern Recognition Letters, vol. 31,
no. 12, pp. 1552–1559, 2010.
[6] T. L. M. Van Kasteren, G. Englebienne, and B. J. A. Kröse,
“Hierarchical activity recognition using automatically clustered
actions,” in Proceedings of the Second International Conference
on Ambient Intelligence, 2011, pp. 82–91.
[7] D. Wolpert, “Stacked generalization,” Neural Networks, vol. 5,
no. 2, pp. 241–259, 1992.
[8] L. Kuncheva and C. Whitaker, “Measures of diversity in
classifier ensembles,” Machine Learning, vol. 51, pp. 181–207, 2003.
[9] L. Breiman, “Random forests,” Machine Learning, vol. 45,
no. 1, pp. 5–32, 2001.
... Audio features are extracted from audio files, e.g., Mel-Frequency Cepstral Coefficient (MFCC), and then each segment of audio features are classified as one of the pre-defined classes of acoustic events by the pre-trained classifiers. The common classifiers include Convolutional Neural Network (CNN) [12,42], Deep Neural Network (DNN) [6,22], regression forests [26], Support Vector Machine (SVM) [9, 20-22, 28, 38], Hidden Markov Model (HMM) [25,28], Gaussian Mixture Model (GMM) [45], and discriminative binary classifiers [10]. In contrast, the unsupervised method consists of the following steps. ...
... In the study of Portelo et al. [28], they detected acoustic events using SVM-based and HMM-based classifiers with several features, and obtained promising results despite difficulties posed by mixtures of acoustic events. Niessen et al. [25] proposed a hierarchical HMM with two layers for detecting acoustic events. Zhang et al. [45] classified acoustic events by using features of tensor-based sparse approximation. ...
Full-text available
In this study, we propose a method for acoustic event diarization based on a feature of deep embedding and a clustering algorithm of integer linear programming. The deep embedding learned by deep auto-encoder network is used to represent the properties of different classes of acoustic events, and then the integer linear programming is adopted for merging audio segments belonging to the same class of acoustic events. Four kinds of TV/movie audios (21.5 h in total) are used as experimental data, including Sport, Situation comedy, Award ceremony, and Action movie. We compare the deep embedding with state-of-the-art features. Further, the clustering algorithm of integer linear programming is compared with other clustering algorithms adopted in previous works. Finally, the proposed method is compared to both supervised and unsupervised methods on four kinds of TV/movie audios. The results show that the proposed method is superior to other unsupervised methods based on agglomerative information bottleneck, Bayesian information criterion and spectral clustering, and is little inferior to the supervised method based on deep neural network in terms of acoustic event error.
... Early research in the field of SED used traditional shallow learning model approaches, such as Gaussian mixture models [7], hidden Markov models [8], and random regression forests [9]. Approaches based on support vector machines [10][11][12] and non-negative matrix factorization [13][14][15] were also proposed. ...
Full-text available
Sound event detection (SED) recognizes the corresponding sound event of an incoming signal and estimates its temporal boundary. Although SED has been recently developed and used in various fields, achieving noise-robust SED in a real environment is typically challenging owing to the performance degradation due to ambient noise. In this paper, we propose combining a pretrained time-domain speech-separation-based noise suppression network (NS) and a pretrained classification network to improve the SED performance in real noisy environments. We use group communication with a context codec method (GC3)-equipped temporal convolutional network (TCN) for the noise suppression model and a convolutional recurrent neural network for the SED model. The former significantly reduce the model complexity while maintaining the same TCN module and performance as a fully convolutional time-domain audio separation network (Conv-TasNet). We also do not update the weights of some layers (i.e., freeze) in the joint fine-tuning process and add an attention module in the SED model to further improve the performance and prevent overfitting. We evaluate our proposed method using both simulation and real recorded datasets. The experimental results show that our method improves the classification performance in a noisy environment under various signal-to-noise-ratio conditions.
... Time-domain features are commonly used in the detection and evaluation applications along with statistical features (Genuit, 2004;Lee et al., 2005;Lee et al., 2015;Nopiah et al., 2015). However, time-frequency domain features, which include wavelet and cepstral features, are more effective in the classification and recognition of acoustic patterns (Mitrović et al., 2010;Niessen et al., 2013;Schr€ oder et al., 2015;Zhang et al., 2019;Pogorilyi et al., 2020a). The cepstral domain features are obtained by taking the fast Fourier transform (FFT) of the logarithm of the amplitude from the spectrum data (Mitrović et al., 2010). ...
Full-text available
Fault identification using the emitted mechanical noise is becoming an attractive field of research in a variety of industries. It is essential to rank acoustic feature integration functions on their efficiency to classify different types of sound for conducting a fault diagnosis. The Mel frequency cepstral coefficient (MFCC) method was used to obtain various acoustic feature sets in the current study. MFCCs represent the audio signal power spectrum and capture the timbral information of sounds. The objective of this study is to introduce a method for the selection of statistical indicators to integrate the MFCC feature sets. Two purpose-built audio datasets for squeak and rattle were created for the study. Data were collected experimentally to investigate the feature sets of 256 recordings from 8 different rattle classes and 144 recordings from 12 different squeak classes. The support vector machine method was used to evaluate the classifier accuracy with individual feature sets. The outcome of this study shows the best performing statistical feature sets for the squeak and rattle audio datasets. The method discussed in this pilot study is to be adapted to the development of a vehicle faulty sound recognition algorithm.
... SED can be applied to many areas related to machine listening, such as traffic monitoring [1], smart meeting room [2], automatic assistance driving [3], and multimedia analysis [4]. The popular classifiers for SED include deep models, such as CRNNs [5,6], recurrent neural networks (RNNs) [7,8], convolutional neural networks (CNNs) [9][10]; and traditional shallow models, such as random regression forests [11], support vector machines [12][13][14], hidden Markov models [15], and Gaussian mixture models [16]. ...
Full-text available
Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.
... Recent works on AED commonly consider the acoustic event detection as a classification problem where each frame may correspond to more than one event class. The extracted audio representations [10] [11] are modeled by statistical models [6] [12]- [18]. With the fast development and superior advances of neural network based approaches, AED systems have been widely constructed using neural network techniques, especially in recent Detection and Classification of Acoustic Scenes and Events (DCASE) campaigns [8] [9]. ...
Acoustic event detection aims at processing the acoustic signal to determine the event type and to estimate the start and the end times of the event. Multi-label classification based approaches are commonly used to detect the frame wise event types followed by a median filter to determine the happening times. However, the multi-label classifiers are trained only on the acoustic event types ignoring the frame position within the audio events. To deal with this, this paper proposes to construct a joint learning based multi-task system. The first task performs the acoustic event type detection and the second task is to predict the frame position information. By sharing representations between the two tasks, we can enable the acoustic models to generalize better than the original classifier by averaging respective noise patterns to be implicitly regularized. Experimental results on the monophonic UPC-TALP and the polyphonic TUT Sound Event datasets demonstrate the superior performance of the joint learning method by achieving lower error rate and higher F-score compared to the baseline AED system.
Sound event detection (SED) has been widely applied in real-world applications. Convolutional recurrent neural network based SED approaches have achieved state-of-the-art performance. However, the convolution process is typically performed by using a fixed-size kernel, which adversely affects the detection accuracy, especially when the acoustic features of different event classes are characterized by high variations. To deal with this, this paper proposes a sound event detection technique using a convolutional recurrent neural network framework with multiple convolutional kernels of different sizes. The top-performing kernels are selected from a kernel pool based on the unsupervised clustering errors and the accuracies of the temporarily trained models. Afterwards, the selected kernels are fed to the multiple convolution layers to deal with the acoustic feature variations. Experimental results on different subsets of AudioSet, namely the DCASE Challenge 2017 Task 4 and DCASE Challenge 2018 Task 4, demonstrate the performance of the proposed approach compared to the state-of-the-art systems.
With the advance of automobile automation, autonomous driving has become a research hotspot worldwide. The classification of automobile driving conditions is a key technology for autonomous driving. To simplify the recognition of driving conditions, they are usually divided into three categories (driving, braking, and crashing), which are distinguished by a classifier. In this paper, 988-dimensional audio features were extracted with openSMILE, and a classification method for automobile driving conditions based on a limited Boltzmann machine was proposed, which reached a recognition rate of 93.7%.
Acoustic event detection aims to perceive the surrounding sounds and is commonly performed by multi-label classification based approaches. The concatenated acoustic features of consecutive frames and the hard boundary labels are adopted as the input and output, respectively. However, the different input frames are treated equally and the hard boundary based outputs are error-prone. To deal with these issues, this paper proposes to utilize sequential attention together with soft boundary information. Experimental results on the latest TUT Sound Event database demonstrate the superior performance of the proposed technique.
This paper describes a newly-launched public evaluation challenge on acoustic scene classification and detection of sound events within a scene. Systems dealing with such tasks are far from exhibiting human-like performance and robustness. Undermining factors are numerous: the extreme variability of sources of interest possibly interfering, the presence of complex background noise as well as room effects like reverberation. The proposed challenge is an attempt to help the research community move forward in defining and studying the aforementioned tasks. Apart from the challenge description, this paper provides an overview of systems submitted to the challenge as well as a detailed evaluation of the results achieved by those systems.
This paper presents a system for acoustic event detection in recordings from real life environments. The events are modeled using a network of hidden Markov models; their size and topology is chosen based on a study of isolated events recognition. We also studied the effect of ambient background noise on event classification performance. On real life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84% of the time.
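Recognition and temporal positioning with hidden Markov models is commonly done by decoding the most likely state sequence. The sketch below shows standard Viterbi decoding for a discrete-observation HMM; it is an illustrative assumption, not the system described above, and the function name `viterbi` and the toy parameters are hypothetical.

```python
import math


def viterbi(obs, start, trans, emit):
    """Most likely state path for a discrete HMM, computed in the log domain.

    start[s], trans[r][s], and emit[s][o] are probabilities; this sketch
    assumes all of them are nonzero so that math.log is defined.
    """
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []
    for o in obs[1:]:
        prev = score
        # Best predecessor for each state s at this time step.
        ptr = [max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
               for s in range(n)]
        score = [prev[ptr[s]] + math.log(trans[ptr[s]][s]) + math.log(emit[s][o])
                 for s in range(n)]
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy example: state 0 = background, state 1 = event (hypothetical values).
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi([0, 0, 1, 1], start, trans, emit))
```

The positions where the decoded state changes mark the estimated event boundaries.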
The paper considers the task of recognizing environmental sounds for the understanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs), which describe the audio spectral shape. Environmental sounds, such as chirpings of insects and sounds of rain, which are typically noise-like with a broad flat spectrum, may include strong temporal domain signatures. However, only a few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose to use the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive and physically interpretable set of features. The MP-based feature is adopted to supplement the MFCC features to yield higher recognition accuracy for environmental sounds. Extensive experiments are conducted to demonstrate the effectiveness of these joint features for unstructured environmental sound classification, including listening tests to study human recognition capabilities. Our recognition system has been shown to produce performance comparable to that of human listeners.
This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation's crude winner-takes-all for combining the individual generalizers. When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
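The core idea (level-0 generalizers guess held-out parts of the learning set, and a level-1 generalizer learns from those guesses and the correct answers) can be sketched as follows. This is a minimal sketch under stated assumptions: the two toy level-0 learners, the interleaved fold split, and the single blend weight used as the level-1 model are illustrative choices, not taken from the paper.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Level-0 learner: always predicts the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_nearest(xs, ys):
    """Level-0 learner: predicts the target of the nearest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k):
    """Guess each point with a model trained on the other k-1 folds."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_blend_weight(p1, p2, ys):
    """Level-1 learner: least-squares w for y ~ w*p1 + (1 - w)*p2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5


xs = list(range(10))
ys = [2.0 * x for x in xs]
p_mean = out_of_fold(fit_mean, xs, ys, k=2)
p_near = out_of_fold(fit_nearest, xs, ys, k=2)
w = fit_blend_weight(p_mean, p_near, ys)
print(w)
```

To predict a new input, both level-0 learners would be refit on the full learning set and their outputs combined with the learned weight.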
The automatic recognition of human activities such as cooking, showering and sleeping allows many potential applications in the area of ambient intelligence. In this paper we show that using a hierarchical structure to model the activities from sensor data can be very beneficial for the recognition performance of the model. We present a two-layer hierarchical model in which activities consist of a sequence of actions. During training, sensor data is automatically clustered into clusters of actions that best fit to the data, so that sensor data only has to be labeled with activities, not actions. Our proposed model is evaluated on three real world datasets and compared to two non-hierarchical temporal probabilistic models. The hierarchical model outperforms the non-hierarchical models in all datasets and does so significantly in two of the three datasets.
Diversity among the members of a team of classifiers is deemed to be a key issue in classifier combination. However, measuring diversity is not straightforward because there is no generally accepted formal definition. We have found and studied ten statistics which can measure diversity among binary classifier outputs (correct or incorrect vote for the class label): four averaged pairwise measures (the Q statistic, the correlation, the disagreement and the double fault) and six non-pairwise measures (the entropy of the votes, the difficulty index, the Kohavi-Wolpert variance, the interrater agreement, the generalized diversity, and the coincident failure diversity). Four experiments have been designed to examine the relationship between the accuracy of the team and the measures of diversity, and among the measures themselves. Although there are proven connections between diversity and accuracy in some special cases, our results raise some doubts about the usefulness of diversity measures in building classifier ensembles in real-life pattern recognition problems.
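Three of the pairwise measures named above can be computed directly from the correct/incorrect votes of two classifiers. The sketch below uses the usual definitions based on the counts N11 (both correct), N00 (both wrong), N10 and N01 (exactly one correct); the function name `pairwise_diversity` is an assumption for illustration.

```python
def pairwise_diversity(a, b):
    """Pairwise diversity measures for two binary correctness vectors.

    a[i] and b[i] are 1 if the respective classifier labels sample i
    correctly and 0 otherwise.
    """
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n = n11 + n00 + n10 + n01
    denom = n11 * n00 + n01 * n10
    return {
        # Q is undefined when the denominator is zero; reported as None.
        "q_statistic": (n11 * n00 - n01 * n10) / denom if denom else None,
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
    }


print(pairwise_diversity([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a team of more than two classifiers, these values are averaged over all classifier pairs.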
We present the use of layered probabilistic representations using Hidden Markov Models for performing sensing, learning, and inference at multiple levels of temporal granularity. We describe the use of the representation in a system that diagnoses states of a user's activity based on real-time streams of evidence from video, acoustic, and computer interactions. We review the representation, present an implementation, and report on experiments with the layered representation in an office-awareness application.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
A recognition system for environmental sounds is presented. Signal-driven classification is performed by applying machine-learning techniques on features extracted from a cochleogram. These possibly unreliable classifications are improved by creating expectancies of sound events based on context information.
A team of classifiers (committee of learners) can be more accurate than the best member of the team. Theoretically, if the classifiers make independent errors, the majority vote outperforms the best classifier. However, if the classifiers are dependent, the team might be either better or worse. While there are many measures for dependency between two variables (here the classifier outputs), measuring the dependency or diversity of many variables is not straightforward. Here we study eight measures of dependency between a team of classifiers divided in two groups: four averaged pairwise measures (Q statistic, the correlation, the disagreement and the double fault) and four non-pairwise measures (the entropy of the votes, the "difficulty" index, Kohavi-Wolpert variance, and the interrater agreement). To study empirically the relationship between the majority vote accuracy (Pmaj) and the diversity measures, binary classifier outputs (correct/incorrect votes) have been generated trying to sampl...
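For the independent-error case, the majority vote accuracy follows from the binomial distribution. The sketch below assumes an odd number of classifiers that all share the same individual accuracy p; the function name `majority_vote_accuracy` is hypothetical and the calculation is an illustration, not the experimental procedure of the paper.

```python
from math import comb


def majority_vote_accuracy(p, n_classifiers):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, vote for the correct label."""
    need = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, m) * p ** m * (1 - p) ** (n_classifiers - m)
        for m in range(need, n_classifiers + 1)
    )


# With p > 0.5 the team beats its members; with p < 0.5 it does worse.
print(majority_vote_accuracy(0.6, 3))
print(majority_vote_accuracy(0.4, 3))
```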