2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY
HIERARCHICAL MODELING USING AUTOMATED SUB-CLUSTERING
FOR SOUND EVENT RECOGNITION
Maria E. Niessen, Tim L. M. Van Kasteren, Andreas Merentitis
AGT International
Hilpertstr. 35, 64295 Darmstadt, Germany
{mniessen,tkasteren,amerentitis}@agtinternational.com
ABSTRACT
The automatic recognition of sound events allows for novel appli-
cations in areas such as security, mobile and multimedia. In this
work we present a hierarchical hidden Markov model for sound
event detection that automatically clusters the inherent structure
of the events into sub-events. We evaluate our approach on an
IEEE audio challenge dataset consisting of office sound events and
provide a systematic comparison of the various building blocks of
our approach to demonstrate the effectiveness of incorporating cer-
tain dependencies in the model. The hierarchical hidden Markov
model achieves an average frame-based F-measure recognition per-
formance of 45.5% on a test dataset that was used to evaluate chal-
lenge submissions. We also show how the hierarchical model can
be used as a meta-classifier, although in the particular application
this did not lead to an increase in performance on the test dataset.
Index Terms—sound event detection, hierarchical models,
meta-classifier
1. INTRODUCTION
Sound event recognition enables a broad range of applications in ar-
eas such as security (e.g. gunshot detection), mobile (e.g. con-
text awareness), and multimedia (e.g. search engines). Recogni-
tion of events in real-world environments is challenging because
different instances of the same type of event can have strong varia-
tion (e.g. speech) and one event label can be characterized by sev-
eral sounds (e.g. a printer loading paper and then printing it). We
present a hierarchical model and a meta-classifier to reliably detect
sound events making use of the temporal dependencies within and
between events.
Many studies in sound event recognition rely on techniques that
have proven successful in automatic speech recognition and multi-
media content analysis, mostly Mel-frequency cepstral coefficients
(MFCC) for signal description and hidden Markov models (HMM)
to classify events into a set of predefined categories. For exam-
ple, Mesaros et al. [1] use this combination to detect a large selec-
tion of events in a variety of indoor and outdoor environments. To
deal with the variation of environmental sounds, additional features
can be used, either standard ones such as spectral moments or features
constructed through a more elaborate feature selection process, resulting
in higher recognition performance [2]. The variation and compositionality
of events in real-world environments has previously been accounted for
with a cascaded hierarchical model, in which the output of one HMM is
fed as input to another HMM, thereby processing the data at different
levels of abstraction [3].
Temporal    Spectral      Auto-correlation   Multi-dimensional
STE         Flux          Flux               13 MFCCs
ZCR         Roll-off      Roll-off           12 LPCs
            Flatness      Flatness
            Brightness
            Centroid

Table 1: Audio features

In our hierarchical model we extract a range of standard audio
features from the sound and use these features in a hierarchical hidden
Markov model (HHMM) which automatically learns to cluster
the intrinsic representation of each sound event from the data. The
output of the hierarchical model is combined with a discriminative
Random Forest method using a meta-classifier that combines the
strengths of both classifiers into a single model. We evaluate our
approach on an IEEE audio challenge dataset consisting of office
sound events [4] and provide a systematic comparison of the various
building blocks of our approach to demonstrate the effectiveness of
incorporating certain dependencies in the model.
The rest of this paper is organized as follows. Section 2 de-
scribes the features extracted from the audio data. Section 3 ex-
plains the HHMM and section 4 introduces the meta-classifier. In
section 5 we present our experiments and results, section 6 discusses
these results, and in section 7 we state our conclusions.
2. FEATURE EXTRACTION
Sound events in an office environment have highly variable au-
dio characteristics. In addition to features that are successful in
speech and music processing, such as MFCCs¹ and zero-crossing
rate (ZCR), we implemented features such as short-term energy
(STE) that better describe impact sounds such as door knocking
and complex sounds such as the printer. Table 1 shows the list of
features that were used in our experiments. We use the same set-
tings for all features: the audio data is segmented into frames using
a window size of 80 ms with an overlap of 50% and a rectangular
window. For the two roll-off features we applied a threshold of 85%
of the maximum energy. The linear prediction coefficients² (LPCs)
are calculated with the covariance method. All features combined,
we obtain a 35-dimensional feature vector for each frame.
¹K. Wojcicki, HTK MFCC
²VOICEBOX, http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
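To make these settings concrete, the following minimal sketch (Python with NumPy; not the authors' code, and the sampling rate and signal below are placeholders) shows how two of the listed features, STE and ZCR, can be computed per frame with 80 ms rectangular windows and 50% overlap. The MFCCs and LPCs rely on the toolkits referenced in the footnotes and are not reproduced here.

import numpy as np

def frame_signal(x, fs, win_ms=80, overlap=0.5):
    # Rectangular windowing: 80 ms frames with 50% overlap, as used in the paper.
    win = int(fs * win_ms / 1000)
    hop = int(win * (1 - overlap))
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop:i * hop + win] for i in range(n_frames)])

def short_term_energy(frames):
    # STE: mean squared amplitude per frame.
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # ZCR: fraction of adjacent sample pairs whose sign changes within the frame.
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Placeholder signal: one second of noise standing in for office audio.
fs = 44100
x = np.random.randn(fs)
frames = frame_signal(x, fs)
features = np.column_stack([short_term_energy(frames), zero_crossing_rate(frames)])
print(features.shape)   # (number of frames, 2); the full feature vector has 35 dimensions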
Figure 1: The graphical representation of a two-layer HHMM, with layers for the acoustic events (y_t), the sub-event clusters (z_t), the finished state variable (f_t), and the observations. Shaded nodes represent observable variables, the white nodes represent hidden states.
3. HIERARCHICAL HMM
Sound events are highly correlated in the temporal domain. Instead
of classifying each frame independently, a model that takes into ac-
count the temporal dependencies within and between sound events
can improve recognition performance [5]. Therefore we apply a
two-layer hierarchical hidden Markov model (HHMM) (Fig. 1) that
has been developed for human activity recognition to classify the
audio features [6]. The top layer (y_t) of the hierarchical state representation
corresponds to the sound events we wish to recognize (e.g. printing
and phone ringing) and the bottom layer (z_t) to the intrinsic sub-events
that an event consists of. For example, the sound event of
printing can be segmented into a sub-event for the sound the printer
makes when loading paper from a tray, one for the actual printing
sound, and one for the sound the printer makes while feeding the
printed paper to the output tray. A finished state variable (f_t) is a
binary indicator that signals when the top-layer state has finished its
sequence.
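To illustrate the role of f_t, the following generative sketch (Python; not from the paper, with hypothetical hand-set probabilities in place of the learned parameters) shows how the finished variable gates the top layer: when the current event finishes, a new event and sub-event cluster are drawn; otherwise the event persists and only the sub-event cluster may change.

import numpy as np

rng = np.random.default_rng(0)
K, L = 16, 3   # hypothetical: 16 sound events, 3 sub-event clusters per event

# Hypothetical parameters; in the model these are learned from data with EM.
A_event = np.full((K, K), 1.0 / K)     # event-to-event transition probabilities
A_sub = np.full((K, L, L), 1.0 / L)    # per-event sub-event cluster transitions
p_finish = np.full((K, L), 0.05)       # probability that event k ends in cluster l

def step(y, z):
    # One transition of the two-layer chain (y_t, z_t, f_t).
    f = rng.random() < p_finish[y, z]
    if f:
        y_next = rng.choice(K, p=A_event[y])   # new top-layer event
        z_next = rng.choice(L)                 # re-initialise the sub-event cluster (uniform here for simplicity)
    else:
        y_next = y                             # the event continues
        z_next = rng.choice(L, p=A_sub[y, z])  # sub-event cluster transition within the event
    return y_next, z_next, f

y, z = 0, 0
for t in range(5):
    y, z, f = step(y, z)
    print(t, y, z, f)

In the actual model all of these distributions are estimated from data, as described below.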
Although it is possible to train such a model using data which
is annotated with labels of both sound events and the sub-events,
we train the model using only labels for the top layer sound events.
There are two advantages to this approach: (i) annotating the data
becomes significantly less involved when only the sound events
have to be annotated and (ii) we do not force any structure upon
the model with respect to the sub-events, but rather let the model
find this structure in the data automatically. The automatic alloca-
tion of structure can be considered as a clustering task. The clusters
found in the data do not necessarily have to be meaningful clusters
that correspond to actual sub-events that are intuitive to humans.
We therefore refer to them as ‘sub-event clusters’ to indicate the
sub-events were found through clustering.
The observations are modeled using a multidimensional Gaussian
distribution, in which each sub-event cluster is associated with a single
Gaussian: p(x_t | y_t = k, z_t = l) = N(x_t | μ_kl, Σ_k). Note that the
covariance matrix Σ_k only has a subscript k, meaning that we use a
different covariance matrix for each sound event, but that this covariance
matrix is shared among the sub-event clusters of a particular sound
event k. The ideal number of sub-event clusters to use
is determined through experimentation. The model parameters are
learned using the Expectation-Maximization (EM) algorithm and
novel sequences are inferred using the Viterbi algorithm.
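As a rough sketch of this observation model (Python; not the authors' implementation, with placeholder shapes and parameters, whereas in the paper the means and shared covariances are learned with EM), the per-frame log-likelihoods that the EM and Viterbi recursions operate on could be computed as follows:

import numpy as np
from scipy.stats import multivariate_normal

def observation_loglik(X, means, covs):
    # log p(x_t | y_t = k, z_t = l) for every frame t, event k and sub-event cluster l.
    # X: (T, D) features, means: (K, L, D), covs: (K, D, D), shared per event k.
    T, D = X.shape
    K, L, _ = means.shape
    ll = np.empty((T, K, L))
    for k in range(K):
        for l in range(L):
            ll[:, k, l] = multivariate_normal.logpdf(X, means[k, l], covs[k])
    return ll

# Toy dimensions: 16 events, 1 sub-event cluster, 35-dimensional features.
rng = np.random.default_rng(0)
K, L, D, T = 16, 1, 35, 100
means = rng.normal(size=(K, L, D))
covs = np.stack([np.eye(D)] * K)      # one (shared) covariance per sound event
X = rng.normal(size=(T, D))
print(observation_loglik(X, means, covs).shape)   # (100, 16, 1)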
4. META-CLASSIFIER
In the context of supervised learning, algorithms typically
search through a hypothesis space to find a hypothesis that can be
used to make good predictions for the particular problem at hand.
Ensemble learning is a technique for combining numerous weak
learners in an attempt to produce a strong learner. An ensemble
also represents a single hypothesis in the solution space. However,
this hypothesis is not necessarily contained within the space of the
models used to construct the ensemble. Therefore ensembles typi-
cally have more flexibility in the functions they can represent, which
can result in a reduction of model bias [7]. Considering the typical
bias-variance decomposition and the bias-variance trade-off, an in-
crease in model complexity is often associated with an increase in
variance, since the more complex model is potentially more prone
to overfitting the training data. To minimize these effects we use
Bagging, which is short for Bootstrap Aggregating.
Bagging involves training each model in the ensemble using
a randomly selected subset of the training set (to promote model
diversity) and combining predictions with an equal weight voting
scheme (to avoid overfitting). A good basis for ensemble meth-
ods is a set of models that individually provide good predictions
while their diversity (i.e. disagreement) is high [8]. After some
experimentation with the provided datasets we have selected deci-
sion trees as the basic models of our ensemble classifier. The use
of Bagging in combination with decision trees and subset feature
selection gives an ensemble method known as Random Forest [9].
Random Forests do not use any temporal information in modeling
the data, which can limit their effectiveness on problems with strong
temporal dependencies such as sound event detection. Therefore, a
temporal probabilistic model such as the HHMM (as described in
the previous section) is combined with the Random Forest.
The final decision is taken by a meta-classifier that uses the
predictions of the Random Forest and the HHMM as input. Since
audio data have a strong temporal character, a model that is designed
to exploit this property is a good candidate for the meta-classifier.
In our case a second instance of the HHMM is used as a meta-
classifier. The output of this model (from here on referred to as
meta-HHMM) is used as the final classification result.
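A minimal sketch of this two-stage setup is given below (Python with scikit-learn; the exact inputs to the meta-classifier are not detailed in the paper, so using both models' per-frame class posteriors as the meta-HHMM's observation vector is an assumption, and all data here are random placeholders).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
T, D, K = 500, 35, 16
X_train = rng.normal(size=(T, D))                    # per-frame audio features
y_train = rng.integers(0, K, size=T)                 # per-frame event labels
hhmm_post_train = rng.dirichlet(np.ones(K), size=T)  # stand-in for first-stage HHMM posteriors

# Stage 1: bagged decision trees with random feature subsets (Random Forest),
# trained frame by frame, i.e. without temporal information.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
rf_post_train = rf.predict_proba(X_train)

# Stage 2: the meta-classifier observes both models' per-frame class posteriors;
# in the paper this observation sequence is modeled by a second HHMM (the
# meta-HHMM), whose output is the final classification result.
meta_obs = np.column_stack([rf_post_train, hhmm_post_train])   # concatenated per-frame posteriors
# meta_hhmm.fit(meta_obs, y_train)   # second-stage temporal model (not shown here)

In practice the first-stage posteriors fed to the meta-classifier should be produced on data not used to train the corresponding first-stage model (for instance via cross-validation folds), otherwise the meta-classifier receives overly optimistic inputs.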
5. EXPERIMENT
We use the dataset that was created for the IEEE challenge “Detec-
tion and classification of acoustic scenes and events” [4]. For the
event detection challenge the organizers provided around 20 train-
ing examples of 16 different classes that can occur in an office en-
vironment, such as printer, speech, and phone. A development set
was provided for evaluation prior to submission and a separate test
dataset was used by the organizers to evaluate submissions. The
test set is not available to us at the time of writing this paper, but we
did receive the performance measures on the test set of the models
we submitted to the challenge (see Table 2). Both the test set and
the development set consist of multiple recordings, each roughly 1 min
long, recorded in various office environments ('Office Live') and
annotated by two people ('bdm' and 'sid')³. The results for
the HHMM were obtained using a single sub-event cluster. In the
rest of this section we discuss the results we received on the test
dataset and explain the performance of our models in further detail
by presenting additional experiments on the development set.
³The development set consists of three recordings, referred to as
'script01', 'script02' and 'script03'.
Table 2 shows the results of the meta-HHMM classifier and the
HHMM on the test dataset. The meta-HHMM classifier makes as-
sumptions about the event distribution, based on the distribution of
events in the development set. Our hypothesis was that the distribu-
tion of events in the development set would be predictive of the dis-
tribution in the test set and hence result in an improved performance
of the meta-classifier over the HHMM. However, the HHMM out-
performs the meta-HHMM classifier in terms of F-measure on all
three evaluation methods (i.e. event-based, class-wise event-based,
and frame-based), because the distribution of events in the test data
set differs significantly from the distribution in the development
data set.
              F-measure (%)
Model         Event-based   Class-wise event-based   Frame-based
HHMM          34.5          33.5                     45.5
Meta-HHMM     32.6          29.4                     40.9

Table 2: F-measure performance of the meta-HHMM classifier and
the HHMM on the test dataset. The results were provided by the
organizers of the contest.
The recognition performance of the HHMM on the develop-
ment set is shown in Table 3. The first four event-based measures in
the table are calculated using onset only: each event is considered
to be correctly detected if its onset lies within a 100 ms tolerance
window. The offset measures in the table are calculated using both
onset and offset: each event is correctly detected if its onset is within
a 100 ms tolerance window and its offset lies within 50% of the event's
duration from the ground truth offset [4]. We
see a slight drop in performance for the offset measure due to the
additional constraint. Precision is higher than recall for all three
evaluation methods. In other words, our model is more likely to
miss (part of) a sound event than to recognize a sound event when
no event was taking place. This can be seen in Figure 2, where the
results of the HHMM are visualized together with the ground truth
for two of the development files. Further inspection of the results re-
veals that the ‘switch’ event is never recognized and the ‘pen drop’
and ‘knock’ events are difficult classes to detect. Other misclassi-
fications are more ‘acceptable’, because the mixed classes are very
similar, such as speech, laughter, cough, and clear throat.
                  Evaluation method
Metric            Event-based   Class-wise event-based   Frame-based
Recall (R)        40.3          40.3                     43.2
Precision (P)     48.2          44.5                     67.1
F-Measure         43.5          39.1                     52.1
AEER              1.43          1.34                     0.96
Offset R          34.7          34.7                     –
Offset P          41.6          39.0                     –
Offset F-Measure  37.5          33.7                     –
Offset AEER       1.60          1.51                     –

Table 3: Results of the HHMM with one sub-event cluster per sound
event.
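As an approximation of the onset-only event matching described above (a simplified, single-class Python sketch; the official challenge evaluation additionally handles class-wise averaging, the AEER metric, and the offset condition, which are omitted here):

def onset_f_measure(ref_onsets, est_onsets, tol=0.1):
    # Onset-only matching: an estimated event counts as a true positive if its
    # onset lies within `tol` seconds (100 ms) of an unmatched reference onset.
    ref = sorted(ref_onsets)
    matched = [False] * len(ref)
    tp = 0
    for onset in sorted(est_onsets):
        for i, r in enumerate(ref):
            if not matched[i] and abs(onset - r) <= tol:
                matched[i] = True
                tp += 1
                break
    fp = len(est_onsets) - tp
    fn = len(ref) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example: one reference event at 2.0 s, detections at 2.05 s and 7.3 s.
print(onset_f_measure([2.0], [2.05, 7.3]))   # (0.5, 1.0, 0.666...)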
Our HHMM can be considered as a combination of building
blocks that correspond to commonly used machine learning mod-
els. Table 4 shows the recognition performance of these individual
building blocks in comparison with the HHMM. The observation
model corresponds closely to a Gaussian Mixture Model (GMM); it
performs the worst because it does not take into account any tempo-
ral information and treats the data as independently and identically
distributed (IID). By adding temporal dependencies to the model
we obtain a hidden Markov model (HMM), which results in a sig-
nificant increase in performance. Finally we obtain our proposed
HHMM by adding the intermediate layer and finished state variable,
resulting in a further significant increase in performance. The table
also lists the performance of the meta-HHMM on the development
set and an upper bound which is the maximum annotator agreement
between the two annotations that were provided. The meta-HHMM
outperforms the HHMM in this case, but data from the development
set was used to estimate the prior distribution of events, making this
a biased result for the meta-HHMM.
              F-measure (%)
Model         Event-based   Class-wise event-based   Frame-based
GMM           8.5           20.3                     26.8
HMM           24.4          29.4                     43.7
HHMM          43.5          39.1                     52.1
Meta-HHMM*    47.5          41.3                     54.4
Upper bound   87.1          88.2                     92.1

Table 4: F-measure performance of the GMM, HMM, HHMM and
the meta-HHMM on the development set (averaged over the three
files). The upper bound is the maximum annotator agreement. (*:
the meta-HHMM uses a prior distribution of the events, which was
estimated on the development set, making this a biased result.)
6. DISCUSSION
Standard feature ranking methods indicate that the lower MFCCs,
lower LPCs, and spectral moments are the most predictive features,
either because they are very predictive for one class in particular
(e.g. first MFCC for ‘cough’) or because they have relatively small
within-class variation for some classes (e.g. spectral roll-off). The
combination of all features gives a substantially higher performance
than MFCCs alone. In future work we plan to investigate additional
time-frequency domain features because they have been shown to im-
prove performance compared to only using frequency domain fea-
tures such as MFCCs [2]. Moreover, we will adopt a feature selec-
tion process to select the most predictive features.
The results are obtained using a single sub-event cluster per
sound event. We experimented with various numbers of sub-event
clusters, but obtained the best performance when using a single
sub-event cluster. This is surprising, since with a single sub-event cluster
the model does not benefit from modeling transitions between sub-
events and the resulting model corresponds closely to a standard
hidden Markov model (HMM). The difference between the hierar-
chical model using a single sub-event cluster and the HMM is that
the hierarchical model explicitly models the finishing of a state.
A limitation of our current HHMM implementation is that the
same number of sub-event clusters is used for all sound events.
Figure 2: The top bar (result) shows the classification result of the HHMM for the first 80 seconds of (a) script01, with relatively high recognition performance, and (b) script03, with lower recognition performance. The bottom bar (ann) is the annotation of 'sid'. The 16 event classes on the vertical axis are: alert, clearthroat, cough, doorslam, drawer, keyboard, keys, knock, laughter, mouse, pageturn, pendrop, phone, printer, speech, and switch; the horizontal axis is time (s).
Close inspection of the data reveals that there are a number of rel-
atively simple sound events (e.g. cough) that do not benefit from
a richer hierarchical structure. Using additional sub-event clusters
results in an increase in the number of parameters that need to be
estimated. The overall recognition performance decreases because
the increased modeling complexity does not benefit the majority of
the classes, while the increase in number of parameters results in a
poorer estimation of the parameter values. Ideally, we would use a
different number of sub-event clusters for each sound event, to be
able to use the optimal number of clusters for each event.
Meta-classifiers can be extremely effective in combining the
strengths of multiple classifiers and are often preferable to a simple
voting scheme. We saw that the meta-HHMM outperformed the
HHMM on the development set, but underperformed on the test set.
Because the meta-HHMM used data from the development set in
its training phase (to estimate the prior probabilities of events), the
result on the development set cannot be regarded as a proper evalu-
ation. The results on the test set show that the meta-HHMM
underperforms the HHMM, indicating that the model does not general-
ize well. However, in applications where the distribution of events
can be expected to be relatively fixed, the model is expected to out-
perform the HHMM. Further experiments on additional datasets are
needed to verify this expectation.
7. CONCLUSION
We presented a two-layer hierarchical hidden Markov model for
recognizing sound events in which one of the layers corresponds to
sound events and one layer corresponds to sub-event clusters. The
results of the hierarchical model were used in a meta-classifier in
combination with a random forest in an attempt to boost the classi-
fication results further. Our experimental results show that our as-
sumptions about the data distribution were too strong and therefore
resulted in the meta-classifier overfitting to the data rather than
improving performance. Instead, the stand-alone use of the hierar-
chical model using a single sub-event cluster per sound event gave
the best performance. With a systematic comparison we show how
modeling temporal dependencies and hierarchical structure leads to
a significant increase in recognition performance. Future work will
focus on creating a hierarchical model in which a different number
of sub-event clusters per sound event can be used.
8. ACKNOWLEDGMENT
We are grateful to Christian Debes, Roel Heremans, and James Rex
for their contribution.
9. REFERENCES
[1] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic
event detection in real life recordings,” in 18th European Sig-
nal Processing Conference (EUSIPCO-2010), 2010, pp. 1267–
1271.
[2] S. Chu, S. Narayanan, and C.-C. Kuo, “Environmental sound
recognition with time-frequency audio features,” IEEE Trans-
actions on Audio, Speech, and Language Processing, vol. 17,
no. 6, pp. 1142–1158, 2009.
[3] N. Oliver, E. Horvitz, and A. Garg, “Layered representations
for human activity recognition,” in Fourth IEEE International
Conference on Multimodal Interfaces (ICMI), 2002, pp. 3–8.
[4] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. La-
grange, and M. Plumbley, “Detection and classification of
acoustic scenes and events,” School of Electronic Engineer-
ing and Computer Science, Queen Mary University of London,
Tech. Rep. EECSRR-13-01, 2013.
[5] J. D. Krijnders, M. E. Niessen, and T. C. Andringa, “Sound
event recognition through expectancy-based evaluation of
signal-driven hypotheses,” Pattern Recognition Letters, vol. 31,
no. 12, pp. 1552–1559, 2010.
[6] T. L. M. Van Kasteren, G. Englebienne, and B. J. A. Kröse, “Hi-
erarchical activity recognition using automatically clustered ac-
tions,” in Proceedings of the Second International Conference
on Ambient Intelligence, 2011, pp. 82–91.
[7] D. Wolpert, “Stacked generalization,” Neural Networks, vol. 5,
no. 2, pp. 241–259, 1992.
[8] L. Kuncheva and C. Whitaker, “Measures of diversity in classi-
fier ensembles,” Machine Learning, vol. 51, pp. 181–207, 2003.
[9] L. Breiman, “Random forests,” Machine learning, vol. 45,
no. 1, pp. 5–32, 2001.