NON-SPEECH AUDIO EVENT DETECTION
José Portêlo1, Miguel Bugalho1,2, Isabel Trancoso1,2, João Neto1,2, Alberto Abad1, António Serralheiro1,3
1INESC-ID Lisboa, Portugal
2IST, Lisboa, Portugal
3Military Academy, Portugal
Audio event detection is one of the tasks of the European
project VIDIVIDEO. This paper focuses on the detection of
non-speech events, and as such only searches for events in
audio segments that have been previously classified as non-
effects have shown the potential of this type of corpus for
training purposes. This paper describes our experiments with
SVM and HMM-based classifiers, using a 290-hour corpus of
sound effects. Although we have only built detectors for 15
semantic concepts so far, the method seems easily portable
to other concepts. The paper reports experiments with multi-
ple features, different kernels and several analysis windows.
Preliminary experiments on documentaries and films yielded
promising results, despite the difficulties posed by the mix-
tures of audio events that characterize real sounds.
Index Terms— audio segmentation, event detection
1. INTRODUCTION
The framework for this work is the European project VIDI-
VIDEO, whose goal is to boost the performance of video
search engines by forming a 1000 element thesaurus. Instead
of carefully modeling each different semantic concept, the ap-
proach is to apply machine learning techniques to train many,
possibly weaker detectors, describing different aspects of the
audio-video content. The combination of many single class
detectors will render a much richer basis for the semantics.
The integration of cues derived from the audio signal is es-
sential for many types of search concepts. Our role in the
project is to contribute towards this integration with three dif-
ferent modules: audio segmentation, speech recognition, and
detection of audio events. This paper concerns the last module.
Audio Events Detection (AED) is a relatively new re-
search area with ambitious goals. Typical AED frameworks
are composed of at least two parts: feature extraction and au-
dio event inference. Optionally, there may be an intermediate
stage of key audio effect detection, typically based on Hidden
Markov Models (HMMs), that explores the time structure of
the events and/or models interconnections between key audio
effects (e.g. an explosion being preceded by a car crash).
The feature extraction process deals with different types
of features, such as total spectral power, sub-band power,
brightness, bandwidth, MFCC (Mel-Frequency Cepstral Co-
efficients), PLP (Perceptual Linear Prediction), ZCR (Zero
Crossing Rate), pitch frequency, etc. Brightness and band-
width are, respectively, the first and second order statistics of
the spectrogram, and they roughly measure the timbre qual-
ity of the sound. Many of these features are common to the
audio segmentation and speech recognition modules. Because
a large number of features can be extracted, considering
them all can lead to lengthy training processes due to slow
convergence of the classification algorithms. In this situation,
it is common practice to use feature reduction techniques such
as Principal Components Analysis (PCA) and Linear
Discriminant Analysis (LDA). PCA maps the features into a
new vector space where the greatest variance by any projection
of the data lies on the first coordinate, the second greatest
variance lies on the second coordinate, and so on. LDA instead
seeks the projection that best separates the classes.
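As an illustration of the three additional features named above, the following minimal sketch computes ZCR, brightness and bandwidth for one analysis frame. It assumes that brightness is the power-weighted mean frequency of the frame spectrum and that bandwidth is the power-weighted standard deviation around it; the paper does not give the exact formulas, and the function names are hypothetical.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), enough for a sketch)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def brightness_and_bandwidth(frame, sample_rate):
    """Assumed definitions: spectral centroid and spread, both power-weighted."""
    power = power_spectrum(frame)
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(power))]
    total = sum(power)
    if total == 0.0:
        return 0.0, 0.0
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    variance = sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    return centroid, math.sqrt(variance)


# Example: a pure 1 kHz tone sampled at 8 kHz (illustrative values only).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
brightness, bandwidth = brightness_and_bandwidth(tone, sample_rate)
zcr = zero_crossing_rate(tone)
```

For a pure tone, the brightness sits at the tone frequency and the bandwidth is close to zero, which is why these two statistics give a rough measure of timbre.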
In the inference process, various machine learning methods
are used to provide a final classification of the audio
events, such as rule-based approaches (RB), Gaussian
mixture models (GMMs), Support Vector Machines (SVMs),
and Bayesian Networks.
In this work we used HMMs and SVMs for building a
one-against-all classifier for each semantic concept. This
approach allows an easy extension to new semantic concepts,
although better results could potentially be achieved by
Given the unavailability of a corpus labeled in terms of
audio events, we used a sound effect corpus for training. The
potential of this type of corpus was demonstrated in early
experiments with a small pilot corpus. The extended training
corpus and the small test corpus of documentaries and movies
will be described in section 2. Section 3 motivates
our two-stage AED approach that first distinguishes between
speech and non-speech audio events. Section 4 describes our
multiple experiments with one-against-all detectors. Finally,
section 5 presents the main conclusions and future plans.
978-1-4244-2354-5/09/$25.00 ©2009 IEEE, ICASSP 2009
2. CORPORA AND EVALUATION METRICS
The first corpus we considered for the task of audio events
detection was a small pilot corpus of 422 sound effects files,
totaling 6.8h, provided by B&G, one of the partners of the
project. The choice of a sound effects corpus was made be-
cause it is intrinsically labeled, as each file typically contains
a single type of sound. Since the initial results were quite
promising, we moved on to a larger corpus of approximately
18700 files with an estimated total duration of 289.6h, also
provided by B&G. The corpus includes enough training mate-
rial for over 40 different audio events, but so far we have only
considered 15. This initial list is presented in Table 1, together
with the number of files and corresponding duration that were
used as training/development corpus for each classifier. Most
of the files have a sampling rate of 44.1kHz. However, many
were recorded with a low bandwidth (<10kHz).
In order to test the one-against-all detectors in a real life
situation, we manually labeled a number of movies, docu-
mentaries (DOC), talk shows (TS) and broadcast news (BN)
that were likely to contain this initial list of audio events. This
real life corpus covers 13 of the 15 audio events.
The development experiments described in this paper will
be assessed in terms of the well-known F-measure. However,
the experiments with the evaluation set will be assessed both
in terms of the ratio (prp) of true positives (tp) over total num-
ber of positives (p), and the ratio (prn) of true negatives (tn)
over total number of negatives (n). In this work the detection
performance (in every metric) is frame-based. Classification
results in the test set are smoothed over time.
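The frame-based metrics above can be sketched as follows. This is a minimal illustration that assumes p and n are counted on the reference labels, and that the F-measure is the balanced harmonic mean of precision and recall; the function name and the example labels are hypothetical.

```python
def frame_metrics(reference, predicted):
    """Frame-based prp, prn and F-measure for one detector.

    reference, predicted: equal-length sequences with one truthy/falsy
    label per frame (truthy means the audio event is present).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, h in pairs if r and h)          # true positives
    tn = sum(1 for r, h in pairs if not r and not h)  # true negatives
    fp = sum(1 for r, h in pairs if not r and h)      # false positives
    p = sum(1 for r in reference if r)                # reference positives
    n = len(reference) - p                            # reference negatives
    prp = tp / p if p else 0.0
    prn = tn / n if n else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = prp
    # Assumption: balanced F-measure (F1).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"prp": prp, "prn": prn, "f_measure": f_measure}


# Example with eight frames (made-up labels, not data from the paper).
reference = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = frame_metrics(reference, predicted)
```

Reporting prp and prn separately shows how a detector behaves on frames with and without the event, which a single F-measure would hide when the event is rare.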
3. TWO-STAGE AED
Our initial experiments with the pilot sound effect corpus led
us to adopt a two-stage approach for audio event detection.
The first stage applies a speech/non-speech detector. This
stage attempts to separate the events that are typically
produced by the human speech production system (not
only speech, but also laughing, crying, screaming, etc.), from
the ones that are not related to human voice. In the sec-
ond stage, separate classifiers attempt to detect either speech-
related events or non-speech events, according to the initial
classification. This paper addresses only the last category.
The original speech/non-speech (SNS1) detector is based
on an MLP (Multi-Layer Perceptron) trained with PLP fea-
tures, extracted from a corpus of broadcast news. Although
the performance of the classifier is very good for this domain,
the type of non-speech events is quite limited (e.g. jingles).
When tested in the sound effect corpus, the SNS1 detector
sometimes detects speech in non-speech events, as shown in the
fourth column of Table 1, which contains the duration of the
detected speech segments.
This observation motivated the retraining of the detector
including non-speech examples randomly selected from the
large sound effects corpus (excluding the files that were used
for training each audio event classifier). The results obtained
with the new detector (SNS2), given in the last column of
Table 1, show an excellent false positive ratio (non-speech
classified as speech), except for the Helicopter concept.
Additionally, on a speech database, SNS2 achieved speech
detection performance equivalent to that of the original SNS1
detector.
Table 1. List of audio events: number of files, total duration,
and amount of data misclassified as speech (seconds).
4. ONE-AGAINST-ALL DETECTORS
With the objective of obtaining simple one-against-all detec-
tors, we have built “concept-specific” and “world” models
for the list of audio events. Our first experiments were
carried out using the LIBSVM toolkit, for the 15 concepts.
Then we performed, in parallel, experiments using HMMs
trained with the HTK toolkit and feature dimensionality
reduction techniques, for a restricted number of concepts.
4.1. SVM classifiers
The initial experiments were made with the purpose of eval-
uating the event detection results provided by different com-
binations of well-known features that will serve as a baseline
for future comparisons. At this stage we only considered the
use of PLP or MFCC (19 coefficients + energy + deltas) and
3 additional features: brightness, bandwidth and ZCR. The
“world” model was built using between 92 and 96 files, of
which an average of 31 were used as the development set. As
a starting point, analysis windows of 0.5s with 0.25s overlap
were adopted. Three different kernels were considered for the
SVM (linear, polynomial and radial basis function (RBF)),
but only the results for the RBF kernel are shown, as they
were overall better than the others. The results for these
initial experiments on all considered audio events are presented
in Table 2. The results obtained on the test set using the best
combination of features on the development set are shown in
Table 3. These results confirm that detecting audio events in
real life data is much more challenging than the classification
of isolated events. We expect that AED can benefit from
incorporating time structure models and new features.
Table 2. SVM results for the development set (F-measure).
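The analysis windows of 0.5s with 0.25s overlap mentioned in this section can be sketched as follows. The sketch assumes that a trailing partial window is simply dropped, which the paper does not specify; the function name is hypothetical.

```python
def analysis_windows(num_samples, sample_rate, window_s=0.5, overlap_s=0.25):
    """Return (start, end) sample indices of overlapping analysis windows.

    Assumption: windows that would run past the end of the signal are dropped.
    """
    window = int(round(window_s * sample_rate))
    hop = int(round((window_s - overlap_s) * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window must be positive and longer than the overlap")
    return [
        (start, start + window)
        for start in range(0, num_samples - window + 1, hop)
    ]


# Example: 2 seconds of audio sampled at 44.1 kHz.
windows = analysis_windows(num_samples=88200, sample_rate=44100)
```

One feature vector is then computed per window, so each file contributes one vector every 0.25s to the SVM training or test data.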
4.2. HMM classifiers: Modeling time structure
After the initial experiments with SVMs, we tried to take ad-
SVMs are a powerful machine learning tool, some other tools,
Some of the 15 chosen audio events present a strong periodic
nature, such as Airplanes, Helicopters and Sirens. We have
chosen Sirens to test the HMM approach, due to their very
distinct frequency characteristics. Left-to-right models with
varying numbers of states and Gaussian mixtures were trained,
and these parameters were tuned according to the development
set results. MFCC features (12 coefficients + energy + deltas) of
three different window lengths were used. In these experi-
ments, the audio files have been down-sampled to 16kHz.
The results for the test set are shown in Table 4. These
were obtained using the number of states and mixtures that
yielded the best results on the development set. Even using
a more limited feature set, the results for the 20ms window
length show a small improvement over the previous SVM re-
sults (0.43 mean positive detection, compared with 0.29 for
the SVMs). Only for the second file were the results worse.
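To make the left-to-right topology concrete, the following minimal sketch builds an initial transition matrix in which each state can only stay where it is or advance to the next state. The self-loop probability of 0.6 is an arbitrary starting value, the function name is hypothetical, and the non-emitting entry and exit states used by HTK are left out; none of these details are taken from the paper.

```python
def left_to_right_transitions(num_states, self_loop=0.6):
    """Row-stochastic matrix for a left-to-right HMM without skips.

    State i may stay in i or move to i + 1; the last state only loops.
    Assumption: self_loop is an initial guess that training would re-estimate.
    """
    if num_states < 1 or not 0.0 < self_loop < 1.0:
        raise ValueError("need num_states >= 1 and 0 < self_loop < 1")
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        if i == num_states - 1:
            matrix[i][i] = 1.0
        else:
            matrix[i][i] = self_loop
            matrix[i][i + 1] = 1.0 - self_loop
    return matrix


# Example: a 4-state model, one of several sizes that could be tried.
transitions = left_to_right_transitions(4)
```

Because backward transitions have zero probability, the model is forced to follow the temporal order of the sound, which is what makes it suitable for events with a clear time structure.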
Table 3. SVM results for the test set (prp and prn).
Table 4. Results of training HMMs with several window
lengths, for the test set (Sirens).
4.3. Extended feature set
In the several experiments carried out throughout this work
we could verify that the results of the SVM classifiers were
highly dependent on the set of features. Since the Siren
audio event has distinct frequency characteristics, we have
explored an extended set of features that includes pitch. We
have also tested a different method for representing feature
variation, Shifted Delta Cepstrum (SDC) (parameters:
d=1, P=2, k=2). Because the pitch was extracted using 20ms
windows and all the other features were extracted using
500ms windows, for every feature vector we included several
pitch values. The total size of the extended feature vector
is 52. Table 5 shows the results for the SVMs using PLPs
(with deltas or SDC), the 3 additional features and pitch. The
results were slightly worse compared to Table 3.
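The SDC representation with the parameters d=1, P=2, k=2 can be sketched as follows. The sketch uses the usual definition, in which k delta blocks c(t + iP + d) - c(t + iP - d), for i = 0..k-1, are stacked for each frame t. Clamping the indices at the signal edges is an assumption, and the function name is hypothetical.

```python
def shifted_delta_cepstrum(frames, d=1, P=2, k=2):
    """Stack k shifted delta blocks for every frame.

    frames: list of per-frame coefficient lists (all the same length).
    Assumption: indices outside the signal are clamped to the first/last frame.
    """
    last = len(frames) - 1

    def clamp(index):
        return max(0, min(last, index))

    sdc = []
    for t in range(len(frames)):
        stacked = []
        for i in range(k):
            ahead = frames[clamp(t + i * P + d)]
            behind = frames[clamp(t + i * P - d)]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        sdc.append(stacked)
    return sdc


# Example: six frames of two coefficients that grow linearly over time.
frames = [[float(t), 2.0 * t] for t in range(6)]
sdc = shifted_delta_cepstrum(frames)
```

Each output vector has k times as many values as an input frame, so SDC describes how the coefficients evolve over a longer time span than a single delta does.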
Table 5. SVM results with extended features (Sirens).
4.4. Data dimensionality reduction
The results obtained by adding the pitch feature have shown
that increasing the number of features may decrease the
performance of the SVMs. This motivated the use of PCA to
perform feature dimensionality reduction on the Siren audio
event data. One of the advantages of PCA is to allow for a
faster execution of the training process by reducing the
number of features. Moreover, by combining the most
discriminating features into a small set, the PCA removes
unimportant data that can decrease the performance of machine
learning algorithms such as SVMs. Table 6 shows the results using
are calculated in the training set and their respective variance
coverage rate is verified in the development set. The results
show significant improvements relative to the results using
pitch, and are better than the initial SVM results for the test
set.
Table 6. SVM results with PCA features (Sirens).
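The procedure of estimating the components on the training set and checking their variance coverage on the development set can be sketched as follows. This is a minimal illustration restricted to the leading component, obtained by power iteration on the training covariance matrix. The data values are made up, the function names are hypothetical, and measuring coverage around the training mean is an assumption.

```python
import math


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mean):
    """Sample covariance matrix of the rows around the given mean."""
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    dim = len(mean)
    return [
        [sum(r[i] * r[j] for r in centered) / (len(rows) - 1) for j in range(dim)]
        for i in range(dim)
    ]


def leading_component(cov, iterations=200):
    """Power iteration: dominant eigenvector and eigenvalue of cov."""
    dim = len(cov)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    projected = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
    eigenvalue = sum(v * w for v, w in zip(vec, projected))
    return vec, eigenvalue


def variance_coverage(rows, mean, component):
    """Share of the spread of rows (around mean) captured along component."""
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    along = sum(sum(x * c for x, c in zip(r, component)) ** 2 for r in centered)
    total = sum(x * x for r in centered for x in r)
    return along / total if total else 0.0


# Made-up two-dimensional data lying close to the line y = 2x.
train = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
dev = [[1.5, 3.1], [2.5, 4.9], [3.5, 7.2]]

train_mean = mean_vector(train)
cov = covariance(train, train_mean)
component, eigenvalue = leading_component(cov)
train_coverage = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
dev_coverage = variance_coverage(dev, train_mean, component)
```

If the coverage measured on the development set stays close to the one measured on the training set, the retained components generalize and the discarded dimensions carry little of the variance.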
5. CONCLUSIONS AND FUTURE WORK
The initial experiments presented in this work allowed us to
conclude that the performance of the classifiers in the sound
effect corpus can be very different from the performance on
the real data test set, where several audio events can coexist
simultaneously and where recording conditions can be
significantly different. Even so, the advantages of using an
intrinsically labeled corpus, and the good results obtained in
some audio events, justify this choice of training corpus. We are
currently working towards reducing the differences between
the training/development and test data by using normalization
techniques, and we are also testing agglomerative clustering
approaches. We observed that HMMs are a promising method
for our AED task that justifies further tests. The use of fea-
ture dimensionality reduction methods is also worth pursuing,
particularly when dealing with several features that may
influence the detection of acoustic events differently.
6. REFERENCES
Amaral, R., Meinedo, H., Caseiro, D., Trancoso, I.,
and Neto J., “A Prototype System for Selective Dissem-
ination of Broadcast News in European Portuguese”,
EURASIP J. on Adv. in Signal Processing, Hindawi
Publishing Corporation, vol. 2007, n. 37507, May 2007.
Cai, R. et al. “A flexible framework for key audio events
detection and auditory context inference”, IEEE Trans.
on Speech and Audio Processing, 2005.
Chang, C. and Lin, C., “LIBSVM: a library for
support vector machines”, Manual, 2001. Online:
Cheng, W., Chu, W. and Wu, J., “Semantic context de-
tection based on hierarchical audio models”, Proc. 5th
ACM SIGMM Int. Workshop on Multimedia informa-
tion retrieval, pages 109-115, 2003.
Chu, W. et al. “A study of semantic context detection
by using SVM and GMM approaches”, Proc. IEEE Int.
Conf. on Multimedia and Expo, 2004.
Guo, G. and Li, S., “Content-based audio classification
and retrieval by support vector machines”, IEEE Trans.
on Neural Networks, 14(1):209-215, 2003.
Moncrieff, S. et al. “Detecting indexical signs in film
audio for scene interpretation”, Proc. IEEE Int. Conf.
on Multimedia and Expo, 2001.
Torres-Carrasquillo, P. A. et al. “Approach to Lan-
guage Identification using Gaussian Mixture Models
and Shifted Delta Cepstral Features”, Proc. ICSLP
2002, Denver, September 2002.
Trancoso, I. et al., “Training audio events detectors with
a sound effects corpus”, Proc. Interspeech 2008, Bris-
bane, September 2008.
Xu, M. et al. “Creating audio keywords for event detec-
tion in soccer video”, Proc. IEEE Int. Conf. on Multi-
media and Expo, 2003.
Young, S. et al. “HTK - Hidden Markov Model Toolkit”,
Manual, 2006. Online: http://htk.eng.cam.ac.uk/