Attentive Convolutional Recurrent Neural Network Using Phoneme-Level Acoustic Representation for Rare Sound Event Detection

Shreya G. Upadhyay1,2, Bo-Hao Su1,2, Chi-Chun Lee1,2
1Department of Electrical Engineering, National Tsing Hua University
2MOST Joint Research Center for AI Technology and All Vista Healthcare
Abstract
A well-trained acoustic sound event detection (SED) system captures the patterns of sound to accurately detect events of interest in an auditory scene, which enables applications across the domains of multimedia, smart living, and even health monitoring. Due to the scarcity and weakly labelled nature of sound event data, it is often challenging to train an accurate and robust acoustic event detection model directly, especially for rare occurrences. In this paper, we propose an architecture that takes advantage of integrating ASR network representations as additional input when training a sound event detector. We use a convolutional bi-directional recurrent neural network (CBRNN), which includes both spectral and temporal attention, as the SED classifier and further combine the ASR feature representations when performing end-to-end CBRNN training. Our experiments on the TUT 2017 rare sound event detection dataset show that with the inclusion of ASR features, the overall discriminative performance of the end-to-end sound event detection system improves; the average F-score and error rate of our proposed framework are 97.0% and 0.05, respectively.
Index Terms: sound event detection, convolution recurrent
neural network, attention, automatic speech recognition
1. Introduction
Sound event detection (SED) is an emerging research topic that aims at developing algorithms to automatically detect sound events present in an auditory scene. SED has found use in many applications, including multimedia retrieval [1], context-based indexing, automated surveillance systems, and unobtrusive health monitoring [2]. Recent work in SED has focused mainly on identifying improved neural architectures composed of network blocks, such as convolutional neural networks (CNN) [3] and recurrent neural networks (RNN) [4], that model time-series sound data and spectral information. Several hybrid compositions of these building blocks have been developed for SED and serve as the recent state of the art, such as long short-term memory (LSTM) with a hidden Markov model (HMM) [5], the capsule network model with Bi-LSTM [6], the convolutional recurrent neural network (CRNN) [7], and the convolutional bi-directional recurrent neural network (CBRNN) [8].
Compared with processing other acoustic data, such as music and speech, SED presents its own technical challenges because only a limited amount of sound data is available and it is weakly labelled. Another characteristic of SED is that rare events are often important in computational auditory scene analysis, e.g., detecting emergency events with high accuracy. Hence, besides the effort of developing complex models to improve SED accuracy, a knowledge transfer approach based on generating synthetic sound data has been considered in [9] to avoid the time-consuming manual labelling of real-world sound data; a transfer learning network [10] that takes speech data as the source to learn an appropriate feature extractor for sound event data has also been shown to improve detection accuracy.
While research has started to develop different knowledge transfer approaches to mitigate the issues of weak labelling and rare-event robustness in SED, most of these works are still limited to using the available SED databases only, without leveraging other types of high-resource data, e.g., speech data used for automatic speech recognition (ASR), which is not only much larger in scale but also has detailed acoustic properties (phonemic-acoustic characterization) that are much better studied and understood. In this paper, we propose a gradual network architecture that integrates the phoneme-level acoustic representations obtained from a pre-trained ASR model to improve SED performance. Our main SED network architecture is a convolutional bidirectional RNN (CBRNN) that includes both temporal and spectral attention mechanisms. The pre-trained ASR acoustic model is based on a factorized time-delay neural network (TDNN-F) [11] trained on the Librispeech corpus [12] (resulting in a WER of 3.76%). Our gradual network architecture takes the concatenation of the pre-final layer ASR features (i.e., from the TDNN-F) and the conventional sound event representations learned from the target SED database as input, and the network only updates the SED path while keeping the ASR-based representation frozen.
In this work, we evaluate our framework on the DCASE 2017 task 2 dataset [13], which is often used for detecting rare sound events, and compare it with variants of the CRNN model recently proposed as the state of the art. The experimental results surpass the conventional CBRNN model, achieving an F1 score of 98.0% for the babycry event, 95.4% for the gunshot event, and 97.0% on average. This boost in accuracy over the state of the art demonstrates that integrating phoneme-level acoustic representations, as predicted by a pre-trained ASR system, is indeed beneficial in enhancing the capacity of end-to-end SED model training. We further analyze the phoneme distribution of each event to understand how each sound event class is mapped to the respective phonemic classes of the pre-trained ASR model.
The rest of the paper is organized as follows. Section 2 describes the methodology, including the dataset, feature extraction, the proposed architecture, post-processing steps, and the evaluation metrics. Section 3 presents the experimental setup for ASR and SED training, with results and analysis on task 2 of the DCASE 2017 dataset. Section 4 concludes this work.
Copyright © 2020 ISCA
October 25–29, 2020, Shanghai, China
Figure 1: An illustration of the proposed architecture: the proposed gradual network concatenates the ASR network representations with the sound feature representations in the path of end-to-end SED training.
2. Methodology
2.1. Dataset
The SED model is trained on the task 2 data of DCASE 2017, which consists of monophonic isolated sound events for each target class mixed with background recordings of everyday acoustic scenes [13]. The dataset has three target events, “babycry”, “glassbreak”, and “gunshot”, and the background audio contains 15 different scene recordings. The organizers provide the synthesizer code together with synthesized audio, which can be used to generate 1500 different mixtures for each class. The event-to-background ratios are -6, 0, and 6 dB, the event occurrence probability is 0.5, and all mixtures are 30 s monaural audio clips. The dataset comprises two sets, the development set and the evaluation set; detailed information on the task and the dataset is given in [13]. We use 90% of the training data to train the model and the remaining 10% for validation to prevent over-fitting, and evaluation is done on the given evaluation set. Before further processing, we first convert the sampling rate to 16 kHz.
2.2. Feature Extraction
For SED training, we use log Mel-filter bank (Fbank) energies with 40 Mel-scale filters as acoustic features. The audio is sampled at 16 kHz, and the features are extracted with a frame size of 20 ms and 50% overlap. The extracted features are normalized using min-max normalization and down-sampled by taking the mean of five consecutive frames at a time. We then apply frame concatenation with a context window size of 2 to the input features; after context expansion, each frame has a total of 200 concatenated Fbank features (40 filters × 5 frames).
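The normalization, down-sampling, and context expansion steps described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the random matrix stands in for real log-Mel features, and the global (rather than per-dimension) min-max scaling, the edge padding at clip boundaries, and all function names are assumptions.

```python
import numpy as np

def minmax_normalize(x, eps=1e-8):
    """Scale the whole feature matrix to [0, 1] (global scaling is an assumption)."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def downsample_mean(x, factor=5):
    """Average every `factor` consecutive frames; a ragged tail is dropped."""
    n = (x.shape[0] // factor) * factor
    return x[:n].reshape(-1, factor, x.shape[1]).mean(axis=1)

def add_context(x, context=2):
    """Concatenate each frame with `context` neighbours on both sides.

    Edge frames are padded by repetition (an assumption).
    """
    padded = np.pad(x, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + x.shape[0]] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

rng = np.random.default_rng(0)
# Stand-in for a 30 s clip: 20 ms frames with 50% overlap give about 3000 frames of 40 Fbank values.
fbank = rng.normal(size=(3000, 40))
features = add_context(downsample_mean(minmax_normalize(fbank)))
print(features.shape)  # (600, 200)
```

With a context window of 2, each frame is joined with two neighbours on either side, which gives 5 × 40 = 200 values per frame.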
2.3. Proposed Model
2.3.1. Sound Event Detection (SED) Model
In this work, we use the CBRNN model as a binary SED classifier for each event. The input to the system is the extracted Fbank features with multi-frame concatenation [15], and the output is a binary prediction for each 100 ms frame. The CBRNN [8] structure consists of three parts: the CNN, the RNN, and the classification layer.
For the CNN part, the acoustic features are first fed to consecutive convolution layers; each convolution layer is followed by batch normalization [16] for each feature map, a linear activation function, and a dropout layer [17]. Max pooling layers are used to extract the important features on each feature map along both axes. At the end of the CNN part, the output feature maps are stacked along the frequency axis. The CNN model with max pooling can be described by a function $f$ that extracts $n$ features from a data patch $\omega$:
$$f(\omega) = \left( \sigma\left(\sum (W_1 \omega + b_1)\right), \ldots, \sigma\left(\sum (W_n \omega + b_n)\right) \right) \quad (1)$$
where $b_1, \ldots, b_n$ are the biases and $W_1, \ldots, W_n$ are the weight matrices.
In the RNN (GRU) part, the stacked output from the CNN is fed to the recurrent layers, followed by an activation function. Each recurrent layer produces a frame-wise output by taking the CNN output and the previous frame activation as input. For each frame, the total activation of the GRU layer is an interpolation of the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$. The GRU layer takes a sequence $(x_1, \ldots, x_t)$ as input and produces a sequence $(h_1, \ldots, h_t)$ of hidden states with a sequence $(y_1, \ldots, y_t)$ as outputs, as described in the following equation:
$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \quad (3)$$
where $\sigma$ is the logistic sigmoid function, $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices, and $b_h$ and $b_y$ are the biases.
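As a rough illustration of the recurrence in Eq. (3), the sketch below applies the hidden-state update frame by frame in NumPy. It covers only the simple recurrence written in that equation, not the gating of a full GRU; the zero initial state, the toy dimensions, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(x, W_xh, W_hh, b_h):
    """Apply h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h) over a sequence."""
    h = np.zeros(W_hh.shape[0])  # h_0 is assumed to be zero
    states = []
    for x_t in x:
        h = sigmoid(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))           # 10 frames with 8 features each (toy sizes)
W_xh = rng.normal(size=(32, 8)) * 0.1  # input-to-hidden weights
W_hh = rng.normal(size=(32, 32)) * 0.1 # hidden-to-hidden weights
b_h = np.zeros(32)
h_seq = recurrent_forward(x, W_xh, W_hh, b_h)
print(h_seq.shape)  # (10, 32)
```

A bidirectional layer would run the same recurrence a second time over the reversed sequence and concatenate the two hidden-state sequences frame by frame.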
For the classification layer, the output of the bidirectional GRU is fed to a fully connected layer to produce the classification result for each frame, and the sigmoid activation function is used to normalize the probabilities. After the CNN layers, each output frame covers 100 ms of audio; hence, a prediction is made for every 100 ms of data.
Table 1: System performance of our proposed model compared with other state-of-the-art models, shown in terms of error rate and F-score (ER | F-score in %).
Model       CRNN+TA+RA [8]  CRNN+TA [8]  1D-CRNN [14]  CRNN [7]   Finetuning Net  Gradual Net
Babycry     0.18|91.3       0.25|87.4    0.15|92.2     0.18|90.8  0.20|88.6       0.04|98.0
Glassbreak  0.04|98.2       0.05|97.4    0.05|97.6     0.10|94.7  0.32|83.5       0.04|97.8
Gunshot     0.17|90.8       0.18|90.6    0.19|89.6     0.23|87.4  0.38|79.0       0.06|95.4
Average     0.13|93.4       0.16|91.8    0.13|93.4     0.17|91.0  0.30|83.7       0.05|97.0
2.3.2. Attention Mechanism
In this work, we apply two kinds of attention: spectral and temporal. Temporal attention assigns different weights to the positive and negative frames, i.e., it attends to the frames where the event is most likely occurring. The CNN output is passed to a fully connected layer with $N_t$ hidden units, followed by an activation function, and then global max-pooling over the frequency axis gives a weight for each frame. These weights are then multiplied element-wise with the output of the fully connected layer. The attention weights are computed as follows:
An,t = max
where $W_n$ is the weight and $b_n$ is the bias for the hidden units, $C_t$ is the output from the CNN, $\hat{A}_{n,t}$ is the temporal attention weight, and $\sigma$ is an activation function.
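One possible reading of the temporal attention steps described above (fully connected layer, activation, global max-pooling, element-wise multiplication) is sketched below in NumPy. Because the text leaves the exact axes open, the choices here are assumptions: the softmax is taken over the time axis, the max-pooling is taken over the hidden-unit axis, and the weights are applied to the pre-activation output of the fully connected layer. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(C, W, b):
    """C: (T, D) CNN output; W: (D, N_t) weights; b: (N_t,) biases."""
    Z = C @ W + b                         # fully connected layer, shape (T, N_t)
    A = softmax(Z, axis=0)                # assumed: softmax across the time axis
    a_hat = A.max(axis=1, keepdims=True)  # global max-pooling: one weight per frame, (T, 1)
    return a_hat * Z, a_hat.ravel()       # re-weighted frames and the frame weights

rng = np.random.default_rng(0)
C = rng.normal(size=(300, 64))            # 300 frames, 64 CNN features (toy sizes)
W = rng.normal(size=(64, 200)) * 0.1      # N_t = 200 hidden units
b = np.zeros(200)
weighted, frame_weights = temporal_attention(C, W, b)
print(weighted.shape, frame_weights.shape)  # (300, 200) (300,)
```

Frames with larger weights contribute more to the representation passed on to the following layers.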
We also use spectral attention [8] to assign weights to the different spectral characteristics of the frequency components at each frame. The attention weights are calculated with the same structure as the temporal attention: the input of the CNN model is first passed to a fully connected layer with $N_s$ hidden units, followed by an activation function, and then global pooling over the time axis gives the attention weights. These weights are then multiplied element-wise with the input data to emphasize the important spectral characteristics. The weighted features are computed as:
An,t = max
where $\hat{A}_{n,t}$ is the spectral attention weight, $W_n$ is the weight and $b_n$ is the bias for the hidden units, $S_t$ is the acoustic input feature, $\sigma$ is an activation function, and $\hat{S}_t$ is the weighted feature.
2.3.3. Automatic Speech Recognition (ASR) Model
In our proposed model, we use the factorized time-delay neural network (TDNN-F) [11], pre-trained on the Librispeech dataset [12], as our main acoustic modelling structure. Unlike the conventional TDNN model, TDNN-F applies low-rank factorized layers to the TDNN, which substantially reduces the number of trainable parameters while giving better performance. This pre-trained model obtains a word error rate (WER) of 3.76% on the test-clean set and 8.92% on the test-other set of Librispeech. We extract features for the event detection audio from the pre-final layer output of this pre-trained ASR model and use them as the phoneme-level acoustic representations.
2.3.4. Gradual Network
The proposed gradual network combines the representations from the pre-trained ASR model with the sound feature representations in the end-to-end training of SED. It contains two parts: the ASR part, which is always static and not updated during training, and the CBRNN architecture for SED, which is trained from scratch. In this structure, we concatenate the pre-final layer ASR representation into the event detection training path, as shown in the gradual network part of Figure 1, and the concatenated vector is fed into a fully connected layer to classify the target events. This improves event detection performance by incorporating ASR features that provide a representation complementary to the specific sound event patterns, which can be hard to learn using the CBRNN directly on the task 2 data of DCASE 2017.
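The fusion step described above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: both representations are taken to be already aligned at the same frame rate, the feature sizes (64 for the SED path, 256 for the ASR path) and the random weights are hypothetical, and the freezing of the ASR path is indicated only in comments, since a real implementation would exclude those parameters from the optimizer in an automatic differentiation framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradual_forward(sed_repr, asr_repr, W, b):
    """Concatenate SED and ASR representations per frame and classify.

    sed_repr: (T, D_sed) output of the trainable CBRNN path.
    asr_repr: (T, D_asr) pre-final layer output of the frozen ASR model
              (treated as a constant input; it would receive no gradient updates).
    W, b:     parameters of the final fully connected layer.
    """
    fused = np.concatenate([sed_repr, asr_repr], axis=1)  # (T, D_sed + D_asr)
    return sigmoid(fused @ W + b).ravel()                 # per-frame event probability

rng = np.random.default_rng(0)
T, d_sed, d_asr = 300, 64, 256            # hypothetical sizes
sed_repr = rng.normal(size=(T, d_sed))
asr_repr = rng.normal(size=(T, d_asr))
W = rng.normal(size=(d_sed + d_asr, 1)) * 0.05
b = np.zeros(1)
probs = gradual_forward(sed_repr, asr_repr, W, b)
print(probs.shape)  # (300,)
```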
2.4. Post-processing
To reduce the influence of outliers and improve the robustness of the binary prediction given for each frame, post-processing steps are necessary [15]. We use dynamic thresholding to filter the predicted values, which mitigates the problems caused by the imbalance between positive and negative samples. The dynamic threshold is given by:
$$T_i = T_{base} + \beta S_i \quad (8)$$
where $T_i$ is the dynamic threshold value, $T_{base}$ is the static threshold value, $S_i$ is the average value of the classifier output for each audio clip, and $\beta$ is the ratio applied to $S_i$.
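The dynamic threshold of Eq. (8) can be sketched in a few lines of Python. The values of the base threshold and of β below are illustrative assumptions, since the text does not state the ones used in the experiments.

```python
import numpy as np

def dynamic_threshold(probs, t_base, beta):
    """Binarize frame probabilities with T_i = T_base + beta * S_i.

    S_i is the mean classifier output over the whole clip.
    """
    s_i = float(np.mean(probs))
    t_i = t_base + beta * s_i
    return (probs > t_i).astype(int), t_i

# Toy per-frame probabilities for one clip (made-up values).
probs = np.array([0.1, 0.2, 0.9, 0.8, 0.1])
binary, t_i = dynamic_threshold(probs, t_base=0.3, beta=0.5)
print(binary.tolist(), round(t_i, 2))  # [0, 0, 1, 1, 0] 0.51
```

A clip whose outputs are high on average gets a higher threshold, so the same frame probability is less likely to be counted as positive.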
After dynamic thresholding, the output is post-processed with a median filter of length 300 ms. At prediction time, several spurious positive values can appear before the onset and after the offset of the event. Since at most one event appears in each 30 s long audio clip, the prediction should form one continuous sequence. Under this assumption, we apply a post-processing step that merges discontinuous sequences into one continuous sequence and select the longest continuous sequence of positive predictions to obtain the onset and offset of the target event.
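The median filtering and longest-sequence selection described above can be sketched as follows. With 100 ms frames, a 300 ms median filter corresponds to a kernel of 3 frames; the edge padding and the function names are assumptions, and the input sequence is a made-up example.

```python
import numpy as np

def median_filter(binary, kernel=3):
    """Median-filter a binary sequence; edges are padded by repetition."""
    pad = kernel // 2
    padded = np.pad(binary, pad, mode="edge")
    windows = np.stack([padded[i:i + len(binary)] for i in range(kernel)])
    return np.median(windows, axis=0).astype(int)

def longest_positive_run(binary):
    """Return (onset, offset) frame indices of the longest run of ones.

    The offset is exclusive; None is returned when there is no positive frame.
    """
    best, start = None, None
    for i, v in enumerate(list(binary) + [0]):  # trailing 0 closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

frames = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0])
smoothed = median_filter(frames)
run = longest_positive_run(smoothed)
print(smoothed.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
print(run)                # (4, 10)
```

Multiplying the frame indices by the 100 ms frame length gives the onset and offset times, here 0.4 s and 1.0 s.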
2.5. Evaluation Metric
The system is evaluated using event-based metrics [18]; we consider the event-based F-score and error rate. The F-score is the harmonic mean of the precision (P) and the recall (R), and the error rate is the total number of insertions (I), deletions (D), and substitutions (S) relative to the number of reference events (N). An onset is considered correctly detected only when it is predicted within 500 ms of the actual onset time. We compute these metrics with the sed_eval toolbox provided by the DCASE organizers. The F-score (F) and the error rate (ER) are mathematically defined as
$$F = \frac{2PR}{P + R}, \qquad ER = \frac{S + D + I}{N}$$
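As a rough illustration of these two metrics, the sketch below computes the F-score and the error rate from event counts. It is not the sed_eval implementation: the function name is hypothetical, the counts in the example are made up, and the matching of predicted events to reference events within the 500 ms onset tolerance is assumed to have been done beforehand.

```python
def event_based_scores(n_tp, n_ins, n_del, n_sub=0):
    """F-score and error rate from counts of correct, inserted, deleted, and substituted events."""
    n_ref = n_tp + n_del + n_sub   # number of reference events, N
    n_sys = n_tp + n_ins + n_sub   # number of events output by the system
    precision = n_tp / n_sys if n_sys else 0.0
    recall = n_tp / n_ref if n_ref else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    error_rate = (n_ins + n_del + n_sub) / n_ref if n_ref else 0.0
    return f_score, error_rate

# Made-up counts: 95 correct detections, 2 insertions, 5 deletions.
f_score, error_rate = event_based_scores(n_tp=95, n_ins=2, n_del=5)
print(round(f_score, 4), round(error_rate, 2))  # 0.9645 0.07
```

Note that the error rate is a ratio rather than a percentage and can exceed 1 when the system produces many insertions.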
Table 2: Median duration of each event and its frequently occurring phonemes.
Event       Median  Phonemes
babycry     2.33s   SPN_S, EH1_S, M_E, OW1_E, HH_B, N_E, OW1_S, AH1_I, L_B, N_B
glassbreak  1.29s   SPN_S, IH1_I, AH0_I, IY1_E, L_B, L_E, M_B, N_E, T_I, AE1_B
gunshot     1.21s   SPN_S, AH1_I, D_E, EH1_I, EH1_S, N_B, W_B, Z_E
3. Experiment Results and Analysis
The Adam stochastic gradient-based optimizer is used in all experiments with a learning rate of 0.0001. The systems are trained by back-propagation with the cross-entropy loss function. The network is trained for a maximum of 100 epochs, and a decay factor of 0.01 is set for the learning rate. For the CNN part, four convolution layers with 64 channels are used, with kernel sizes of (7,7), (5,5), (3,3), and (3,3) and strides of (2,1), (1,1), (1,1), and (1,1), respectively. Three max pooling layers are used after the first, third, and last convolution layers, with kernel sizes of (4,2), (4,1), and (4,1) and a stride of (1,1) for each layer; the dropout probability is set to 0.25. We also add two residual connections [19] to improve the performance of the CNN. The first residual connection links the output of the first max-pooling layer to the third convolution layer, and the second links the second max-pooling layer to the fourth convolution layer. The GRU has 2 layers with 32 hidden units each; the number of hidden units is $N_t$ = 200 for temporal attention and $N_s$ = 32 for spectral attention. For both spectral and temporal attention, we use the softmax activation function.
Further, we take the pre-final layer features from the ASR model and combine them with the SED features during the training of the CBRNN model. The average performance of the proposed gradual network exceeds that of the other models, achieving a 97.0% F-score and a 0.05 error rate. For the “babycry” and “gunshot” target events, the F-score improves by 6.7 and 4.6 percentage points and the error rate decreases by 0.14 and 0.11, respectively, relative to the CRNN+TA+RA model [8] in Table 1. The “glassbreak” detection model does not outperform the other state-of-the-art models, but it obtains comparable performance. The performance of the state-of-the-art models is shown in Table 1 in terms of F-score (F) and error rate (ER). We additionally carried out experiments with a fine-tuning network [20] that fine-tunes the ASR features. In this network, the pre-final layer output of the pre-trained ASR model is fine-tuned to obtain the event prediction for each frame. A fully connected layer with the softmax activation function normalizes the prediction probabilities to [0,1]. This model by itself already obtains competitive performance without training on the SED features (88.6%, 83.5%, and 79.0% F-score for the “babycry”, “glassbreak”, and “gunshot” events, respectively). From these results, we observe that the ASR phoneme-level acoustic features alone may not be enough to surpass the discriminatory power of the SED features; rather, the complementary nature of the SED and ASR features is demonstrated in our detection rates.
Figure 2: Frequency of the phonemes present in each sound event, with their probabilities, showing the important phonemes that contribute most to the prediction of the target events.
We further analyze which phonemes contribute most to the model's prediction of these acoustic event classes. We feed our target audio data to the ASR model to obtain the most likely phonemes and their alignment durations. Figure 2 shows that the babycry event has a higher ratio of specific phoneme occurrences than the other events. Table 2 shows the median duration of each event in the training dataset together with the frequently occurring phonemes in that event. It also shows that the babycry event is longer and contains more predicted phonemes than the other events.
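The kind of phoneme frequency analysis described above can be sketched with the Python standard library. The alignment list below is a made-up toy example, not data from the experiments, and the function name is hypothetical; in practice the labels would come from the ASR model's alignment of each event clip.

```python
from collections import Counter

def phoneme_distribution(aligned_phonemes):
    """Return (phoneme, relative frequency) pairs, most frequent first."""
    counts = Counter(aligned_phonemes)
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common()]

# Toy alignment for one event class (hypothetical labels in the style of Table 2).
toy_alignment = ["SPN_S", "SPN_S", "EH1_S", "SPN_S", "OW1_E", "EH1_S"]
dist = phoneme_distribution(toy_alignment)
for phoneme, prob in dist:
    print(f"{phoneme}: {prob:.2f}")
```

Comparing such distributions across event classes shows which phoneme labels the ASR model assigns most often to each kind of sound.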
Past research [21] suggests that the baby cry is one of the first speech manifestations and is produced by movements of the larynx and the oral cavity, which can be seen as a precursor to phonemic production. Because baby cry production is more similar to human speech production, babycry audio contains more phoneme-related information; this may explain why we observe better detection performance on events like babycry when combining with the ASR feature representations. Events like gunshot and glassbreak are shorter, and their sound production mechanisms differ from those of human speech, which may lead to a high occurrence of silence (SIL) and unknown (SPN_S) phonemes.
4. Conclusions
In this work, we proposed a gradual network architecture that takes advantage of the pre-final layer ASR representations for the SED task. The proposed model is tested on task 2 of the DCASE 2017 data. Our experimental results show that our proposed system, which uses representations trained on speech, can provide useful information for predicting sound events, and that ASR-based phoneme-level acoustic representations are indeed beneficial for detecting different rare sound events, especially those that resemble human speech. In future research, we will investigate which speech patterns provide the needed contribution to improve system performance, and which particular sequences of phonemes contribute to sound event detection for a variety of sound classes. Additionally, the ASR system can be fine-tuned in parallel with the event detection by updating the layers of the ASR network; moreover, we would extend this framework to handle joint sound event detection and acoustic scene classification.
5. References
[1] D. Zhang and D. Ellis, “Detecting sound events in basketball video archive,” Dept. Electronic Eng., Columbia Univ., New York,
[2] J. Schroeder, S. Wabnik, P. W. Van Hengel, and S. Goetze, “De-
tection and classification of acoustic events for in-home care,” in
Ambient assisted living. Springer, 2011, pp. 181–195.
[3] H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio
event recognition with 1-max pooling convolutional neural net-
works,” arXiv preprint arXiv:1604.06338, 2016.
[4] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neu-
ral networks for polyphonic sound event detection in real life
recordings,” in 2016 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp.
[5] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and
K. Takeda, “Bidirectional lstm-hmm hybrid system for poly-
phonic sound event detection,” in Proceedings of the Detection
and Classification of Acoustic Scenes and Events 2016 Workshop
(DCASE2016), 2016, pp. 35–39.
[6] Y. Liu, J. Tang, Y. Song, and L. Dai, “A capsule based approach for polyphonic sound event detection,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1853–1857.
[7] E. Cakır and T. Virtanen, “Convolutional recurrent neural networks for rare sound event detection,” Deep Neural Networks for Sound Event Detection, vol. 12, 2019.
[8] Y.-H. Shen, K.-X. He, and W.-Q. Zhang, “Learning how to lis-
ten: A temporal-frequential attention model for sound event de-
tection,” arXiv preprint arXiv:1810.11939, 2018.
[9] S. Jung, J. Park, and S. Lee, “Polyphonic sound event detection
using convolutional bidirectional lstm and synthetic data-based
transfer learning,” in ICASSP 2019-2019 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2019, pp. 885–889.
[10] H. Lim, M. J. Kim, and H. Kim, “Cross-acoustic transfer learning
for sound event classification,” in 2016 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2016, pp. 2504–2508.
[11] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi,
and S. Khudanpur, “Semi-orthogonal low-rank matrix factoriza-
tion for deep neural networks.” in Interspeech, 2018, pp. 3743–
[12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-
rispeech: an asr corpus based on public domain audio books,”
in 2015 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[13] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132.
[14] H. Lim, J. Park, and Y. Han, “Rare sound event detection using 1D convolutional recurrent neural networks,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 80–84.
[15] J. Wang and S. Li, “Multi-frame concatenation for detection of rare sound events based on deep neural network,” in no. November, 2017.
[16] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[18] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162,
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 770–778.
[20] E. Lakomkin, C. Weber, S. Magg, and S. Wermter, “Reusing
neural speech representations for auditory emotion recognition,”
arXiv preprint arXiv:1803.11508, 2018.
[21] O. F. Reyes-Galaviz, S. D. Cano-Ortiz, and C. A. Reyes-García, “Evolutionary-neural system to classify infant cry units for pathologies identification in recently born babies,” IEEE, pp. 330–335, 2008.