Fig 2 - uploaded by Huy Phan
Source publication
In this work, we propose an approach to Acoustic Scene Classification (ASC) that combines deep feature embedding learning with hierarchical classification using a triplet loss function. On the one hand, a deep convolutional neural network is first trained to learn a feature embedding from scene audio signals. Via the trained convolutional neural net...
Contexts in source publication
Context 1
... processing pipeline for deep feature embedding learning using a deep CNN is illustrated in Fig. 2. Each acoustic scene signal is first transformed into a time-frequency image, such as a Gammatone spectrogram with 128 Gammatone filters [11]. The time-frequency image is then decomposed into non-overlapping image patches of size 128 × 128. Let X and y denote an image patch and its one-hot encoded label, respectively. Mixup data ...
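The patch decomposition step described in this context can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_patches`, the use of NumPy, the random stand-in for a real Gammatone spectrogram, and the choice to drop trailing frames that do not fill a whole patch are all assumptions.

```python
import numpy as np


def split_into_patches(spec, patch_width=128):
    """Split a (n_bands, n_frames) time-frequency image into
    non-overlapping patches of shape (n_bands, patch_width).

    Assumption: trailing frames that do not fill a complete patch
    are dropped; the source does not say how the remainder is handled.
    """
    n_bands, n_frames = spec.shape
    n_patches = n_frames // patch_width
    trimmed = spec[:, : n_patches * patch_width]
    # (n_bands, n_patches, patch_width) -> (n_patches, n_bands, patch_width)
    return trimmed.reshape(n_bands, n_patches, patch_width).transpose(1, 0, 2)


# Hypothetical input: 128 frequency bands (one per Gammatone filter)
# and 300 time frames, filled with random values for illustration.
rng = np.random.default_rng(0)
spec = rng.random((128, 300))

patches = split_into_patches(spec)
print(patches.shape)
```

With 300 frames and a patch width of 128, two complete 128 × 128 patches are produced and the last 44 frames are discarded under the stated assumption.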
Context 2
... clarity, in Fig. 2 and Table I, we intentionally separate the deep CNN into two parts: the CNN part for feature learning ...
Similar publications
Music genre classification is one of the trending topics in current Music Information Retrieval (MIR) research. Since genre does not depend on the audio profile alone, we also make use of textual content provided as lyrics of the corresponding song. We implemented a CNN based feature extractor for spectrograms in ord...
Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance,...
An important part of the human-computer interaction process is speech emotion recognition (SER), which has been receiving more attention in recent years. However, although a wide diversity of methods has been proposed in SER, these approaches still cannot improve the performance. A key issue in the low performance of the SER system is how to effect...
Road traffic monitoring is very important for intelligent transportation. The detection of traffic state based on acoustic information is a new research direction. A vehicle acoustic event classification algorithm based on a sparse autoencoder is proposed to analyze the traffic state. Firstly, the multidimensional Mel-cepstrum features and energy f...