Fig. 5, available from: EURASIP Journal on Audio, Speech, and Music Processing
Convolutional layers with (a) a fixed-size kernel and (b) a Mel-scale kernel

Source publication
Article
Full-text available
We propose a new method for music detection in broadcast content using a convolutional neural network with a Mel-scale kernel. In this detection task, music segments should be annotated in the broadcast data, where music, speech, and noise are mixed. The convolutional neural network is composed of a convolutional layer with a kernel that i...

Contexts in source publication

Context 1
... also implemented a two-dimensional kernel, composed of a two-dimensional convolution layer, to learn more optimized filters. Figure 5 shows the basic convolution layer with (a) a fixed kernel and (b) a convolution layer with the Mel-scale kernel. Figure 5a shows that the temporal and frequency dimensions of the kernel are fixed at all times. ...
Context 2
... Figure 5 shows the basic convolution layer with (a) a fixed kernel and (b) a convolution layer with the Mel-scale kernel. Figure 5a shows that the temporal and frequency dimensions of the kernel are fixed at all times. However, Fig. 5b shows that the frequency dimension of the kernel is large in the high-frequency region and small in the low-frequency region. ...
Context 3
... also implemented a two-dimensional kernel, composed of a two-dimensional convolution layer, to learn more optimized filters. Figure 5 shows the basic convolution layer with (a) a fixed kernel and (b) a convolution layer with the Mel-scale kernel. Figure 5a shows that the temporal and frequency dimensions of the kernel are fixed at all times. However, Fig. 5b shows that the frequency dimension of the kernel is large in the high-frequency region and small in the low-frequency region. We initialize the kernel weights of the melCL with the weights of the Mel-scale filter bank to obtain a stable learning process and improved performance. The melCL has a temporal kernel dimension of 5 ...
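To make the melCL description above more concrete, here is a minimal PyTorch sketch, not the authors' implementation: it approximates the idea by initializing a 2-D convolution over a linear-frequency spectrogram with triangular Mel filter-bank weights (obtained from librosa) and a temporal kernel width of 5 frames. The sampling rate, FFT size, number of Mel bands, and the even spreading of each filter over the 5 temporal taps are all assumptions.

```python
import torch
import torch.nn as nn
import librosa

# Assumed settings (not from the paper): 16 kHz audio, 1024-point FFT, 40 Mel bands.
n_fft, sr, n_mels, t_width = 1024, 16000, 40, 5
n_freq = n_fft // 2 + 1

# Triangular Mel filter bank, shape (n_mels, n_freq).
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

# Convolution over a (batch, 1, n_freq, n_frames) spectrogram: each kernel spans
# the whole frequency axis and 5 time frames, one output channel per Mel band.
mel_conv = nn.Conv2d(in_channels=1, out_channels=n_mels,
                     kernel_size=(n_freq, t_width), padding=(0, t_width // 2))

with torch.no_grad():
    w = torch.from_numpy(mel_fb).float()            # (n_mels, n_freq)
    # Spread each Mel filter evenly over the 5 temporal taps so the initial
    # response equals the filter-bank output averaged over a 5-frame window.
    w = w[:, None, :, None].repeat(1, 1, 1, t_width) / t_width
    mel_conv.weight.copy_(w)                        # weight shape: (n_mels, 1, n_freq, t_width)
    mel_conv.bias.zero_()

spec = torch.rand(8, 1, n_freq, 100)                # dummy magnitude-spectrogram batch
mel_features = mel_conv(spec)                       # (8, n_mels, 1, 100)
```

Because the weights start as Mel filters but remain trainable, each output channel initially attends to one Mel band, wide in the high-frequency region and narrow in the low-frequency region, and can then adapt during training.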

Similar publications

Chapter
Full-text available
PCANet is a simple deep learning baseline for image classification, which learns the filter banks in each layer by PCA instead of stochastic gradient descent (SGD). It shows good performance on image classification tasks with only a few parameters and no backpropagation procedure. However, PCANet suffers from two main problems. The first proble...

Citations

... In [24][25][26], the main objective was to apply Mel-spectrograms and a Mel-scale kernel to the detection of elements such as speech, a singer's voice, or music in noisy audio recordings. Based on these studies, Mel-spectrograms appear to be an effective approach to feature extraction and serve well as input to classification models based on convolutional neural networks. ...
... Additionally, the architecture of each model was individually adjusted by adding several Flatten and Dense layers. Each model was then trained for 50 epochs, a setting chosen based on similar studies [20,23,24]; however, an EarlyStopping callback was implemented to prevent overfitting. ...
... Mel-scaled spectrograms [17,29,33] are produced by mapping the frequencies of a standard spectrogram to the Mel scale: the spectrogram of an audio signal is computed and a set of Mel-scale filters is then applied to the spectral data. These filters are overlapping triangular filters that capture the energy of the signal within specific Mel-scale frequency bands. ...
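As a concrete illustration of that pipeline, the following short sketch uses librosa; the example clip, FFT size, hop length, and number of Mel bands are arbitrary choices, not parameters from any of the cited papers.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))           # example clip bundled with librosa
n_fft, hop = 1024, 256

stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
power_spec = np.abs(stft) ** 2                         # (1 + n_fft/2, n_frames)

# Overlapping triangular Mel filters, shape (64, 1 + n_fft/2).
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=64)
mel_spec = mel_fb @ power_spec                         # energy per Mel band and frame
log_mel = librosa.power_to_db(mel_spec)                # log compression, typical CNN input
```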
Preprint
Full-text available
Convolutional neural networks (CNNs) are widely used in computer vision. They can be used not only on conventional digital image material to recognize patterns, but also for feature extraction from spectral and rhythm representations of time-domain digital audio signals for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams, and chroma energy normalized statistics (CENS) chromagrams are investigated in terms of audio classification performance using a deep convolutional neural network. It can be clearly shown that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCCs) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs. The experiments were carried out with the aid of the ESC-50 dataset with 2,000 labeled environmental audio recordings.
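For orientation, all of the representations listed in that abstract can be computed with librosa. The sketch below is a generic illustration rather than the preprint's setup; the file name and parameters are assumptions, and a cyclic tempogram would be derived from the plain tempogram shown here.

```python
import librosa

# Hypothetical input file; replace with any audio clip.
y, sr = librosa.load("clip.wav", sr=22050)

features = {
    "mel_spectrogram": librosa.feature.melspectrogram(y=y, sr=sr),
    "mfcc":            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
    "tempogram":       librosa.feature.tempogram(y=y, sr=sr),
    "chroma_stft":     librosa.feature.chroma_stft(y=y, sr=sr),
    "chroma_cqt":      librosa.feature.chroma_cqt(y=y, sr=sr),
    "chroma_cens":     librosa.feature.chroma_cens(y=y, sr=sr),
}
for name, feat in features.items():
    print(name, feat.shape)                            # each is a 2-D (bins x frames) array
```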
... For the SVM model to receive meaningful input from raw audio data, effective feature extraction is necessary. The extracted features, which are primarily MFCCs, are used to train the SVM model for classification (Rajadnya & Joshi, 2021; Jang et al., 2019). The RBF kernel is employed for multiclass classification. ...
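A minimal, self-contained sketch of such a pipeline with librosa and scikit-learn is shown below. It is an assumed setup, not the cited system: synthetic sine tones stand in for real recordings, clip-level MFCC statistics are the features, and an RBF-kernel SVM does the classification.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

SR = 22050
NOTES = {"C4": 261.63, "E4": 329.63, "G4": 392.00}     # toy note set

def tone(freq, dur=1.0, sr=SR):
    t = np.linspace(0, dur, int(sr * dur), endpoint=False)
    return 0.5 * np.sin(2 * np.pi * freq * t)

def mfcc_stats(y, sr=SR, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # (n_mfcc, n_frames)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])    # clip-level summary

rng = np.random.default_rng(0)
X, y = [], []
for note, f0 in NOTES.items():
    for _ in range(20):                                       # 20 noisy examples per note
        clip = tone(f0) + 0.01 * rng.standard_normal(SR)
        X.append(mfcc_stats(clip))
        y.append(note)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(np.stack(X), np.array(y))
print(clf.predict([mfcc_stats(tone(329.63))]))                # expected: ['E4']
```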
Article
Full-text available
Music is the universal language that profoundly unites cultures and generations all over the world. The ability to recognize notes in music is essential in many fields, including composing, teaching music, and digital audio processing. It is essential for both music education and composition to be able to accurately identify and classify musical notes and chords. Using Mel-frequency Cepstral Coefficients (MFCC) and a Support Vector Machine (SVM), we present in this study a novel Music Note Recognition System that identifies the notes present in an input audio file. To meet the needs of musicians, non-musicians, and audio enthusiasts alike, the system attempts to accurately identify musical notes from audio samples. The experiments show that the recommended method is effective. A comparative analysis shows the advantages of SVM classification and MFCC-based feature extraction over existing techniques, resulting in better performance. Beginners can use this system to learn and practice musical instruments, while musicians can use it to compose, transcribe, and analyze musical pieces. Furthermore, the system's durability and adaptability make it a good fit for integration with recent advancements in audio processing and music production software.
... Currently, the segmentation is mainly performed with neural networks and supervised learning. While each task has generally been solved independently as a binary frame-wise classification task (SAD [12], OSD [13,14], MD [15,16]), more recent approaches propose to solve multiple tasks simultaneously. A multiclass model predicts a single class per frame, so the class intersections are empty. ...
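The distinction drawn here between the two formulations can be made concrete with a small PyTorch sketch; the class set, embedding size, and 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_frames, emb_dim, n_classes = 100, 128, 4           # e.g. speech, music, noise, overlap
frame_embeddings = torch.randn(1, n_frames, emb_dim)

multiclass_head = nn.Linear(emb_dim, n_classes)
multilabel_head = nn.Linear(emb_dim, n_classes)

# Multiclass: exactly one label per frame, so class intersections are empty.
class_probs = torch.softmax(multiclass_head(frame_embeddings), dim=-1)
hard_labels = class_probs.argmax(dim=-1)              # (1, n_frames)

# Multilabel: each class has its own detector; several can be active at once.
label_probs = torch.sigmoid(multilabel_head(frame_embeddings))
active = label_probs > 0.5                            # (1, n_frames, n_classes) boolean mask
```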
Conference Paper
Full-text available
Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is a vital ingredient for transparent decision-making with machine learning. In this paper, we propose an explainable multilabel segmentation model that solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD) simultaneously. The proposed proxy model uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to the pre-trained black-box model while exhibiting strong explainability features. Specifically, the frequency bins used for the decision can easily be identified at both the segment level (local explanations) and the global level (class prototypes).
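To illustrate only the general NMF idea behind that mapping (the paper's proxy operates on learned embeddings rather than on a raw spectrogram), here is a hedged sketch that factors a power spectrogram into frequency templates and temporal activations with scikit-learn.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024)) ** 2           # non-negative (n_freq, n_frames)

nmf = NMF(n_components=16, init="nndsvda", max_iter=400)
H_act = nmf.fit_transform(S.T)                         # (n_frames, 16) activations over time
W_freq = nmf.components_                               # (16, n_freq) frequency templates

# Each component's template shows which frequency bins it relies on,
# which is what makes frequency-level explanations possible.
dominant_bins = W_freq.argmax(axis=1)
```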
... As we can see from Fig. 13, the non-linear scales emphasize frequencies in the range of 0 Hz to 4,000 Hz, with higher emphasis given to frequencies in the range of 0 to 2,000 Hz. This range includes most of the important frequency components of human speech [54]. As a result, the network was effectively given an adequate amount of detailed information about the lower or higher range of the frequency spectrum, depending on how the various frequency scales were applied. ...
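That emphasis can be sanity-checked numerically with the common HTK Mel formula, mel(f) = 2595 log10(1 + f/700); the 8 kHz upper limit and 40-band layout below are generic assumptions, not the cited paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 40 band edges spaced evenly on the Mel scale between 0 and 8 kHz, mapped back to Hz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40))
print(np.sum(edges_hz < 2000.0), np.sum(edges_hz < 4000.0))
# Roughly half of the edges fall below 2 kHz and about three quarters below 4 kHz,
# matching the emphasis on the lower part of the spectrum described above.
```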
Article
Full-text available
The present study focuses on the evaluation of the degradation of emotional expression in speech transmitted over a wireless telephone network. Two assessment approaches emerged: an objective one, deploying convolutional neural networks (CNNs) fed with spectrograms across three scales (Linear, Logarithmic, Mel), and a subjective method grounded in human perception. The study gathered expressive phrases in two different languages, from novice Arabic and proficient German speakers. These utterances were transmitted over a real 4G network, which is a rarity, as the usual focus lies on bandwidth (BW) reduction or compression. Our innovation lies in utilizing the complete 4G infrastructure, accounting for all possible impairments. The results obtained indeed reveal a significant impact of transmission via the real 4G network on emotion recognition. Prior to transmission, the highest recognition rates, measured by the objective method using the Mel frequency scale, were 76% for Arabic and 91% for German. After transmission, these rates decreased significantly, reaching 70% for Arabic and 82% for German (degradations of 6% and 9%, respectively). As for the subjective method, the recognition rates were 75% for Arabic and 70% for German before transmission, and dropped to 67% for Arabic and 68% for German after transmission (degradations of 8% and 2%). Our results were also compared to those found in the literature that used the same database.
... These studies have introduced methods that enhance the discrimination process by proposing a new machine-learning algorithm rather than a novel feature extraction method. Among them, [18,26,32] applied convolutional neural networks (CNNs) to music detection from broadcast content and to speech/music discrimination, respectively. Also, in [17], the method proposed for the classification stage is based on a recurrent neural network (RNN) and achieves higher efficiency and lower error. ...
Article
Full-text available
Multimedia data have increased dramatically today, making the distinction between desirable information and other types of information extremely important. Speech/music discrimination is a field of audio analytics that aims to detect and classify speech and music segments in an audio file. This paper proposes a novel feature extraction method called Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR). The proposed feature computes the average frequency-domain mean-crossing rate along the frequency axis for each of the perceptual Mel-scaled frequency bands of the signal power spectrum. In this paper, the class-separation capability of this feature is first measured by well-known divergence criteria such as Maximum Fisher Discriminant Ratio (MFDR), Bhattacharyya divergence, and Jeffreys/Symmetric Kullback-Leibler (SKL) divergence. The proposed feature is then applied to the speech/music discrimination (SMD) process on two well-known speech/music datasets, GTZAN and S&S (Scheirer and Slaney). The results obtained on the two datasets using conventional classifiers, including k-NN, GMM, and SVM, as well as deep learning-based classification methods, including CNN, LSTM, and BiLSTM, show that the proposed feature outperforms other features in speech/music discrimination.
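For intuition only, the following rough sketch shows one plausible reading of the FDMCR description in that abstract: within each Mel-spaced band of the power spectrum, count how often the spectrum crosses the band mean along the frequency axis, then average over frames. The exact definition, band layout, and normalization belong to the cited paper, so everything here is an assumption.

```python
import numpy as np
import librosa

def fdmcr_like(y, sr, n_fft=1024, hop=256, n_bands=8):
    """One value per Mel-spaced band: average mean-crossing rate along frequency."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2   # (n_freq, n_frames)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    edges_hz = librosa.mel_frequencies(n_mels=n_bands + 1, fmin=0.0, fmax=sr / 2)
    edges = np.searchsorted(freqs, edges_hz)
    rates = []
    for b in range(n_bands):
        band = S[edges[b]:max(edges[b + 1], edges[b] + 2), :]       # bins of this band
        centered = band - band.mean(axis=0, keepdims=True)          # remove per-frame band mean
        crossings = np.sum(np.diff(np.sign(centered), axis=0) != 0, axis=0)
        rates.append(np.mean(crossings / (band.shape[0] - 1)))      # long-term average
    return np.array(rates)

y, sr = librosa.load(librosa.ex("trumpet"))
print(fdmcr_like(y, sr))
```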
... A fully convolutional architecture is presented in [49] for simultaneous speech and music segmentation. A CNN model featuring kernels based on the Mel scale is used for the music detection task in [50]. Similarly, the work described in [51] separates speech and music in audio streams using a CNN trained in a semi-supervised way. ...
Thesis
Full-text available
Advances in technology over the last decade have reshaped the way people interact with multimedia content. This has driven a significant rise in both the generation and consumption of such data in recent years. Manual analysis and annotation of this information are unfeasible given the current volume, revealing the need for automatic tools that can help advance from manual working pipelines to assisted or partially automatic practices. Over the last few years, most of these tools for multimedia information retrieval have been based on the deep learning paradigm. In this context, the work presented in this thesis focuses on the audio information retrieval domain. In particular, this dissertation studies the audio segmentation task, whose main goal is to provide a sequence of labels that isolates different regions in an input audio signal according to the characteristics described in a predefined set of classes, e.g., speech, music, or noise. This study has mainly focused on two important topics: data availability and generalisation. For the first, part of the work presented in this thesis has investigated ways to improve the performance of audio segmentation systems even when the training datasets are limited in size. Concerning generalisation, some of the experiments aimed to train robust audio segmentation models that can work under different domain conditions. Research efforts presented in this thesis have been centred around three main areas: speech activity detection in challenging environments, multiclass audio segmentation, and AUC optimisation for audio segmentation.
... For instance, the Mel scale [38,39] would be a good alternative for obtaining a perceptual scale of pitches for the events studied. Further applications can be found in [40] and [41] ...
Preprint
Full-text available
We present SpectroMap, an open-source GitHub repository for audio fingerprinting written in the Python programming language. It is composed of a peak search algorithm that extracts topological prominences from a spectrogram via time-frequency bands. In this paper, we introduce the algorithm's operation through two experimental applications, on a high-quality urban sound dataset and on environmental audio recordings, to describe how it works and how effective it is in handling the input data.
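As a generic illustration of spectrogram peak picking for fingerprinting (SpectroMap's actual prominence-based algorithm lives in its GitHub repository, and the neighbourhood size and amplitude floor below are assumptions):

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def spectrogram_peaks(y, sr, neighbourhood=(15, 15), floor_db=-40.0):
    """Return (time, frequency) landmarks that dominate a local neighbourhood."""
    f, t, S = spectrogram(y, fs=sr, nperseg=1024, noverlap=768)
    S_db = 10.0 * np.log10(S + 1e-12)
    local_max = maximum_filter(S_db, size=neighbourhood) == S_db
    peaks = local_max & (S_db > S_db.max() + floor_db)   # drop weak local maxima
    freq_idx, time_idx = np.nonzero(peaks)
    return list(zip(t[time_idx], f[freq_idx]))

# Tiny usage example on a synthetic chirp.
sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
chirp = np.sin(2 * np.pi * (220 + 400 * t) * t)
print(len(spectrogram_peaks(chirp, sr)))
```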
... The MIREX challenge has four tasks: Music Detection, Speech Detection, Music & Speech Detection, and Music Relative Loudness Estimation [19]. A CNN model with a Mel-scale kernel was leveraged for music detection in broadcast audio [20]. The Mel-scale kernel in the CNN layers helped in learning robust features, with the Mel scale dictating the kernel size. ...
... This model was trained on 52 hours of mixed broadcast data containing approximately 50% music; 24 hours of broadcast audio with a music ratio of 50-76% was used as a test set. This test set included genres representative of broadcast audio in English, Spanish, and Korean [20]. ...
... Similarly, Doukhan et al. [18] presented an open-source speech and music segmentation system based on the log Mel spectrogram and a CNN architecture. Jang et al. used a trainable Mel-kernel to extract the features for SMAD [35]. Multiple works explore the impact of different model architectures. ...
Article
Full-text available
Automatic speech and music activity detection (SMAD) is an enabling task that can help segment, index, and pre-process audio content in radio broadcast and TV programs. However, due to copyright concerns and the cost of manual annotation, the limited availability of diverse and sizeable datasets hinders the progress of state-of-the-art (SOTA) data-driven approaches. We address this challenge by presenting a large-scale dataset containing Mel spectrogram, VGGish, and MFCC features extracted from around 1600 h of professionally produced audio tracks, together with their corresponding noisy labels indicating the approximate location of speech and music segments. The labels are derived from several sources, such as subtitles and cue sheets. A test set curated by human annotators is also included as a subset for evaluation. To validate the generalizability of the proposed dataset, we conduct several experiments comparing various model architectures and their variants under different conditions. The results suggest that our proposed dataset is able to serve as a reliable training resource and leads to SOTA performances on various public datasets. To the best of our knowledge, this dataset is the first large-scale, open-source dataset that contains features extracted from professionally produced audio tracks and their corresponding frame-level speech and music annotations.