Source publication
In singing, the perceptual term "voice quality" is used to describe expressed emotions and singing styles. In voice physiology research, specific voice qualities are discussed under the term phonation modes and are related directly to the voicing produced by the vocal folds. The control and awareness of phonation modes are vital for professional singer...
Context in source publication
Context 1
... most striking aspect in Table 4 is the similar performance of the 40-dimensional MPS_peaks feature set and the 8-dimensional MPS_sum and Ceps_peaks feature sets. In order to gain more information on the misclassifications, confusion matrices are given in Table 5 and Table 6 for one proposed feature set (MPS_sum) and one reference feature set (VQ features). The confusion matrices clearly show that there exists greater confusion between pressed and modal, and between breathy and modal phonation modes in the VQ feature set, compared to the proposed MPS_sum feature set. ...
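To make this kind of comparison concrete, here is a minimal sketch of how per-feature-set confusion matrices could be tabulated with scikit-learn. The class labels match the phonation modes discussed above, but the predictions are hypothetical placeholders, not the study's results.

```python
# Minimal sketch: comparing confusion matrices for two feature sets,
# assuming predictions from already-trained classifiers are available.
from sklearn.metrics import confusion_matrix

CLASSES = ["breathy", "modal", "pressed"]  # phonation modes discussed above

def report_confusion(y_true, y_pred, name):
    """Print a labelled confusion matrix for one feature set."""
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    print(f"\n{name}")
    print("true/pred " + " ".join(f"{c:>8}" for c in CLASSES))
    for label, row in zip(CLASSES, cm):
        print(f"{label:>9} " + " ".join(f"{n:>8d}" for n in row))

# Hypothetical labels/predictions standing in for the MPS_sum and VQ runs.
y_true       = ["breathy", "modal", "pressed", "modal", "pressed", "breathy"]
pred_mps_sum = ["breathy", "modal", "pressed", "modal", "pressed", "breathy"]
pred_vq      = ["modal",   "modal", "modal",   "modal", "pressed", "breathy"]

report_confusion(y_true, pred_mps_sum, "MPS_sum features")
report_confusion(y_true, pred_vq, "VQ features")
```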
Citations
... Large-scale audio datasets like VGGSound and FSD50K provide valuable resources for audio classification studies 18,20, but they lack sufficient data to represent rare vocal types in the field of Bel canto. Existing professional vocal datasets are also inconsistent in labeling important points such as pitch, timbre, and technique, making them inefficient to use for quantitative analysis 21. ...
Vocal education in the music field is difficult to quantify due to individual differences in singers' voices and the differing quantitative criteria for singing techniques. Deep learning has great potential for application in music education owing to its efficiency in handling complex data and performing quantitative analysis. However, accurate evaluation of rare vocal types with limited samples, such as the Mezzo-soprano, requires extensive, well-annotated data to support deep learning models. To attain this objective, we perform transfer learning by employing deep learning models pre-trained on the ImageNet and UrbanSound8K datasets to improve the precision of vocal technique evaluation. Furthermore, we tackle the lack of samples by constructing a dedicated dataset, the Mezzo-soprano Vocal Set (MVS), for vocal technique assessment. Our experimental results indicate that transfer learning increases the overall accuracy (OAcc) of all models by an average of 8.3%, with the highest accuracy reaching 94.2%. We not only provide a novel approach to evaluating Mezzo-soprano vocal techniques but also introduce a new quantitative assessment method for music education.
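As an illustration of the transfer-learning setup described in this abstract, the sketch below fine-tunes an ImageNet-pretrained ResNet-18 on spectrogram images. The number of classes and the frozen-backbone strategy are illustrative assumptions, not details taken from the MVS study.

```python
# Minimal sketch of the transfer-learning idea: reuse an ImageNet-pretrained CNN
# and fine-tune it on (mel-)spectrogram images of vocal-technique classes.
# NUM_CLASSES and the frozen backbone are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 6  # hypothetical number of vocal-technique labels

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():          # freeze the pretrained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(spectrograms, labels):
    """One fine-tuning step on a batch of 3-channel spectrogram images."""
    optimizer.zero_grad()
    loss = criterion(model(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch: 8 spectrograms resized to 224x224 with 3 channels.
dummy = torch.randn(8, 3, 224, 224)
dummy_labels = torch.randint(0, NUM_CLASSES, (8,))
print(train_step(dummy, dummy_labels))
```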
... This success is based on the combination of a non-linear feature extraction network (often a CNN) followed by an attention-based transformer network. 40 Similarly, AI and ML techniques have been applied to singing voice recognition for detecting singing samples in audio segments, 41 classifying phonation mode, and detecting the singing voice, especially in the western operatic voice, 31,32,42-44 in what can be considered technology-assisted singing teaching, where technological feedback techniques aid singers in achieving vocal targets with improved accuracy and speed. 45-47 Previous studies on using ML to classify phonation types in classical singing voice samples have shown somewhat moderate accuracy, with one initial study reporting an overall accuracy of 64% (range 50-70%) using support vector machine (SVM) algorithms 48 when distinguishing the Breathy, Neutral, Flow, and Pressed phonation classes as defined in. ...
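A minimal sketch of the kind of SVM-based phonation-type classification referred to above is given below, assuming each sample has already been reduced to a fixed-length acoustic feature vector; the synthetic data stands in for real recordings.

```python
# Minimal sketch of SVM-based phonation-type classification, assuming each
# sample is already summarised by a fixed-length acoustic feature vector.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
CLASSES = ["breathy", "neutral", "flow", "pressed"]

# Synthetic stand-in: 200 samples x 20 features with slight per-class offsets.
X = np.vstack([rng.normal(loc=i, scale=2.0, size=(50, 20)) for i in range(4)])
y = np.repeat(CLASSES, 50)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```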
... In this study, while the tested models demonstrated good to very good discriminatory power, this level of performance held only for some of the tested categorisations, and the models were still limited compared to the performance reported for the human auditory-perceptual assessments. Nevertheless, the general gap between AI models and human auditory-perceptual performance is steadily closing, taking a historical perspective from the initial study by Proutskova and colleagues 48 showing 64% overall accuracy in phonation type categorisation to more recent work by Kadiri and colleagues 44,49,119 showing upwards of 85% accuracy. Comparatively, this study's best-performing model on the simplest problem related to RQ1 achieved an accuracy of 88.7%, while the human listeners achieved a balanced accuracy of 96.1% on the comparable auditory-perceptual assessment. ...
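Since this passage compares an overall accuracy figure for the model with a balanced accuracy figure for the listeners, the short sketch below illustrates how the two metrics can diverge on imbalanced data; the labels are hypothetical.

```python
# Minimal sketch of the two metrics being compared: overall accuracy (OAcc)
# versus balanced accuracy, which averages per-class recall and so is robust
# to class imbalance. The labels below are hypothetical.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["pressed"] * 8 + ["breathy"] * 2
y_pred = ["pressed"] * 8 + ["pressed"] * 2  # classifier ignores the rare class

print("overall accuracy:  ", accuracy_score(y_true, y_pred))           # 0.8
print("balanced accuracy: ", balanced_accuracy_score(y_true, y_pred))  # 0.5
```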
... Moreover, cepstral coefficients derived from spectra computed by single frequency filtering and zero-time windowing were studied in [32] for voice quality classification from speech and singing. In [33], modulation power spectral features were studied for the classification of voice quality in classical singing. Recently, voice quality detection (i.e., the detection of different voice qualities within each singing file, along with the onset and offset times of each detected voice quality) was studied in [34]. ...
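As one concrete example of the cepstral features mentioned here, the sketch below extracts conventional MFCC statistics with librosa; it does not reproduce the single frequency filtering or zero-time windowing variants of the cited work, and the file path is a placeholder.

```python
# Minimal sketch of conventional cepstral feature extraction with librosa.
# This illustrates MFCC-style features only; the single-frequency-filtering and
# zero-time-windowing variants from the cited work are not reproduced here.
import librosa
import numpy as np

def mfcc_summary(path, n_mfcc=13):
    """Load a singing/speech file and return per-utterance MFCC statistics."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# features = mfcc_summary("sample_phonation.wav")  # hypothetical file path
```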
Prior studies on the automatic classification of voice quality have mainly used the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs and by extracting MFCCs and glottal source features. This study examines simultaneously recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using an SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models in feature extraction is compared between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the NSA input showed better classification performance than the speech signal. Among the features, the pre-trained model-based features showed better classification accuracies than the conventional features for both speech and NSA inputs. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.
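To illustrate the pre-trained-model feature extraction described in this abstract, the sketch below mean-pools HuBERT hidden states into an utterance-level embedding using the Hugging Face transformers library; the checkpoint name and the pooling choice are generic assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: utterance-level embeddings from a pre-trained HuBERT model,
# mean-pooled over time. The checkpoint and pooling are generic choices, not
# necessarily the configuration used in the cited study.
import torch
from transformers import HubertModel

CHECKPOINT = "facebook/hubert-base-ls960"  # a commonly used public checkpoint
model = HubertModel.from_pretrained(CHECKPOINT).eval()

def embed(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: 1-D float tensor sampled at 16 kHz."""
    with torch.no_grad():
        hidden = model(input_values=waveform_16k.unsqueeze(0)).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)   # mean-pool frames -> (768,) embedding

# One second of dummy audio; real use would load a speech or NSA recording.
# The resulting embedding could then be fed to an SVM or CNN classifier.
print(embed(torch.randn(16000)).shape)     # torch.Size([768])
```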
With the rise of mobile devices, bel canto practitioners increasingly utilize smart devices as auxiliary tools for improving their singing skills. However, they frequently encounter timbre abnormalities during practice, which, if left unaddressed, can potentially harm their vocal organs. Existing singing assessment systems primarily focus on pitch and melody and lack real-time detection of bel canto timbre abnormalities. Moreover, the diverse vocal habits and timbre compositions among individuals present significant challenges in cross-user recognition of such abnormalities. To address these limitations, we propose TimbreSense, a novel bel canto timbre abnormality detection system. TimbreSense enables real-time detection of the five major timbre abnormalities commonly observed in bel canto singing. We introduce an effective feature extraction pipeline that captures the acoustic characteristics of bel canto singing. By applying temporal average pooling to the Short-Time Fourier Transform (STFT) spectrogram, we reduce redundancy while preserving essential frequency-domain information. Our system leverages a transformer model with self-attention mechanisms to extract correlation and semantic features of overtones in the frequency domain. Additionally, we employ a few-shot learning approach involving pre-training, meta-learning, and fine-tuning to enhance the system's cross-domain recognition performance while minimizing usage costs for users. Experimental results demonstrate the system's strong cross-user domain recognition performance and real-time capabilities.
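The sketch below illustrates the front end described in this abstract: an STFT magnitude spectrogram, temporal average pooling, and a transformer encoder attending over frequency bins. The hyperparameters, the pooling over the full clip, and the class count are illustrative assumptions, not the system's published configuration.

```python
# Minimal sketch of the described front end: STFT magnitude spectrogram,
# temporal average pooling, then a transformer encoder attending over frequency
# bins. Pooling scope, model width, and class count are illustrative choices.
import torch
import torch.nn as nn

N_FFT, HOP, D_MODEL, N_CLASSES = 512, 128, 64, 5  # assumed hyperparameters

class TimbreSketch(nn.Module):
    """STFT -> temporal average pooling -> transformer over frequency bins."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(1, D_MODEL)           # one value per frequency bin
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, N_CLASSES)

    def forward(self, wav):                          # wav: (batch, samples)
        spec = torch.stft(wav, n_fft=N_FFT, hop_length=HOP,
                          window=torch.hann_window(N_FFT, device=wav.device),
                          return_complex=True).abs() # (batch, freq, time)
        pooled = spec.mean(dim=-1, keepdim=True)     # temporal average pooling
        tokens = self.embed(pooled)                  # (batch, freq, D_MODEL)
        encoded = self.encoder(tokens)               # self-attention over bins
        return self.head(encoded.mean(dim=1))        # (batch, N_CLASSES) logits

model = TimbreSketch()
logits = model(torch.randn(2, 16000))                # two one-second clips
print(logits.shape)                                  # torch.Size([2, 5])
```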