Article

Automatic Leaderboard: Evaluation of Singing Quality Without a Standard Reference


Abstract

Automatic evaluation of singing quality can be done with the help of a reference singing or the digital sheet music of the song. However, such a standard reference is not always available. In this article, we propose a framework to rank a large pool of singers according to their singing quality without any standard reference. We define musically motivated absolute measures based on the pitch histogram, and relative measures based on inter-singer statistics, to evaluate the quality of singing attributes such as intonation and rhythm. The absolute measures evaluate the goodness of the pitch histogram specific to a singer, while the relative measures use the similarity between singers in terms of pitch, rhythm, and timbre as an indicator of singing quality. With the relative measures, we formulate the concept of veracity, or truth-finding, for the ranking of singing quality. We successfully validate a self-organizing approach to rank-ordering a large pool of singers. The fusion of absolute and relative measures results in an average Spearman's rank correlation of 0.71 with human judgments in a 10-fold cross-validation experiment, which is close to the inter-judge correlation.
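As an illustration of the evaluation protocol described in the abstract, the sketch below (not the authors' code; the per-singer scores, fusion weights, and linear fusion are assumptions) fuses hypothetical absolute and relative measures and computes Spearman's rank correlation against human ratings with SciPy:

# Sketch: fuse absolute and relative measures and compare the resulting
# ranking with human judgments via Spearman's rank correlation.
# Illustrative only -- scores, weights, and the linear fusion are assumptions.
import numpy as np
from scipy.stats import spearmanr

# Per-singer scores from hypothetical absolute (histogram-based) and
# relative (inter-singer similarity) measures, plus human ratings.
absolute = np.array([0.62, 0.80, 0.45, 0.91, 0.55])
relative = np.array([0.58, 0.76, 0.50, 0.88, 0.49])
human    = np.array([3.1, 4.2, 2.5, 4.8, 2.9])

fused = 0.5 * absolute + 0.5 * relative   # simple weighted fusion (assumed weights)
rho, _ = spearmanr(fused, human)          # rank correlation with human judgments
print(f"Spearman correlation: {rho:.2f}")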


... Over the years, different ASSE systems have been proposed. Depending on whether a reference melody is taken as the ground truth, these ASSE systems can be classified as reference-dependent [4][5][6][7][8][9] or reference-independent approaches [10][11][12][13][14][15][16][17][18]. Recent research on ASSE has been mainly focused on reference-independent deep learning-based approaches, where CNN-based architectures are often used to extract useful patterns from input spectrograms [11,14,15,17,18]. ...
... Depending on whether a reference melody is taken as the ground truth, these ASSE systems can be classified as reference-dependent [4][5][6][7][8][9] or reference-independent approaches [10][11][12][13][14][15][16][17][18]. Recent research on ASSE has been mainly focused on reference-independent deep learning-based approaches, where CNN-based architectures are often used to extract useful patterns from input spectrograms [11,14,15,17,18]. Other features including pitch histograms [11,14,15,17] and singer timbre embeddings [17,18] are also used, and these features are usually fused via concatenation. ...
... Recent research on ASSE has been mainly focused on reference-independent deep learning-based approaches, where CNN-based architectures are often used to extract useful patterns from input spectrograms [11,14,15,17,18]. Other features including pitch histograms [11,14,15,17] and singer timbre embeddings [17,18] are also used, and these features are usually fused via concatenation. Although this is a simple way of feature fusion, more advanced techniques that could uncover deeper relationships between these features remain unexplored in ASSE. ...
Conference Paper
Full-text available
Automatic singing skill evaluation (ASSE) systems are predominantly designed for solo singing, and the scenario of singing with accompaniment is largely unaddressed. In this paper, we propose an end-to-end ASSE system that effectively processes both solo singing and singing with accompaniment using data augmentation, where a comparative study is conducted on four different data augmentation approaches. Additionally, we incorporate bi-directional cross-attention (BiCA) for feature fusion which, compared to simple concatenation, can better exploit the interrelationships between different features. Results on the 10KSinging dataset show that data augmentation and BiCA boost performance individually. When combined, they contribute to further improvements, with a Pearson correlation coefficient of 0.769 for solo singing and 0.709 for singing with accompaniment. This represents relative improvements of 36.8% and 26.2% compared to the baseline model score of 0.562, respectively.
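The following minimal PyTorch sketch illustrates the bi-directional cross-attention fusion idea mentioned above; it is not the paper's implementation, and the feature shapes, embedding dimension, and pooling are assumptions:

# Sketch of bi-directional cross-attention (BiCA) feature fusion, assuming
# two feature sequences (e.g., spectrogram and pitch-histogram embeddings).
# Illustrative reconstruction, not the authors' implementation.
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # Each stream queries the other, so information flows both ways.
        a_enriched, _ = self.a_to_b(feat_a, feat_b, feat_b)
        b_enriched, _ = self.b_to_a(feat_b, feat_a, feat_a)
        # Pool over time and concatenate for a downstream regression head.
        return torch.cat([a_enriched.mean(dim=1), b_enriched.mean(dim=1)], dim=-1)

fusion = BiCrossAttentionFusion()
spec_emb = torch.randn(2, 100, 128)   # (batch, time, dim) spectrogram features
pitch_emb = torch.randn(2, 50, 128)   # (batch, time, dim) pitch features
print(fusion(spec_emb, pitch_emb).shape)  # torch.Size([2, 256])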
... Automatic singing skill evaluation (ASSE) can be a great alternative, and its application to rating Karaoke performances dates back to the 1990s [2]- [4]. In recent years, different machine learning approaches have been proposed, which treat ASSE as a classification task [5]- [7], a ranking task [8], or a regression task [9]- [11], resulting in a categorical rating (e.g., good or bad), a ranking (e.g., leaderboard), or a numerical rating (e.g., 87 out of 100), respectively. In this paper, we consider ASSE a regression task and propose a model that generates numerical ratings to reflect the overall singing skill. ...
... The baseline model [11] used the DAMP SingEval dataset [8], which contains 400 solo singing clips (four songs, 100 performances per song) collected from Smule. The ratings are publicly available, but the audios are not, due to possible proprietary restrictions from Smule. ...
... The workflow of creating 10K Singing is shown in Fig. 1. We first began with two singer leaderboard videos, for Chinese male singers and Chinese female singers, from Bilibili. We chose these two particular videos because: ...
Conference Paper
Full-text available
Most automatic singing skill evaluation (ASSE) models focus only on solo singing, resulting in a limited application scope since singing is usually mixed with instrumental accompaniment in music. In this paper, we propose a more general ASSE model which applies to both solo singing and singing with accompaniment. For this purpose, we employ an existing singing voice separation tool for accompaniment removal and compare ASSE models trained with and without accompaniment. Results show that accompaniment removal achieves better performances. Furthermore, we explore different features and model architectures, concluding that the additions of timbral features, attention mechanism, and dense layer further improve the performance. Finally, we show that our proposed model achieves a Pearson correlation coefficient of 0.562, a 62.4% relative improvement compared to 0.346 for the baseline model.
... In the field of music information retrieval, publicly available singing vocal datasets for singing quality evaluation are scarce, and the existing public datasets often have inherent defects such as poor audio quality, biased distribution of songs, and incorrect labels [14]. There has been some effort in building singing vocal datasets for the purpose of singing quality assessment [4], [9]; however, the number of audio recordings is small, and manual annotations are only available for overall singing quality. Most public datasets depend on in-house data collection and audio labeling by human annotators [15], [16] from the authors' laboratory or company, which is time-consuming and expensive. ...
... Such datasets are not balanced for the purpose of singing quality assessment because amateur singing qualities are underrepresented. A subset of the DAMP dataset [9], consisting of 400 singing renditions, was assessed by human annotators for their overall singing quality through a crowd-sourcing platform. However, for training an explainable neural network model for singing quality assessment, apart from overall singing quality score, annotations for the individual perceptual parameters such as intonation accuracy, rhythm consistency etc. are also needed. ...
... One of the drawbacks of existing singing quality evaluation datasets [9] is that they only have an overall assessment score and do not have detailed manual annotation about perceptual parameters such as intonation accuracy, rhythm consistency etc. Therefore, supervised training for such detailed parameters is not possible with the existing datasets. ...
Conference Paper
Full-text available
Data-driven methods for automatic singing quality assessment have so far focused on obtaining an overall singing assessment score of a given singing rendition. However, the explainability of such a score in terms of musically relevant components of singing quality such as intonation accuracy and rhythm correctness has not been attempted due to the lack of annotated training data. In this work, we propose to augment a singing vocals dataset, containing only professional singing renditions, with negative samples for improving the diversity in singing quality examples in the training data. We validate this augmented dataset through listening tests. Moreover, we use this data to formulate a multi-task learning framework that can simultaneously provide pitch accuracy feedback along with an overall singing quality score for a given singing rendition. We show that our methods outperform existing systems for both unseen songs and singers singing English and Mandarin popular songs.
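A minimal sketch of the multi-task idea above (a shared trunk with one head for pitch-accuracy feedback and one for the overall score); the architecture and loss weighting are illustrative assumptions, not the paper's exact model:

# Minimal multi-task sketch: a shared CNN trunk with two heads, one for
# pitch-accuracy feedback and one for the overall singing quality score.
# Architecture details are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class MultiTaskSingingAssessor(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
        )
        self.pitch_head = nn.Linear(64, 1)     # pitch accuracy feedback
        self.overall_head = nn.Linear(64, 1)   # overall singing quality score

    def forward(self, spectrogram):
        h = self.trunk(spectrogram)
        return self.pitch_head(h), self.overall_head(h)

model = MultiTaskSingingAssessor()
pitch_pred, overall_pred = model(torch.randn(8, 1, 128, 256))
# Training would minimise a weighted sum of the two task losses, e.g.
# loss = mse(pitch_pred, pitch_target) + mse(overall_pred, overall_target)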
... Automatic singing skill (or quality) evaluation methods seek to provide quantitative measurement of the quality of a singing rendition on the basis of all or a subset of the perceptual criteria that humans use, to provide meaningful feedback to the singers. There have been broadly two approaches for automatic singing skill evaluation [39]- [41]: reference-dependent and reference-independent. In the following, we will introduce both approaches, but since deep learning approaches have not yet been popular for reference-dependent singing skill evaluation, we will focus more on reference-independent evaluation based on deep learning. ...
... With the immense amount of online uploads on singing platforms, Gupta et al. [39] leveraged the comparative statistics between singers as well as music theory to derive a leaderboard of singers, where the singers are rank-ordered according to their singing quality relative to each other. They designed inter-singer relative measures based on the hypothesis that given a song that has a particular sequence of notes and a rhythm, it can be sung correctly in one or a few consistent ways, but incorrectly in many different, and dissimilar ways. ...
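A toy sketch of such an inter-singer relative measure: each singer's pitch contour is compared with every other singer's via DTW, and singers closer to the consensus score higher. The plain DTW and the averaging scheme below are assumptions for illustration, not the paper's exact measures:

# Sketch of an inter-singer relative measure: singers whose pitch contours
# are close to many other singers (after DTW alignment) are assumed to sing
# the "consistent" correct version; outliers score lower. Illustrative only.
import numpy as np

def dtw_distance(x, y):
    """Plain O(len(x)*len(y)) DTW over 1-D pitch contours (in cents)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def relative_scores(contours):
    """Score each singer by average closeness to all other singers."""
    k = len(contours)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(contours[i], contours[j])
    return -dist.mean(axis=1)   # higher (less negative) = closer to the consensus

rng = np.random.default_rng(0)
contours = [rng.normal(0, 50, size=200) for _ in range(5)]  # fake pitch contours
print(relative_scores(contours))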
... Although some methods [22], [59] have attempted to provide an explainable score using data augmentation techniques, much needs to be done to build standard training and test datasets with human labels that go beyond an overall assessment score. As reported in [39], the longer the singing input, the better the prediction accuracy. So, these methods are not yet suitable for real-time feedback. ...
Article
Full-text available
Singing, the vocal production of musical tones, is one of the most important elements of music. Addressing the needs of real-world applications, the study of technologies related to singing voices has become an increasingly active area of research. In this paper, we provide a comprehensive overview of the recent developments in the field of singing information processing, specifically in the topics of singing skill evaluation, singing voice synthesis, singing voice separation, and lyrics synchronization and transcription. We will especially focus on deep learning approaches including modern representation learning techniques for singing voices. We will also provide an overview of contributions in public datasets for singing voice research.
... To find the pitch histogram we follow the same approach explained in [70]. We first transform the F0 values from Hz to Cents. ...
... It can be observed from Figure 6(a) that the pitch histogram of the singing voice showcases prominent peaks, indicating that pitch values in some ranges are more frequently produced by the singer. This further points to the fact that singers deliver the dominant notes of the song frequently and consistently [70]. On the other hand, the pitch histogram of speech in Figure 6(b) is relatively flat, showing the absence of specific musical notes. ...
... To further observe the behavior of pitch histogram across speech and singing over the entire NHSS database, we calculate the kurtosis and skew of the pitch histograms [70] for speech and singing corresponding to each song for all singers. We plot the average kurtosis and skew values over all the songs for each speaker in Figure 7. Kurtosis and skew can be used as statistical measures to approximate the sharpness of the pitch histogram, which exhibit higher values for good quality singing [70]. ...
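A small sketch of these pitch-histogram statistics, assuming an arbitrary reference frequency and bin resolution (not the cited papers' exact settings): convert F0 to cents, fold onto one octave, and compute kurtosis and skew of the histogram with SciPy:

# Sketch of the pitch-histogram statistics discussed above: convert F0 from Hz
# to cents, fold into one octave (12 semitones), then compute kurtosis and skew
# of the histogram. Reference frequency and bin count are assumptions.
import numpy as np
from scipy.stats import kurtosis, skew

def octave_folded_histogram(f0_hz, ref_hz=55.0, bins_per_semitone=10):
    f0_hz = f0_hz[f0_hz > 0]                       # drop unvoiced frames
    cents = 1200.0 * np.log2(f0_hz / ref_hz)       # Hz -> cents
    folded = np.mod(cents, 1200.0)                 # fold onto a single octave
    hist, _ = np.histogram(folded, bins=12 * bins_per_semitone,
                           range=(0, 1200), density=True)
    return hist

f0 = 220.0 * 2 ** (np.random.randn(5000) * 0.02)   # toy F0 track around A3
hist = octave_folded_histogram(f0)
print("kurtosis:", kurtosis(hist), "skew:", skew(hist))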
Article
Full-text available
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. We release this database to the public to support research activities, including, but not limited to, comparative studies of acoustic attributes of speech and singing signals, cooperative synthesis of speech and singing voices, and speech-to-singing conversion. This database consists of recordings of sung vocals of English pop songs, the spoken counterpart of lyrics of the songs read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to 100 songs sung and spoken by 10 singers, resulting in a total of 7 h of audio data. There are 5 male and 5 female singers, singing and reading the lyrics of 10 songs each. In this paper, we discuss the design methodology of the database, analyse the similarities and dissimilarities in characteristics of speech and singing voices, and provide some strategies to address relationships between these characteristics for converting one to another. We develop benchmark systems, which can be used as reference for speech-to-singing alignment, spectral mapping, and conversion using the NHSS database.
... To demonstrate the presence of well expressed musical notes in singing, as exhibited by pitch contours, we plot pitch histograms for singing and speech corresponding to one song rendered by a female singer. These pitch histograms are plotted after transforming the F0 values from Hz to cents, where one semitone is 100 cents on an equitempered octave [55]. We have made the count of pitch values in this histogram folded onto 12 semitones in an octave as described in [55]. ...
... These pitch histograms are plotted after transforming the F0 values from Hz to cents, where one semitone is 100 cents on an equitempered octave [55]. We have made the count of pitch values in this histogram folded onto 12 semitones in an octave as described in [55]. It can be observed from Figure 6(a) that the pitch histogram of singing voice showcases prominent peaks, indicating that pitch values in some ranges are more frequently produced by the singer. ...
... It can be observed from Figure 6(a) that the pitch histogram of singing voice showcases prominent peaks, indicating that pitch values in some ranges are more frequently produced by the singer. This further points to the fact that singers deliver the dominant notes of the song frequently and consistently [55]. On the other hand, the pitch histogram of speech in Figure 6(b) is relatively flat, showing the absence of specific musical notes. ...
Preprint
Full-text available
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. We release this database to the public to support research activities, including, but not limited to, comparative studies of acoustic attributes of speech and singing signals, cooperative synthesis of speech and singing voices, and speech-to-singing conversion. This database consists of recordings of sung vocals of English pop songs, the spoken counterpart of lyrics of the songs read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to 100 songs sung and spoken by 10 singers, resulting in a total of 7 hours of audio data. There are 5 male and 5 female singers, singing and reading the lyrics of 10 songs each. In this paper, we discuss the design methodology of the database, analyze the similarities and dissimilarities in characteristics of speech and singing voices, and provide some strategies to address relationships between these characteristics for converting one to another. We develop benchmark systems for speech-to-singing alignment, spectral mapping and conversion using the NHSS database.
... Gupta et al. [15], [16] designed features that characterize the shape of the pitch histogram and inter-singer distances to evaluate singing quality without a reference. The reference independent methods of singing quality evaluation have mostly involved intonation characterization. ...
... This approach for rhythm representation was derived purely from audio signal analysis, where the periodic structure provided by the background music in the song helped in characterizing the rhythm. Previously, the pitch histogram of a singing voice has been shown to provide a comprehensive representation for intonation [15], [16]. But the pitch histogram loses all information about timing, and hence rhythm, and therefore it is not suitable for rhythm analysis. ...
... In addition to the implementation of SingDistVis, we also prepared a simple baseline implementation that displays only the F0 trajectory of a user-selected singer as a polyline chart. In this user study, we employed the singing dataset from four songs labeled for singing quality for 100 singers per song [10] ...
Article
Full-text available
This paper describes SingDistVis, an information visualization technique for fundamental frequency (F0) trajectories of large-scale singing data where numerous singers sing the same song. SingDistVis allows users to explore F0 trajectories interactively by combining two views: OverallView and DetailedView. OverallView visualizes a distribution of the F0 trajectories of the song in a time-frequency heatmap. When a user specifies an interesting part, DetailedView zooms in on the specified part and visualizes singing assessment (rating) results. Here, it displays high-rated singings in red and low-rated singings in blue. When the user clicks on a particular singing, the audio source is played and its F0 trajectory through the song is displayed in OverallView. We selected heatmap-based visualization for OverallView to provide an overview of a large-scale F0 dataset, and polyline-based visualization for DetailedView to provide a more precise representation of a small number of particular F0 trajectories. This paper introduces a subjective experiment using 1,000 singing voices to determine suitable visualization parameters. Then, this paper presents user evaluations where we asked participants to compare visualization results of four types of Overview+Detail designs, and concludes that the presented design achieved better evaluations than the other designs in all seven questions. Finally, this paper describes a user experiment in which eight participants compared SingDistVis with a baseline implementation in exploring singing voices of interest, and concludes that the proposed SingDistVis achieved better evaluations in nine of the questions.
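A rough sketch of an OverallView-style heatmap (not the SingDistVis implementation; the melody and singer spread are synthetic): the F0 trajectories of many singers are pooled into a 2-D time-frequency histogram with Matplotlib:

# Sketch of an OverallView-style time-frequency heatmap for many F0
# trajectories of the same song (illustrative, not the SingDistVis code).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_singers, n_frames = 100, 500
times = np.tile(np.arange(n_frames), n_singers)
melody = 300 + 50 * np.sin(np.linspace(0, 8 * np.pi, n_frames))        # toy melody
f0 = (melody + rng.normal(0, 15, size=(n_singers, n_frames))).ravel()  # spread across singers

plt.hist2d(times, f0, bins=[100, 80], cmap="magma")   # density of F0 over time
plt.xlabel("frame"); plt.ylabel("F0 (Hz)")
plt.title("Distribution of F0 trajectories across singers")
plt.colorbar(label="count")
plt.show()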
... effective in supporting beginners' training [6]. Due to the ubiquity of artificial intelligence (AI), machine learning (ML) methods have also been applied in the automatic assessment of singing quality [7]. Similarly, research on the automatic recognition of specific singing techniques has recently gained popularity [8,9]. ...
Conference Paper
Full-text available
The usefulness of computer-based tools in supporting singing pedagogy has been demonstrated. With the increasing use of artificial intelligence (AI) in education, machine learning (ML) has been applied in music-pedagogy related tasks too, e.g., singing technique recognition. Research has also shown that comparing ML performance with human perception can elucidate the usability of AI in real-life scenarios. Nevertheless, this assessment is still missing for singing technique recognition. Thus, we comparatively evaluate classification and perceptual results from the identification of singing techniques. Since computer-assisted singing often relies on visual feedback, both an auditory task (recognition from a cappella singing) and a visual one (recognition from spectrograms) were performed. Responses by 60 humans were compared with ML outcomes. By guaranteeing comparable setups, our results indicate that ML can capture differences in human auditory and visual perception. This opens new horizons in the application of AI-supported learning.
... Gururani et al. [20] extend this work by applying linear regression and correlation analysis to quantify the importance of each descriptor in assessing musicality, note accuracy, rhythm accuracy, and tone quality. Gupta et al. [21] use unsupervised methods to assess a large number of singers singing the same song by clustering. There are also some recent works on assessing song intelligibility [22]. ...
Preprint
Full-text available
The growing popularity of computer-assisted pedagogy and audio analysis methods based on machine learning stimulates research into developing tools for music pedagogy. Despite the proliferation of commercial tools for teaching music, a lack of systematic research on computer-assisted music pedagogy exists. This paper describes pedagogical aids for music education. Novel methods are proposed for detecting singing mistakes by comparing teacher's and learner's audio. These methods consist of a CNN-based model for comparing pitch and amplitude contours and a CRNN-based model for comparing spectrograms. A new evaluation method is proposed to compare the efficacy of mistake detection systems. Experiments indicate that the proposed learning-based methods are superior to the rule-based baseline. A systematic study of errors and a cross-teacher study reveal insights into music pedagogy that can be utilized when deploying the proposed tool. In addition, a new dataset of teacher and learner audio recordings, annotated for singing mistakes, is presented.
... For human experts, studies have shown that even if the melody is new to them, they can make a highly consistent evaluation [1], indicating the feasibility of the reference-independent approaches. The traditional reference-independent singing evaluation mainly focused on intonation characteristics [13,14]. Since Zhang et al. [15] proposed a convolutional neural network (CNN) based model trained on two-class data, the data-driven methods have been introduced into this field. ...
Preprint
Automatic singing evaluation independent of reference melody is a challenging task due to its subjective and multi-dimensional nature. As an essential attribute of singing voices, vocal timbre has a non-negligible effect and influence on human perception of singing quality. However, no research has been done to include timbre information explicitly in singing evaluation models. In this paper, a data-driven model, TG-Critic, is proposed to introduce timbre embeddings as one of the model inputs to guide the evaluation of singing quality. The trunk structure of TG-Critic is designed as a multi-scale network to summarize the contextual information from constant-Q transform features in a high-resolution way. Furthermore, an automatic annotation method is designed to construct a large three-class singing evaluation dataset with low human effort. The experimental results show that the proposed model outperforms the existing state-of-the-art models in most cases.
Article
Voice health is traditionally assessed by methods that rely on the perception of a clinician, who integrates auditory and visual cues in order to reach a conclusion about the voice under evaluation. However, these tasks suffer from inter-professional variability due to their subjective nature, which is why more objective, computation-based methods are of interest. Two examples of such subjective tasks are the classification of voices into three types according to their periodicity, also termed voice typing, and the evaluation of six aspects of voice quality by means of the consensus auditory-perceptual evaluation of voice (CAPE-V) protocol. In this paper, two approaches to emulate each of those tasks are introduced, based on simple features extracted from scattering transform coefficients and support vector machines. Firstly, a system for automatic voice typing was trained and its classification performance was evaluated for intra- and inter-dataset trials using two widely known corpora. Accuracies above 80%, comparable to the state-of-the-art, were found for all the experiments conducted. Secondly, a multidimensional, multioutput regression chain model was used to automatically grade the voice quality features of the CAPE-V protocol, obtaining errors and correlation coefficients that are comparable to those found for three human raters.
Article
Sight-singing exercises are a fundamental part of music education. In this paper, we present an objective and complete automatic evaluation system for sight-singing, which has two critical stages: note transcription and note alignment. In the first stage, we use an onset detector based on the convolutional recurrent neural network (CRNN) for note segmentation and the pitch extractor described in (Kim et al. 2018) for note labeling. In the second stage, an alignment algorithm based on relative pitch modeling is proposed. Due to the lack of datasets for sight-singing note alignment and the overall system evaluation, we construct the sight-singing vocal dataset (SSVD). Each module of the system and the entire system are tested on this dataset. The onset detector achieves an F-measure of 90.61%, and the stages of note transcription and note alignment achieve an F-measure of 88.42% and 94.79%, respectively. In addition, we propose an objective criterion for the sight-singing evaluation system. Based on this criterion, our automatic sight-singing system achieves an F-measure of 77.95% on the SSVD dataset.
Article
Full-text available
Human experts evaluate singing quality based on many perceptual parameters such as intonation, rhythm, and vibrato, with reference to music theory. We previously proposed the Perceptual Evaluation of Singing Quality (PESnQ) framework, which incorporated acoustic features related to these perceptual parameters in combination with the cognitive modeling concept of the telecommunication standard Perceptual Evaluation of Speech Quality to evaluate singing quality. In this study, we further develop the PESnQ framework to approximate human judgments. First, we find that a linear combination of the individual perceptual parameter human scores can predict their overall singing quality judgment. This provides us with a human parametric judgment equation. Next, the prediction of the individual perceptual parameter scores from the PESnQ acoustic features shows a high correlation with the respective human scores, which enables more meaningful feedback to learners. Finally, we compare the performance of early fusion and late fusion of the acoustic features in predicting the overall human scores. We find that the late fusion method is superior to the early fusion method. This work underlines the importance of modeling human perception in automatic singing quality assessment.
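The "human parametric judgment equation" mentioned above can be illustrated with a simple linear regression from per-parameter scores to the overall judgment; the data below is synthetic and the weights are assumptions, not the paper's fitted values:

# Sketch of a "human parametric judgment equation": fit a linear combination
# of per-parameter human scores (intonation, rhythm, vibrato, ...) to the
# overall quality judgment. Data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# rows = singing clips, columns = perceptual parameter scores
param_scores = rng.uniform(1, 5, size=(40, 3))          # intonation, rhythm, vibrato
overall = param_scores @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.1, 40)

reg = LinearRegression().fit(param_scores, overall)
print("learned weights:", reg.coef_, "intercept:", reg.intercept_)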
Conference Paper
Full-text available
A perceptually valid automatic singing evaluation score could serve as a complement to singing lessons, and make singing training more accessible to the masses. In this study, we adopt the idea behind the PESQ (Perceptual Evaluation of Speech Quality) scoring metric, and propose various perceptually relevant features to evaluate singing quality. We correlate the obtained singing quality score, which we term the Perceptual Evaluation of Singing Quality (PESnQ) score, with that given by music-expert human judges, and compare the results with known baseline systems. It is shown that the proposed PESnQ has a correlation of 0.59 with human ratings, which is an improvement of approximately 96% over baseline systems.
Article
Full-text available
We present an end-to-end system for musical key estimation, based on a convolutional neural network. The proposed system not only outperforms existing key estimation methods proposed in the academic literature; it is also capable of learning a unified model for diverse musical genres that performs comparably to existing systems specialised for specific genres. Our experiments confirm that different genres do differ in their interpretation of tonality, and thus a system tuned e.g. for pop music performs subpar on pieces of electronic music. They also reveal that such cross-genre setups evoke specific types of error (predicting the relative or parallel minor). However, using the data-driven approach proposed in this paper, we can train models that deal with multiple musical styles adequately, and without major losses in accuracy.
Conference Paper
Full-text available
The web is a huge source of valuable information. However, in recent times, there is an increasing trend towards false claims in social media, other web sources, and even in news. Thus, fact-checking websites have become increasingly popular to identify such misinformation based on manual analysis. Recent research proposed methods to assess the credibility of claims automatically. However, there are major limitations: most works assume claims to be in a structured form, and a few deal with textual claims but require that sources of evidence or counter-evidence are easily retrieved from the web. None of these works can cope with newly emerging claims, and no prior method can give user-interpretable explanations for its verdict on the claim's credibility. This paper overcomes these limitations by automatically assessing the credibility of emerging claims, with sparse presence in web sources, and generating suitable explanations from judiciously selected sources. To this end, we retrieve diverse articles about the claim, and model the mutual interaction between: the stance (i.e., support or refute) of the sources, the language style of the articles, the reliability of the sources, and the claim's temporal footprint on the web. Extensive experiments demonstrate the viability of our method and its superiority over prior works. We show that our methods work well for early detection of emerging claims, as well as for claims with limited presence on the web and social media.
Article
Full-text available
Singing is an omnipresent feature of musical cultures and occurs in many different forms and guises. Because of its variety and developmental nature, any assessment of singing needs to take careful account of the particular cultural context in which the observed behaviour occurs. The literature on singing indicates that the nature of the sample, the location, the social context, the type of singing task and the assessment method are all likely to be significant variables. An overview of singing assessment is presented in which the application of recent advances in speech science technology are set alongside more traditional assessment methods, with examples of assessment evidence from the project Singing Development in Early Childhood at the Roehampton Institute.
Conference Paper
Full-text available
The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, comprising a variety of singer categories and singing exercises. The algorithmic performance is assessed according to the ability to detect voicing boundaries and to accurately estimate pitch contour. First, we evaluate the usefulness of adapting existing methods to singing voice analysis. Then we compare the accuracy of several pitch-extraction algorithms, depending on singer category and laryngeal mechanism. Finally, we analyze their robustness to reverberation.
Conference Paper
Full-text available
Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.
Article
Full-text available
A new microcomputer-based system is described which has been developed for the assessment and development of singing ability. It makes use of a specially developed hardware interface which estimates the fundamental frequency of a sung or spoken input as the basis for (a) a measurement of vocal pitching accuracy (assessment), and (b) a vocal pitch display to provide visual feedback (development). Results are discussed which are based on a study carried out with the system in a British primary school, and they indicate that this system is effective in promoting singing development.
Article
Full-text available
A new control paradigm of source signals for high quality speech synthesis is introduced to handle a variety of speech quality, based on time-frequency analyses by the use of an instantaneous frequency and group delay. The proposed signal representation consists of a frequency domain aperiodicity measure and a time domain energy concentration measure to represent source attributes, which supplement the conventional source information, such as F0 and power. The frequency domain aperiodicity measure is defined as a ratio between the lower and upper smoothed spectral envelopes to represent the relative energy distribution of aperiodic components. The time domain measure is defined as an effective duration of the aperiodic component. These aperiodicity parameters and F0 as time functions are used to generate the source signal for synthetic speech by controlling relative noise levels and the temporal envelope of the noise component of the mixed mode excitation signal, including fine timing and amplitude fluctuations. A series of preliminary simulation experiments was conducted to test and to demonstrate consistency of the proposed method. Examples sung in different voice qualities were also analyzed and resynthesized using the proposed method.
Conference Paper
Full-text available
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites. We design a general framework for the Veracity problem, and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information, i.e., a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. Our experiments show that TruthFinder successfully finds true facts among conflicting information, and identifies trustworthy web sites better than the popular search engines.
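A simplified sketch of this kind of iterative truth-finding (omitting TruthFinder's implication between facts and its dampening details): fact confidence is derived from source trustworthiness, and source trustworthiness from the confidence of the facts it provides. The toy claims and initial trust values are assumptions:

# Sketch of a TruthFinder-style iteration: alternate between updating fact
# confidence from source trustworthiness and source trustworthiness from
# the facts each source provides. Simplified and illustrative only.
import math

# claims[fact] = set of sources asserting that fact (toy data)
claims = {
    "A": {"site1", "site2", "site3"},
    "B": {"site3", "site4"},
}
sources = {s for srcs in claims.values() for s in srcs}
trust = {s: 0.9 for s in sources}          # initial trustworthiness

for _ in range(10):
    # fact confidence: combine the trust of its supporting sources
    conf = {f: 1 - math.prod(1 - trust[s] for s in srcs)
            for f, srcs in claims.items()}
    # source trustworthiness: average confidence of the facts it provides
    for s in sources:
        provided = [conf[f] for f, srcs in claims.items() if s in srcs]
        trust[s] = sum(provided) / len(provided)

print(conf, trust)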
Conference Paper
Full-text available
This paper describes a system that compares user renditions of short sung clips with the original version of those clips. The pitch of both recordings was estimated and then Viterbi-aligned with each other. The total difference in pitch after alignment was used as a distance metric and transformed into a rating out of ten, to indicate to the user how close he or she was to the original singer. An existing corpus of sung speech was used for initial design and optimisation of the system. We then collected further development and evaluation corpora; these recordings were judged for closeness to an original recording by two human judges. The rankings assigned by those judges were used to design and optimise the system. The design was then implemented and deployed as part of a telephone-based entertainment application. Index Terms: automated singing evaluation, pitch tracking, entertainment applications
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Although considerable progress has been made in the development of acoustic and physiological measures of operatic singing voice, there is still no widely accepted objective tool for the evaluation of its multidimensional features. Auditory-perceptual evaluation, therefore, remains an important evaluation method for singing pedagogues, voice scientists, and clinicians who work with opera singers. Few investigators, however, have attempted to develop standard auditory-perceptual tools for evaluation of the operatic voice. This study aimed to pilot test a new auditory-perceptual rating instrument for operatic singing voice. Nine expert teachers of operatic singing used the instrument to rate the singing voices of 21 professional opera chorus artists from a national opera company. The findings showed that the instrument has good face validity, that it can be legitimately treated as a psychometrically sound scale, and that raters can use the scale consistently, both between and within judges. This new instrument, therefore, has the potential to allow opera singers, their teachers, voice care clinicians, and researchers to evaluate the important auditory-perceptual features of operatic voice quality.
Article
Full-text available
Several models have been described in the literature which seek to represent audio stimuli in the perceptual domain to best predict the audibility of errors and distortions. By modelling the principal nonlinear processes of human hearing it is possible to calculate a perceptual domain error surface that represents the audible difference between distorted and original audio signals. A further stage of analysis is required to maximise the usefulness of the auditory model output. The audible error surface must be interpreted to produce an estimate of the overall subjective judgement which would result from the particular distortion. Ideally, the interpretation of the error surface should be broadly analogous to human perceptual mechanisms, and equally, it would be desirable to avoid the complex and cumbersome statistical mapping and clustering techniques proposed by some authors. A technique employed in adaptive transform coding of images, namely cell entropy, offered several desired properties. The paper reports the extension and application of such a technique to the interpretation of perceptual-domain error surfaces produced by an auditory model. Speech data were subjected to an example, algorithmically generated, nonlinear distortion and then processed by the auditory model. The usefulness of the error-activity and error-entropy quantities are illustrated, without optimisation, by comparison of model predictions and experimentally determined opinion scores
Article
Full-text available
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various websites. We design a general framework for the Veracity problem and invent an algorithm, called TruthFinder, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the trustworthiness of websites and the correctness of information from each other. Our experiments show that TruthFinder successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.
Article
Full-text available
In order to represent musical content, pitch and timing information is utilized in the majority of existing work in Symbolic Music Information Retrieval (MIR). Symbolic representations such as MIDI allow the easy calculation of such information and its manipulation. In contrast, most of the existing work in Audio MIR uses timbral and beat information, which can be calculated using automatic computer audition techniques. In this paper, Pitch Histograms are defined and proposed as a way to represent the pitch content of music signals both in symbolic and audio form. This representation is evaluated in the context of automatic musical genre classification. A multiple-pitch detection algorithm for polyphonic signals is used to calculate Pitch Histograms for audio signals. In order to evaluate the extent and significance of errors resulting from the automatic multiple-pitch detection, automatic musical genre classification results from symbolic and audio data are compared. The comparison indicates that Pitch Histograms provide valuable information for musical genre classification. The results obtained for both symbolic and audio cases indicate that although pitch errors degrade classification performance for the audio case, Pitch Histograms can be effectively used for classification in both cases.
Article
In the Big Data era, truth discovery has served as a promising technique to solve conflicts in the facts provided by numerous data sources. The most significant challenge for this task is to estimate source reliability and select the answers supported by high quality sources. However, existing works assume that one data source has the same reliability on any kinds of entity, ignoring the possibility that a source may vary in reliability on different domains. To capture the influence of various levels of expertise in different domains, we integrate domain expertise knowledge to achieve a more precise estimation of source reliability. We propose to infer the domain expertise of a data source based on its data richness in different domains. We also study the mutual influence between domains, which will affect the inference of domain expertise. Through leveraging the unique features of the multi-truth problem that sources may provide partially correct values of a data item, we assign more reasonable confidence scores to value sets. We propose an integrated Bayesian approach to incorporate the domain expertise of data sources and confidence scores of value sets, aiming to find multiple possible truths without any supervision. Experimental results on two real-world datasets demonstrate the feasibility, efficiency and effectiveness of our approach.
Article
Best worst scaling (BWS) can be a method of data collection, and/or a theory of how respondents provide top and bottom ranked items from a list. The three ‘cases’ of BWS are described, followed by a summary of the main models and related theoretical results, including an exposition of possible theoretical relationships between estimates from two of the cases. This is followed by the theoretical and empirical properties of ‘best minus worst scores.’ The entry ends with some directions for future research.
Article
Intonation is an important concept in Carnatic music that is characteristic of a raaga, and intrinsic to the musical expression of a performer. In this paper we approach the description of intonation from a computational perspective, obtaining a compact representation of the pitch track of a recording. First, we extract pitch contours from automatically selected voice segments. Then, we obtain a pitch histogram of its full pitch range, normalized by the tonic frequency, from which each prominent peak is automatically labelled and parametrized. We validate such parametrization by considering an explorative classification task: three raagas are disambiguated using the characterization of a single peak (a task that would seriously challenge a more naïve parametrization). Results show consistent improvements for this particular task. Furthermore, we perform a qualitative assessment on a larger collection of raagas, showing the discriminative power of the entire representation. The proposed generic parametrization of the intonation histogram should be useful for musically relevant tasks such as performer and instrument characterization.
Conference Paper
This paper proposes an automatic singing evaluation system. The system provides the user with a score assessed from acoustic features and the rhythmic similarity between the original song and the user input. The assessment system is divided into two stages. In the first stage, acoustic similarities are measured by dynamic time warping (DTW). In the second stage, the rhythmic similarity is measured by analysing the optimal path of the DTW by quadratic polynomial regression. Finally, the similarities from the two stages are combined into one score using corresponding weights. The experimental results show good performance for the automatic singing evaluation system.
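A sketch of the two-stage idea above, assuming 1-D feature sequences: align the query to the reference with DTW, then fit a quadratic polynomial to the warping path and use its curvature as a rough rhythmic-deviation cue. The plain DTW and the penalty definition are illustrative assumptions:

# Sketch: DTW alignment followed by a quadratic polynomial fit of the
# warping path; the path's curvature is taken as a rough tempo-drift cue.
import numpy as np

def dtw_path(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(x[i-1] - y[j-1]) + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    # backtrack the optimal path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i-1, j-1], D[i-1, j], D[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return np.array(path[::-1])

ref = np.sin(np.linspace(0, 10, 200))
query = np.sin(np.linspace(0, 10, 180) ** 1.05)       # slightly rushed rendition
path = dtw_path(ref, query)
coeffs = np.polyfit(path[:, 0], path[:, 1], deg=2)    # quadratic fit of the path
rhythm_penalty = abs(coeffs[0])                       # curvature ~ tempo drift
print("quadratic coefficient:", coeffs[0], "penalty:", rhythm_penalty)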
Conference Paper
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
Conference Paper
This paper presents a generic approach for automatic singing assessment for basic singing levels. The system provides the user with a set of intonation, rhythm and overall ratings obtained by measuring the similarity of the sung melody and a target performance. Two different similarity approaches are discussed: f0 curve alignment through Dynamic Time Warping (DTW), and singing transcription plus note-level similarity. From these two approaches, we extract different intonation and rhythm similarity measures which are combined through quadratic polynomial regression analysis in order to fit the judgement of 4 trained musicians on 27 performances. The results show that the proposed system is suitable for automatic singing voice rating and that DTW based measures are specially simple and effective for intonation and rhythm assessment.
Article
We review and discuss recent developments in best–worst scaling (BWS) that allow researchers to measure items or objects on measurement scales with known properties. We note that BWS has some distinct advantages compared with other measurement approaches, such as category rating scales or paired comparisons. We demonstrate how to use BWS to measure subjective quantities in two different empirical examples. One of these measures preferences for weekend getaways and requires comparing relatively few objects; a second measures academics' perceptions of the quality of academic marketing journals and requires comparing a significantly large set of objects. We conclude by discussing some limitations and future research opportunities related to BWS.
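A minimal sketch of "best minus worst" scoring from BWS responses (toy trials; normalising by how often each item appeared is one common choice, not a prescribed step):

# Sketch of best-minus-worst scoring: each item's score is
# (# times chosen best) - (# times chosen worst), normalised by appearances.
from collections import Counter

# each trial: (items shown, item picked best, item picked worst)
trials = [
    (["A", "B", "C", "D"], "A", "D"),
    (["A", "C", "D", "E"], "C", "D"),
    (["B", "C", "D", "E"], "C", "E"),
]

best, worst, shown = Counter(), Counter(), Counter()
for items, b, w in trials:
    shown.update(items)
    best[b] += 1
    worst[w] += 1

scores = {i: (best[i] - worst[i]) / shown[i] for i in shown}
print(sorted(scores.items(), key=lambda kv: -kv[1]))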
Article
This study aims to develop an automatic singing evaluation system for Karaoke performances. Many Karaoke systems in the market today come with a scoring function. The addition of the feature enhances the entertainment appeal of the system due to the competitive nature of humans. The automatic Karaoke scoring mechanism to date, however, is still rudimentary, often giving results inconsistent with scoring by human raters. A common source of error is that often only the singing volume is used as the evaluation criterion. To improve on the singing evaluation capabilities of Karaoke machines, this study exploits various acoustic features, including pitch, volume, and rhythm, to assess a singing performance. We invited a number of singers having different levels of singing capabilities to record Karaoke solo vocal samples. The performances were rated independently by four musicians, and then used in conjunction with additional Karaoke Video Compact Disk music for the training of our proposed system. Our experiment shows that the results of automatic singing evaluation are close to the human rating, where the Pearson product-moment correlation coefficient between them is 0.82.
Article
It has been conjectured that the development of vocal pitch accuracy in singing is partly determined by the quality of feedback available to an error labelling schema (Welch, 1985 [a] and [b]). As a practical application of this theoretical construct to the classroom, a micro-computer-based system (Howard and Welch, 1987) has been developed to provide real-time visual feedback of vocal pitch production to the user. This system has recently undergone initial trials with a class of seven-year-old children in a Bristol school. The sample (n = 32) was divided into three matched groups: (a) experimental interactive, (b) experimental non-interactive, and (c) control, based on an initial system-driven assessment of their vocal pitch accuracy. Groups (a) and (b) used the system and the control group (c) undertook singing activities of a more traditional nature. After one school term, vocal pitch accuracy was re-assessed. The two experimental groups recorded a significant improvement in their vocal pitch matching ability compared to the control group. The results are seen as supportive of the previously stated theoretical position and confirm that action can encourage the underlying developmental process (Welch, 1986).
Article
This paper presents the results of two experiments on singing skill evaluation, where human subjects judge the subjective quality of previously unheard melodies. The aim of this study is to explore the criteria that human subjects use in judging singing skill and the stability of their judgments, as a basis for developing an automatic singing skill evaluation scheme. The experiments use the rank ordering method, where the subjects ordered a group of given stimuli according to their preferred rankings. Experiment 1 uses real, a cappella singing as the stimuli, while experiment 2 uses the fundamental frequency (F0) sequence extracted from the singing. In experiment 1, 88.9% of the correlation between the subjects' evaluations was significant at the 5% level. Results of experiment 2 show that the F0 sequence is significant in only certain cases, so that the judgment and its stability in experiment 1 should be attributed to other factors of real singing.
Conference Paper
This paper describes a study of subjective criteria for evaluating the singing voice quality of untrained singers, focusing on the perceptual aspects that have relatively strong acoustic implications. The correlation among the individual perceptual criteria is also investigated. An SVM regression method is applied to find the importance of every evaluation criterion. Experiments conducted on a dataset of 200 singing clips give promising results, and probable acoustic cues are introduced as a future prospect.
Conference Paper
This paper presents a method of evaluating singing skills that does not require score information of the sung melody. This requires an approach that is different from existing systems, such as those currently used for Karaoke systems. Previous research on singing evaluation has focused on analyzing the characteristics of singing voice, but were not aimed at developing an automatic evaluation method. The approach presented in this study uses pitch interval accuracy and vibrato as acoustic features which are independent from specific characteristics of the singer or melody. The approach was tested by a 2-class (good/poor) classification test with 600 song sequences, and achieved an average classification rate of 83.5%.
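A sketch in the spirit of this two-class approach, assuming synthetic melody-independent features (pitch-interval accuracy and vibrato statistics) and an off-the-shelf SVM from scikit-learn; the feature values and classifier choice are illustrative:

# Sketch: two-class (good/poor) singing classification from
# melody-independent features. Feature values here are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# columns: pitch-interval accuracy, vibrato extent, vibrato rate stability
good = rng.normal([0.9, 0.6, 0.8], 0.05, size=(100, 3))
poor = rng.normal([0.6, 0.3, 0.4], 0.10, size=(100, 3))
X = np.vstack([good, poor])
y = np.array([1] * 100 + [0] * 100)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())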
Conference Paper
Recently the 'Million Song Dataset', containing audio features and metadata for one million songs, was made available. In this paper, we build a convolutional network that is then trained to perform artist recognition, genre recognition and key detection. The network is tailored to summarize the audio features over musically significant timescales. It is infeasible to train the network on all available data in a supervised fashion, so we use unsupervised pretraining to be able to harness the entire dataset: we train a convolutional deep belief network on all data, and then use the learnt parameters to initialize a convolutional multilayer perceptron with the same architecture. The MLP is then trained on a labeled subset of the data for each task. We also train the same MLP with randomly initialized weights. We find that our convolutional approach improves accuracy for the genre recognition and artist recognition tasks. Unsupervised pretraining improves convergence speed in all cases. For artist recognition it improves accuracy as well.
Article
Four real-time visual feedback computer tools for singing lessons (SINGAD, ALBERT, SING & SEE, and WinSINGAD), and the research carried out to evaluate the usefulness of these systems, are reviewed in this article. We report on the development of user-functions and the usability of these computer-assisted learning tools. Both quantitative and qualitative studies confirm the efficiency of real-time visual feedback in improving singing abilities. Having addressed these findings, we suggest further quantitative investigations of (1) the detailed effect of visual feedback on performance accuracy and on the learning process, and (2) the interactions between improvement of musical performance and the type of visual feedback and the amount of information it presents, the skill level of the user and the teacher's role.
Article
Experts were interviewed to identify criteria for evaluation of vocal performance. A scale was then constructed and inter- and intrajudge reliability assessed. Experts listened to 19 different performances, plus 6 presented a second time. Interjudge reliability for one judge was modest, but increased dramatically as the size of the judge panel increased. The most reliable items were overall score and intonation accuracy. Diction was less reliable than other items. Intrajudge reliability was higher for overall score than for any other item. A factor analysis on the test items yielded factors labelled intrinsic quality, execution, and diction. Another factor analysis, using the experts as variables, revealed two underlying evaluative dimensions. It was found that 13 experts were primarily influenced by execution, and that 8 were mainly affected by intrinsic quality. Interjudge and intrajudge reliabilities of these two groups differed.
Article
An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
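The core of this estimator (the YIN difference function and its cumulative-mean-normalised form) can be sketched as follows; parabolic interpolation and the best-local-estimate search are omitted, and the threshold and frequency range are assumed values:

# Sketch of a YIN-style F0 estimate: difference function, cumulative-mean
# normalisation, then the first dip below a threshold (descended to its
# local minimum) gives the period. Simplified and illustrative only.
import numpy as np

def yin_f0(frame, sr, fmin=80.0, fmax=500.0, threshold=0.1):
    max_tau = int(sr / fmin)
    min_tau = int(sr / fmax)
    # difference function d(tau)
    d = np.array([np.sum((frame[:-tau] - frame[tau:]) ** 2) if tau else 0.0
                  for tau in range(max_tau)])
    # cumulative mean normalised difference d'(tau)
    cumsum = np.cumsum(d[1:])
    d_norm = np.ones_like(d)
    d_norm[1:] = d[1:] * np.arange(1, max_tau) / np.where(cumsum == 0, 1, cumsum)
    # first dip below the threshold, descended to its local minimum
    for tau in range(min_tau, max_tau):
        if d_norm[tau] < threshold:
            while tau + 1 < max_tau and d_norm[tau + 1] < d_norm[tau]:
                tau += 1
            return sr / tau
    return 0.0  # no clear periodicity found

sr = 16000
t = np.arange(0, 0.05, 1 / sr)
frame = np.sin(2 * np.pi * 220 * t)          # 220 Hz test tone
print(yin_f0(frame, sr))                     # ~219-220 Hz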
Article
Musical abilities are generally regarded as an evolutionary by-product of more important functions, such as those involved in language. However, there is increasing evidence that humans are born with musical predispositions that develop spontaneously into sophisticated knowledge bases and procedures that are unique to music. Recent findings also suggest that the brain is equipped with music-specific neural networks and that these can be selectively compromised by a congenital anomaly. This results in a disorder, congenital amusia, that appears to be limited to the processing of music. Recent evidence points to fine-grained perception of pitch as the root of musical handicap. Hence, musical abilities appear to depend crucially on the fine-tuning of pitch, in much the same way that language abilities rely on fine time resolution.
Conference Paper
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862, replacing P.861 which specified PSQM and MNB
Article
One of the most time honored methods of detecting pitch is to use some type of autocorrelation analysis on speech which has been appropriately preprocessed. The goal of the speech preprocessing in most systems is to whiten, or spectrally flatten, the signal so as to eliminate the effects of the vocal tract spectrum on the detailed shape of the resulting autocorrelation function. The purpose of this paper is to present some results on several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal. The types of nonlinearities which are considered are classified by a non-linear input-output quantizer characteristic. By appropriate adjustment of the quantizer threshold levels, both the ordinary (linear) autocorrelation analysis, and the center clipping-peak clipping autocorrelation of Dubnowski et al. [1], can be obtained. Results are presented to demonstrate the degree of spectrum flattening obtained using these methods. Each of the proposed methods was tested on several of the utterances used in a recent pitch detector comparison study by Rabiner et al. [2]. Results of this comparison are included in this paper. One final topic which is discussed in this paper is an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis.
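A sketch of the center-clipping idea described above: clip away low-amplitude samples to flatten the spectrum, then take the autocorrelation peak within a plausible lag range as the pitch period. The clipping ratio and frequency range are illustrative choices, not the paper's settings:

# Sketch: center clipping followed by autocorrelation peak picking.
import numpy as np

def center_clip(x, ratio=0.3):
    c = ratio * np.max(np.abs(x))
    y = np.zeros_like(x)
    y[x > c] = x[x > c] - c        # positive excursions above the threshold
    y[x < -c] = x[x < -c] + c      # negative excursions below the threshold
    return y

def autocorr_pitch(x, sr, fmin=80.0, fmax=500.0):
    y = center_clip(x)
    ac = np.correlate(y, y, mode="full")[len(y) - 1:]   # non-negative lags
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(0, 0.05, 1 / sr)
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
print(autocorr_pitch(signal, sr))    # close to 200 Hz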