Similar publications
The Simuntu performance has been a tradition in the Nagari Koto Kaciak community from the past to the present and is supported by the local people. To this day, the Simuntu performance has not changed in either its form or its function, always adhering to the heritage passed down from earlier ancestors. The Simuntu performance without ...
Citations
... For example, in automatic speech recognition (ASR), performance has improved beyond recognition compared with what was possible ten years ago. In the CHiME, REVERB, Blizzard, and Hurricane challenges, researchers have made rapid progress by building on open-source baseline software that is improved in each round [4,5,6,7]. Advances can also be attributed to the availability of speech corpora recorded in various environments. Recent developments in machine learning applied to noise reduction and speech enhancement indicate that this is a promising approach for hearing aid speech signal processing. ...
In recent years, rapid advances in speech technology have been made possible by machine learning challenges such as CHiME, REVERB, Blizzard, and Hurricane. In the Clarity project, the machine learning approach is applied to the problem of hearing aid processing of speech-in-noise, where current technology in enhancing the speech signal for the hearing aid wearer is often ineffective. The scenario is a (simulated) cuboid-shaped living room in which there is a single listener, a single target speaker and a single interferer, which is either a competing talker or domestic noise. All sources are static, the target is always within ±30 degrees azimuth of the listener and at the same elevation, and the interferer is an omnidirectional point source at the same elevation. The target speech comes from an open source 40-speaker British English speech database collected for this purpose. This paper provides a baseline description of the round one Clarity challenges for both enhancement (CEC1) and prediction (CPC1). To the authors' knowledge, these are the first machine learning challenges to consider the problem of hearing aid speech signal processing.
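The scenario above is fully described by a small set of scene parameters. As a hedged illustration only, the sketch below samples one such configuration in Python; the field names, ranges, and interferer labels are assumptions for illustration and are not the official CEC1/CPC1 scene-generation code.

```python
# Illustrative sampling of a Clarity-style scene: one listener, one target
# within +/-30 degrees azimuth at the same elevation, and one static
# interferer (competing talker or domestic noise). All names and ranges are
# assumptions, not the challenge's actual configuration format.
import random
from dataclasses import dataclass

@dataclass
class Scene:
    room_dims_m: tuple          # (length, width, height) of the cuboid living room
    target_azimuth_deg: float   # target bearing relative to the listener
    interferer_type: str        # "speech" or "domestic_noise"

def sample_scene(rng: random.Random) -> Scene:
    return Scene(
        room_dims_m=(rng.uniform(3.0, 8.0), rng.uniform(3.0, 6.0), rng.uniform(2.2, 3.0)),
        target_azimuth_deg=rng.uniform(-30.0, 30.0),   # +/-30 degree constraint from the description
        interferer_type=rng.choice(["speech", "domestic_noise"]),
    )

if __name__ == "__main__":
    print(sample_scene(random.Random(0)))
```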
... We used all available results of the Blizzard Challenge [22,23,24,25,26,27,28,29,30,31], which are currently all years except for 2017 and 2018 (see Table 2 ...
In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works language independently. The model is trained end-to-end and based on a CNN-LSTM network that previously showed to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, such as from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
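To make the CNN-LSTM design mentioned above concrete, here is a minimal PyTorch-style sketch of such a naturalness predictor: a small CNN extracts features from a mel-spectrogram, an LSTM summarises them over time, and a linear head regresses a single MOS-like score. The layer sizes and the mel-spectrogram input are assumptions for illustration, not the published architecture.

```python
# Minimal CNN-LSTM naturalness predictor sketch (assumed layer sizes/features).
import torch
import torch.nn as nn

class NaturalnessPredictor(nn.Module):
    def __init__(self, n_mels: int = 48):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, 1)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        x = self.cnn(mel.unsqueeze(1))       # (batch, 32, n_mels // 4, frames)
        x = x.flatten(1, 2).transpose(1, 2)  # (batch, frames, features)
        out, _ = self.lstm(x)
        return self.head(out.mean(dim=1))    # average over time, predict one score

mel = torch.randn(4, 48, 200)                # batch of 4 dummy utterances
print(NaturalnessPredictor()(mel).shape)     # torch.Size([4, 1])
```

Transfer learning of the kind described would then amount to pre-training this network on quality scores before fine-tuning it on naturalness ratings.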
... Out-domain sentences are used for evaluation. They include test sentences from Blizzard Challenges 2013-15 [22,23,24] and sentences taken from the web. 10 native listeners of the language are used in each test. Evaluators are in the age group 20-35 and have no known hearing defects. ...
... Monolingual systems are built using 5 hours of data, while the mixed TTS is trained using 2.5 hours of each of its constituent languages. The code-mixing ability of each system is evaluated on a set of code-mixed Hindi+English text [23,24] using a DMOS test. 10 listeners, who are proficient in both languages, evaluated 7 code-mixed synthesised utterances generated by each system. ...
... Increasingly, many technologies such as Web search and natural language processing are adapting to this phenomenon [3,4,5]. In the area of speech synthesis, although the efforts of the 2013, 2014 and 2015 Blizzard Challenges [6,7] resulted in improvements to the naturalness of speech synthesis of Indian languages, the text was assumed to be written in native script. In this work, we transliterate Blizzard data to informal chat-style ASCII text using Mechanical Turkers, and synthesize speech from the resulting transliterated ASCII text. ...
... No intelligibility evaluation was conducted since transcription word error rate (WER) has been found to be a poor metric for Indian languages, cf. [6]. However, we believe listeners do take into account intelligibility while rating the stimuli, even though they were asked to rate the naturalness. ...
... All the data and samples used in the listening tests are available online at: http://srikanthr.in/indic-speech-synthesis (see also http://srikanthr.in/indic-search). ...
Text-to-Speech synthesis in Indian languages has seen a lot of progress over the past decade, partly due to the annual Blizzard Challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which have nearly phonemic orthographies. However, the most common form of computer interaction among Indians is transliterated text written in ASCII. Such text is generally noisy, with many spelling variations for the same word. In this paper we evaluate three approaches to synthesize speech from such noisy ASCII text: a naive Uni-Grapheme approach, a Multi-Grapheme approach, and a supervised Grapheme-to-Phoneme (G2P) approach. These methods first convert the ASCII text to a phonetic script and then train a deep neural network to synthesize speech from it. We train and test our models on Blizzard Challenge datasets that were transliterated to ASCII using crowdsourcing. Our experiments on Hindi, Tamil and Telugu demonstrate that our models generate speech of competitive quality from ASCII text compared to speech synthesized from the native scripts. All the accompanying transliterated datasets are released for public access.
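To illustrate the difference between the first two front ends named above, the sketch below tokenises a transliterated Hindi word either character by character (Uni-Grapheme) or with greedy merging of common two-character units (Multi-Grapheme). The digraph inventory and the example word are assumptions for illustration; the paper's actual symbol sets and its supervised G2P model are not shown.

```python
# Uni-Grapheme vs. Multi-Grapheme tokenisation of ASCII transliterated text
# (illustrative digraph list; not the paper's actual inventory).
DIGRAPHS = {"aa", "ee", "oo", "ai", "au", "kh", "gh", "ch", "th", "dh", "ph", "bh", "sh"}

def uni_grapheme(word: str):
    """Treat every ASCII character as its own symbol."""
    return list(word)

def multi_grapheme(word: str):
    """Greedily merge known two-character units (long vowels, aspirates)."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(uni_grapheme("dhanyavaad"))    # ['d','h','a','n','y','a','v','a','a','d']
print(multi_grapheme("dhanyavaad"))  # ['dh','a','n','y','a','v','aa','d']
```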
... In the literature, most proposed methods are evaluated perceptually. For instance, the Blizzard Challenge relies on a large-scale evaluation campaign [1,2], but each time the number of utterances under test is restricted. The same is true of the majority of published evaluations. ...
Subjective evaluation is a crucial problem in the speech processing community, and especially in the speech synthesis field, no matter what system is used. Indeed, when trying to assess the effectiveness of a proposed method, researchers usually conduct subjective evaluations by randomly choosing a small set of samples from the same domain, taken from a baseline system and the proposed one. When samples are selected randomly, statistically, pairs with almost no differences are evaluated and the global measure is smoothed, which may lead to the improvement being judged not significant.
To address this methodological flaw, we propose to compare speech synthesis systems on thousands of generated samples from various domains and to focus subjective evaluations on the most relevant ones by computing a normalized alignment cost between sample pairs. This process has been successfully applied both in the HTS statistical framework and in the corpus-based approach. We conducted two perceptual experiments, generating more than 27,000 samples for each system under comparison. A comparison between tests involving the most different samples and randomly chosen samples shows clearly that the proposed approach reveals significant differences between the systems.
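As a hedged sketch of the selection idea described above: compute a normalised alignment cost between the two systems' renderings of each sentence and keep the sentences with the largest cost for the listening test. The use of DTW over MFCC-like feature sequences and the normalisation by path length are assumptions for illustration; the paper's own alignment cost may be defined differently.

```python
# Select the "most different" sentence pairs via a normalised DTW cost
# (illustrative feature front end and normalisation).
import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    """Plain DTW over per-frame Euclidean distances, normalised by sequence lengths."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)       # make long and short sentences comparable

def most_different(pairs, k=50):
    """pairs: list of (sentence_id, features_system_A, features_system_B)."""
    scored = [(dtw_cost(fa, fb), sid) for sid, fa, fb in pairs]
    return [sid for _, sid in sorted(scored, reverse=True)[:k]]

# Toy usage with random feature sequences standing in for the two systems' outputs.
rng = np.random.default_rng(0)
pairs = [(i, rng.normal(size=(80, 13)), rng.normal(size=(90, 13))) for i in range(5)]
print(most_different(pairs, k=2))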
Statistical parametric speech synthesis conventionally uses decision-tree-clustered context-dependent hidden Markov models (HMMs) to model speech parameters. However, decision trees are unable to capture complex context dependencies and fail to model the interaction between linguistic features. Recently, deep neural networks (DNNs) have been applied to speech synthesis, and they can address some of these limitations. This paper focuses on the prediction of phone durations in Text-to-Speech (TTS) systems using feedforward DNNs in the case of short sentences (sentences containing only one, two or three syllables). To achieve better prediction accuracy, hyperparameter optimization was carried out with manual grid search. Recordings from a male and a female speaker were used to train the systems, and the outputs of various configurations were compared against conventional HMM-based solutions and natural speech. Experimental results of objective evaluations show that DNNs can outperform previous state-of-the-art solutions in duration modeling.
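The setup above boils down to regressing a phone's duration from its linguistic context features and comparing a small manual grid of network configurations on a held-out set. The sketch below shows that workflow; the feature dimensionality, the grid, the random placeholder data, and the use of scikit-learn's MLPRegressor are assumptions made for illustration, not the paper's implementation.

```python
# Feedforward duration model with a manual hyperparameter grid (illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# In practice X holds binary/numeric linguistic context features per phone and
# y the phone duration in seconds; here both are random placeholders.
X_train, y_train = rng.normal(size=(2000, 300)), rng.uniform(0.03, 0.3, size=2000)
X_dev, y_dev = rng.normal(size=(500, 300)), rng.uniform(0.03, 0.3, size=500)

best = None
for hidden in [(256,), (256, 256), (512, 256)]:       # manual grid search
    for lr in [1e-3, 1e-4]:
        model = MLPRegressor(hidden_layer_sizes=hidden, learning_rate_init=lr,
                             max_iter=200, random_state=0)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_dev, model.predict(X_dev)) ** 0.5
        if best is None or rmse < best[0]:
            best = (rmse, hidden, lr)

print("best dev RMSE %.4f with hidden=%s lr=%g" % best)
```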