September 2022 · 25 Reads · 10 Citations
August 2022 · 227 Reads
Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects - remain challenging. One challenge is that models with high-quality transfer capabilities can suffer from stability issues, making them impractical for user-facing, critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that models trained on synthetic data in this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.
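As a rough illustration of the two-stage recipe in this abstract, the Python sketch below generates a synthetic accent-transferred corpus with a transfer-capable teacher model and then trains a more robust student model on it. All class and method names (teacher.synthesize, student.training_step, etc.) are hypothetical placeholders, not the actual CHiVE-BERT or Tacotron APIs.

```python
# Minimal sketch of the teacher/student data-generation idea described above.
# All names here are hypothetical, not the real CHiVE-BERT / Tacotron interfaces.

def build_synthetic_corpus(teacher, texts, target_speakers, accent):
    """Use the less robust, transfer-capable teacher TTS to synthesize
    accented speech in each target speaker's voice."""
    corpus = []
    for speaker in target_speakers:
        for text in texts:
            audio = teacher.synthesize(text, speaker=speaker, accent=accent)
            corpus.append({"text": text, "speaker": speaker, "audio": audio})
    return corpus

def train_student(student, synthetic_corpus, epochs=10):
    """Train the robust student TTS on the synthetic corpus only."""
    for _ in range(epochs):
        for example in synthetic_corpus:
            student.training_step(example["text"], example["speaker"], example["audio"])
    return student
```

The point of the recipe is that the student never needs real accented recordings of the target speakers: robustness comes from the student architecture, while the accent comes from the teacher's synthetic output.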
August 2021 · 144 Reads · 86 Citations
August 2021 · 34 Reads · 54 Citations
June 2021 · 54 Reads · 101 Citations
March 2021 · 75 Reads
This paper introduces PnG BERT, a new encoder model for neural TTS. The model augments the original BERT model by taking both phoneme and grapheme representations of text as input, along with the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner and fine-tuned on a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between speech synthesized using PnG BERT and ground-truth recordings from professional speakers.
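A minimal sketch of how a PnG-BERT-style input sequence might be assembled is shown below: a phoneme segment followed by a grapheme segment, with a segment ID distinguishing the two halves and a shared word-level position ID aligning the phonemes and graphemes of the same word. The special tokens, the g2p callable, and the exact ID scheme are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative construction of a phoneme+grapheme input with word-level alignment.
# Token values and id conventions are made up for illustration.

def build_png_bert_input(words, g2p):
    """words: list of grapheme words; g2p: callable mapping a word to its phonemes."""
    tokens, segment_ids, word_positions = ["[CLS]"], [0], [0]

    # Phoneme segment (segment id 0)
    for w_idx, word in enumerate(words, start=1):
        for ph in g2p(word):
            tokens.append(ph)
            segment_ids.append(0)
            word_positions.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_positions.append(0)

    # Grapheme segment (segment id 1), sharing word-level positions with the phonemes
    for w_idx, word in enumerate(words, start=1):
        for ch in word:
            tokens.append(ch)
            segment_ids.append(1)
            word_positions.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_positions.append(0)

    return tokens, segment_ids, word_positions

# Example with a toy g2p:
# build_png_bert_input(["hello"], lambda w: ["HH", "AH", "L", "OW"])
```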
March 2021 · 86 Reads · 1 Citation
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. Based on a novel attention mechanism and an iterative reconstruction loss using Soft Dynamic Time Warping, the duration model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
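To make the Soft Dynamic Time Warping idea concrete, here is a plain NumPy sketch of a soft-DTW loss between predicted and reference frame sequences, following the standard soft-minimum recursion. It is only an illustration of the kind of alignment-free reconstruction loss referred to above, not the paper's implementation; the cost function and gamma value are assumptions.

```python
import numpy as np

def soft_min(a, b, c, gamma):
    """Differentiable soft minimum of three values (log-sum-exp form)."""
    vals = np.array([a, b, c]) / -gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def soft_dtw(pred, target, gamma=0.1):
    """pred: (T1, D) predicted frames; target: (T2, D) reference frames.
    Returns a soft alignment cost that is differentiable in pred."""
    T1, T2 = len(pred), len(target)
    # Pairwise squared-error cost between every predicted and reference frame.
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    R = np.full((T1 + 1, T2 + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                R[i - 1, j], R[i, j - 1], R[i - 1, j - 1], gamma)
    return R[T1, T2]
```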
October 2020 · 81 Reads · 1 Citation
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly reduced inference time.
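The iterative spectrogram loss can be pictured as penalizing the spectrogram emitted after every refinement iteration against the same target. The sketch below, with illustrative shapes and an L1 distance chosen for simplicity, is only one plausible form of such a loss under those assumptions, not the paper's exact definition.

```python
import numpy as np

def iterative_spectrogram_loss(predicted_iterates, target):
    """predicted_iterates: list of (T, n_mels) arrays, one per refinement iteration;
    target: (T, n_mels) ground-truth mel spectrogram.
    Every iterate is compared against the same target so later iterations refine earlier ones."""
    total = 0.0
    for pred in predicted_iterates:
        total += np.abs(pred - target).mean()   # L1 loss per iterate (illustrative choice)
    return total / len(predicted_iterates)
```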
October 2020 · 533 Reads
This paper presents Non-Attentive Tacotron, based on the Tacotron 2 text-to-speech model, which replaces the attention mechanism with an explicit duration predictor. This significantly improves robustness as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
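Gaussian upsampling, mentioned above, can be sketched as follows: each input token contributes to every output frame with a weight proportional to a Gaussian centered at the token's midpoint on the frame axis, as determined by the predicted durations. The per-token range parameters (sigmas) and the variable names in this NumPy sketch are illustrative, not the paper's exact formulation.

```python
import numpy as np

def gaussian_upsample(h, durations, sigmas):
    """h: (N, D) token encodings; durations, sigmas: (N,) per-token values.
    Returns (T, D) frame-level features, with T = round(sum(durations))."""
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0            # token midpoints on the frame axis
    T = int(round(ends[-1]))
    t = np.arange(T) + 0.5                      # frame midpoints
    # Unnormalized Gaussian weights, shape (T, N)
    w = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigmas[None, :]) ** 2)
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)   # normalize over tokens per frame
    return w @ h
```

Because the weights are smooth functions of the predicted durations, the upsampling remains differentiable, which is what allows duration to be controlled explicitly at inference time.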
May 2020 · 27 Reads · 2 Citations
... For instance, Hida et al. (2022) demonstrated that combining explicit morphological features with pretrained language models like BERT significantly enhances pitch-accent prediction (Hida et al., 2022). Similarly, pretrained models such as PnG BERT have shown promise in integrating grapheme and phoneme information for better pitch-accent rendering (Jia et al., 2021). Yasuda and Toda (2022) found that fine-tuning PnG BERT with tone prediction tasks led to notable improvements over baseline Tacotron models (Wang et al., 2017) in accent correctness (Yasuda and Toda, 2022). ...
August 2021
... The Tacotron2 model has been used for text-to-spectrogram conversion, and Waveglow then converts the spectrogram into target audio samples. While there are multiple options available for the spectrogram prediction network and the audio synthesis network, we choose Tacotron2 + Waveglow as they are competitive with other architectures and still popular in the literature [10,1,12,11]. Moreover, there are single-stage end-to-end deep learning models available, but these are not considered in this work due to high data requirements. ...
September 2022
... The emergence of large language models (LLMs) has accelerated the evolution of text-to-speech systems from traditional pipeline architectures to end-to-end neural approaches, and more recently to LLM-powered zero-shot cloning systems [Naveed et al., 2023, Chang et al., 2024, Popov et al., 2021, Touvron et al., 2023a,b, Mehta et al., 2024]. Early neural models such as Tacotron 2 [Elias et al., 2021] and FastSpeech [Ren et al., 2020] introduced sequence-to-sequence frameworks with attention and duration prediction, significantly improving speech quality and synthesis speed. Later, VITS proposed a fully end-to-end probabilistic model that integrates text encoding, duration modeling, and waveform generation into a single variational framework, enabling faster inference and more expressive speech. ...
August 2021
... It is trained with UTF-8 byte-based input representations derived from a text encoder which allows for sharing representations across languages efficiently. The major components of the T2F model are based on Parallel Tacotron 2 [17], [18]. Inference begins with a text encoder that transforms the linguistic information into a sequence of hidden representations. ...
June 2021
... Text-to-Speech (TTS) is one of the primary applications of audio tokens. Traditional TTS systems typically rely on neural networks that predict mel spectrograms from text (Shen et al., 2018; Ren et al., 2019), followed by neural vocoders (Morise et al., 2016; Kong et al., 2020; van den Oord et al., 2016) to synthesize waveforms. The introduction of discrete audio tokens offers several advantages. ...
April 2018
... Consequently, cloning someone's voice was most of the time either impossible or prohibitively expensive. Yet, providing voiceprints as additional information when training a TTS system makes it possible to clone a voice with only a few seconds of audio material and without the need to train a new system [20][21][22]. Even if the results are not yet as convincing as previously used techniques, the essential prerequisites have been met to convert any given text into speech and predetermine the voice used by providing a voiceprint. ...
June 2018
... Advancements in Text-to-Speech (TTS) models [1,2,3,4,5] and neural vocoders [6,7,8,9] have made synthetic voices nearly indistinguishable from human speech. Consequently, more attention has been drawn to refining the expressiveness of synthetic voices, particularly through the accurate manipulation of prosodic features and pronunciation. ...
December 2017