April 2025 · 35 Reads
September 2024 · 5 Reads
In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multilingual text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker encoder that processes reference speech, a bottleneck layer, and residual adapters, connected to pre-existing TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available - a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers, are available here (google.github.io/tacotron/publications/zero_shot_voice_transfer).
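As an illustration of the architecture described above, here is a minimal sketch of a speaker-conditioned residual adapter attached to a TTS layer. The paper does not publish code; module names, tensor shapes, the bottleneck size, and the assumption that the pre-existing backbone stays frozen (typical for adapters) are all mine.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Speaker-conditioned bottleneck adapter for a frozen TTS layer (sketch)."""

    def __init__(self, hidden_dim: int, spk_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, hidden_dim)  # inject speaker embedding
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) output of a frozen TTS layer
        # spk_emb: (batch, spk_dim) embedding from the speaker encoder
        cond = h + self.spk_proj(spk_emb).unsqueeze(1)   # broadcast over time
        return h + self.up(torch.relu(self.down(cond)))  # residual update
```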
April 2024 · 13 Reads · 15 Citations
August 2021 · 145 Reads · 86 Citations
June 2021 · 54 Reads · 101 Citations
March 2021 · 86 Reads · 1 Citation
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping; together, these allow the model to learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
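The differentiable alignment rests on Soft Dynamic Time Warping. As a rough, self-contained illustration of Soft-DTW itself (my own sketch, not the paper's implementation, which adds the attention mechanism and operates on spectrogram frames):

```python
import numpy as np
from scipy.special import logsumexp

def soft_dtw(D: np.ndarray, gamma: float = 0.1) -> float:
    """Soft-DTW cost over a pairwise distance matrix D of shape (n, m).

    Replaces the hard min of classic DTW with a soft-min,
    softmin_g(a) = -g * logsumexp(-a / g), making the alignment cost
    differentiable with respect to D (and hence to the predictions).
    """
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)  # accumulated-cost table
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            R[i, j] = D[i - 1, j - 1] - gamma * logsumexp(-prev / gamma)
    return float(R[n, m])
```

As gamma approaches zero, the soft-min approaches a hard min and the cost recovers classic DTW; larger gamma gives a smoother, easier-to-optimize loss surface.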
October 2020 · 81 Reads · 1 Citation
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.
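A minimal sketch of what an iterative spectrogram loss could look like, assuming each decoder iteration emits its own spectrogram prediction and every prediction is penalized against the same target (the exact form in the paper may differ):

```python
import torch
import torch.nn.functional as F

def iterative_spectrogram_loss(predictions: list[torch.Tensor],
                               target: torch.Tensor) -> torch.Tensor:
    """Sum an L1 loss over every intermediate spectrogram prediction.

    predictions: one (batch, frames, mel_bins) tensor per decoder
    iteration; target: the ground-truth mel spectrogram. Supervising
    each iteration pushes later iterations to refine earlier ones.
    """
    return sum(F.l1_loss(p, target) for p in predictions)
```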
October 2020 · 533 Reads
This paper presents Non-Attentive Tacotron, based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a mean opinion score for naturalness of 4.41 on a 5-point scale, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
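Gaussian upsampling is concrete enough to sketch: token i is placed at center c_i = d_1 + ... + d_i - d_i / 2 with a predicted range sigma_i, and each output frame is a weighted combination of all token states under normalized Gaussian densities. A minimal version, with variable names and the half-frame offset as my assumptions:

```python
import torch

def gaussian_upsample(h: torch.Tensor, durations: torch.Tensor,
                      sigma: torch.Tensor) -> torch.Tensor:
    """Soft, differentiable upsampling of encoder states by duration.

    h:         (tokens, dim) encoder outputs
    durations: (tokens,) per-token durations in frames
    sigma:     (tokens,) predicted range (std dev) per token
    Returns (total_frames, dim) frame-level features.
    """
    ends = torch.cumsum(durations, dim=0)
    centers = ends - durations / 2.0                      # c_i = sum d_j - d_i / 2
    t = torch.arange(int(ends[-1]), dtype=h.dtype) + 0.5  # frame positions (assumed offset)
    # Gaussian density of token i at frame t; the sqrt(2*pi) factor
    # cancels in the normalization below.
    dens = torch.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma[None, :]) ** 2)
    dens = dens / sigma[None, :]
    w = dens / dens.sum(dim=1, keepdim=True)              # normalize over tokens
    return w @ h
```

Because the weights are smooth in both durations and sigma, gradients flow through the upsampling step, unlike hard repeat-by-duration expansion.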
... However, we support specifying the enrollment phrase either as audio or as text. In the case of text enrollment, we use Text-to-Speech (TTS) models [6,7] to generate audio, which is then used to generate the embedding. Furthermore, if multiple enrollment embeddings are available, the centroid of the enrollment embeddings is used to compute the cosine distance. ...
April 2024
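The enrollment logic in the snippet above reduces to a few lines: average the available enrollment embeddings into a centroid, then compare by cosine distance. A minimal sketch with hypothetical shapes and an illustrative, untuned threshold:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accept(enrollment_embs: np.ndarray, test_emb: np.ndarray,
           threshold: float = 0.5) -> bool:
    """Score a test embedding against the centroid of enrollments.

    enrollment_embs: (n_enrollments, dim) array; when several enrollment
    phrases are available, their mean (centroid) serves as the reference.
    The threshold here is a placeholder; real systems tune it on held-out data.
    """
    centroid = enrollment_embs.mean(axis=0)
    return cosine_distance(centroid, test_emb) < threshold
```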
... The emergence of large language models (LLMs) has accelerated the evolution of text-to-speech systems from traditional pipeline architectures to end-to-end neural approaches, and more recently to LLM-powered zero-shot cloning systems [Naveed et al., 2023, Chang et al., 2024, Popov et al., 2021, Touvron et al., 2023a,b, Mehta et al., 2024]. Early neural models such as Tacotron 2 [Elias et al., 2021] and FastSpeech [Ren et al., 2020] introduced sequence-to-sequence frameworks with attention and duration prediction, significantly improving speech quality and synthesis speed. Later, VITS proposed a fully end-to-end probabilistic model that integrates text encoding, duration modeling, and waveform generation into a single variational framework, enabling faster inference and more expressive speech. ...
August 2021
... It is trained with UTF-8 byte-based input representations fed to a text encoder, which allows representations to be shared efficiently across languages. The major components of the T2F model are based on Parallel Tacotron 2 [17], [18]. Inference begins with a text encoder that transforms the linguistic information into a sequence of hidden representations. ...
June 2021
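The UTF-8 byte-based input mentioned in the last snippet is simple to make concrete: every string maps to IDs in a shared 256-entry vocabulary, with no per-language lexicon. A minimal sketch (function name is illustrative):

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Map text to UTF-8 byte IDs (0-255), a vocabulary shared by all languages.

    Byte-level input avoids per-language grapheme or phoneme inventories;
    the same 256-entry embedding table covers any script.
    """
    return list(text.encode("utf-8"))

# Non-ASCII characters expand to multiple byte IDs:
print(text_to_byte_ids("voix"))  # [118, 111, 105, 120]
print(text_to_byte_ids("声"))    # [229, 163, 176]
```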