Isaac Elias’s research while affiliated with Google Inc. and other places


Publications (8)


Speech Re-Painting for Robust ASR
  • Conference Paper

April 2025 · 35 Reads

Kyle Kastner · Gary Wang · Isaac Elias · [...]

Zero-shot Cross-lingual Voice Transfer for TTS

September 2024 · 5 Reads
In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multilingual text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker encoder that processes reference speech, a bottleneck layer, and residual adapters connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity, and the loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss that impacts one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available: a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers, are available here (google.github.io/tacotron/publications/zero_shot_voice_transfer).
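
The abstract outlines the module's shape: a speaker encoder turns reference speech into an embedding, a bottleneck layer compresses it, and residual adapters feed it into preexisting TTS layers. The PyTorch sketch below is a minimal illustration of that wiring only; all module names, dimensions, and the way the speaker embedding is injected are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch of the wiring described in the abstract: speaker encoder
# output -> bottleneck -> residual adapters on preexisting TTS layers.
# All sizes, names, and the injection scheme are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small residual MLP inserted after a preexisting (frozen) TTS layer."""
    def __init__(self, tts_dim: int, adapter_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(tts_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, tts_dim)

    def forward(self, hidden: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, tts_dim); spk: (batch, tts_dim)
        z = torch.relu(self.down(hidden + spk.unsqueeze(1)))
        return hidden + self.up(z)  # residual: the frozen path is preserved

class VoiceTransferModule(nn.Module):
    def __init__(self, spk_dim: int = 256, tts_dim: int = 512):
        super().__init__()
        self.bottleneck = nn.Linear(spk_dim, tts_dim)  # compress reference embedding
        self.adapter = ResidualAdapter(tts_dim)

    def forward(self, tts_hidden: torch.Tensor, ref_embedding: torch.Tensor):
        # ref_embedding comes from a speaker encoder run on reference speech
        spk = self.bottleneck(ref_embedding)
        return self.adapter(tts_hidden, spk)
```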





Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

March 2021 · 86 Reads · 1 Citation
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping; with these, the model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
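
The differentiable piece here is Soft Dynamic Time Warping (Cuturi & Blondel, 2017), which replaces the hard minimum in the DTW recursion with a soft minimum so the alignment cost has usable gradients. A minimal NumPy sketch of that recursion follows; the temperature `gamma` and the squared-Euclidean frame distance are illustrative choices, and the paper's reconstruction loss builds further machinery on top of this plain alignment cost.

```python
# Minimal Soft-DTW sketch: softmin_gamma(a) = -gamma * log(sum(exp(-a / gamma))).
# Illustrative only, not the paper's exact loss.
import numpy as np

def soft_dtw(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """x: (n, d), y: (m, d) feature sequences; returns the soft alignment cost."""
    n, m = len(x), len(y)
    # Pairwise squared-Euclidean distances between frames
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft minimum over the three DTW predecessors (match, insert, delete)
            prev = np.array([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            softmin = -gamma * np.logaddexp.reduce(-prev / gamma)
            r[i, j] = dist[i - 1, j - 1] + softmin
    return float(r[n, m])
```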


Parallel Tacotron: Non-Autoregressive and Controllable TTS

October 2020 · 81 Reads · 1 Citation
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room to improve their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.
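
The lightweight convolutions mentioned here are the softmax-normalized depthwise convolutions of Wu et al. (2019), which share one small kernel across all channels in a head and so use far fewer parameters than a full convolution. A minimal PyTorch sketch, with head count and kernel size chosen for illustration and without the gating and projections that usually surround the block:

```python
# Minimal lightweight-convolution sketch (softmax-normalized, weight-shared
# depthwise conv). Head count and kernel size are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 7, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.kernel_size = kernel_size
        self.heads = heads
        self.weight = nn.Parameter(torch.randn(heads, kernel_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        b, t, d = x.shape
        w = F.softmax(self.weight, dim=-1)  # normalize each kernel over its taps
        # One kernel per head, shared by all d // heads channels of that head
        w = w.repeat_interleave(d // self.heads, dim=0).unsqueeze(1)  # (d, 1, k)
        out = F.conv1d(x.transpose(1, 2), w,
                       padding=self.kernel_size // 2, groups=d)
        return out.transpose(1, 2)  # (batch, time, dim)
```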


Figures and tables from the preprint below:
[Figure 2: Preference test result with Non-Attentive Tacotron with Gaussian upsampling compared against Tacotron 2 (GMMA).]
[Figure 3: Single-word pace control with the sentence "I'm so saddened about the devastation in Big Basin." The top spectrogram is at the regular pace; the rest slow the words "saddened", "devastation", and "Big Basin" respectively to 0.67× the regular pace by scaling the predicted durations by 1.5×.]
[Figure 4: Alignment on the text "What time do I need to show up to my sky diving lesson?" from the unsupervised model. The predicted alignments are from Gaussian upsampling.]
[Table: MOS with 95% confidence intervals.]
[Table: Robustness measured by UDR and WDR on two large evaluation sets.]

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
  • Preprint
  • File available

October 2020 · 533 Reads
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
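
Gaussian upsampling, the piece credited with the naturalness gain, replaces hard repetition of each token by its duration: every output frame is a weighted sum of all token states, with weights from a Gaussian centered at each token's midpoint (computed from the cumulative durations) and scaled by a predicted range parameter. Below is a minimal sketch under those definitions; the shapes and exact normalization are illustrative assumptions rather than the paper's reference code.

```python
# Minimal Gaussian upsampling sketch: frame t attends to token i with weight
# proportional to N(t; c_i, sigma_i^2); the shared 1/sqrt(2*pi) factor
# cancels in the normalization. Shapes are illustrative.
import torch

def gaussian_upsample(h: torch.Tensor, d: torch.Tensor, sigma: torch.Tensor):
    """h: (n_tokens, dim) states, d: (n_tokens,) durations in frames,
    sigma: (n_tokens,) predicted ranges; returns (n_frames, dim)."""
    e = torch.cumsum(d, dim=0)            # end position of each token
    c = e - 0.5 * d                       # token centers
    n_frames = int(e[-1].round().item())
    t = torch.arange(n_frames, dtype=h.dtype) + 0.5
    # Log Gaussian density of every frame position under every token
    log_w = (-0.5 * ((t[:, None] - c[None, :]) / sigma[None, :]) ** 2
             - torch.log(sigma)[None, :])
    w = torch.softmax(log_w, dim=-1)      # normalize across tokens per frame
    return w @ h
```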


Citations (3)


... However, we support the enrollment phrase specification either as audio or as text. In the case of text enrollment, we use Text To Speech (TTS) models [6,7] to generate audio, which is then used to generate the embedding. Furthermore, if multiple enrollment embeddings are available, the centroid of the enrollment embeddings is used to compute the cosine distance. ...
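
Concretely, the scoring this excerpt describes is a centroid plus a cosine distance: average whatever enrollment embeddings are available, then compare the test embedding against that average. A small illustrative sketch (the function name is hypothetical, not from the paper):

```python
# Illustrative sketch of centroid-based cosine scoring for enrollment
# embeddings, as described in the excerpt above. Not code from the paper.
import numpy as np

def cosine_distance_to_centroid(enroll: np.ndarray, test: np.ndarray) -> float:
    """enroll: (n_enrollments, dim), test: (dim,)."""
    centroid = enroll.mean(axis=0)
    cos = centroid @ test / (np.linalg.norm(centroid) * np.linalg.norm(test))
    return 1.0 - float(cos)
```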

Reference:

GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting
Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
  • Citing Conference Paper
  • April 2024

... The emergence of large language models (LLMs) has accelerated the evolution of text-to-speech systems from traditional pipeline architectures to end-to-end neural approaches, and more recently to LLM-powered zero-shot cloning systems [Naveed et al., 2023, Chang et al., 2024, Popov et al., 2021, Touvron et al., 2023a,b, Mehta et al., 2024]. Early neural models such as Tacotron 2 [Elias et al., 2021] and FastSpeech [Ren et al., 2020] introduced sequence-to-sequence frameworks with attention and duration prediction, significantly improving speech quality and synthesis speed. Later, VITS proposed a fully end-to-end probabilistic model that integrates text encoding, duration modeling, and waveform generation into a single variational framework, enabling faster inference and more expressive speech. ...

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
  • Citing Conference Paper
  • August 2021

... It is trained with UTF-8 byte-based input representations derived from a text encoder which allows for sharing representations across languages efficiently. The major components of the T2F model are based on Parallel Tacotron 2 [17], [18]. Inference begins with a text encoder that transforms the linguistic information into a sequence of hidden representations. ...
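
UTF-8 byte-based input means the model's symbol vocabulary is just the 256 possible byte values, so one embedding table covers every language without per-language tokenizers. A minimal illustrative sketch:

```python
# Minimal sketch of UTF-8 byte tokenization: every language maps into the
# same 256-ID vocabulary. Illustrative only.
def to_byte_ids(text: str) -> list[int]:
    return list(text.encode("utf-8"))  # each byte is an ID in [0, 255]

assert to_byte_ids("voz") == [118, 111, 122]  # ASCII: one byte per char
assert len(to_byte_ids("声")) == 3            # CJK: three bytes per char
```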

Parallel Tacotron: Non-Autoregressive and Controllable TTS
  • Citing Conference Paper
  • June 2021