Ye Jia’s research while affiliated with Google Inc. and other places

Publications (48)


SimulTron: On-Device Simultaneous Speech to Speech Translation
  • Conference Paper

April 2025 · 11 Reads · 2 Citations

Alex Agranovich · Eliya Nachmani · Oleg Rybakov · [...] · Michelle Tadmor Ramanovich

SimulTron: On-Device Simultaneous Speech to Speech Translation

June 2024 · 33 Reads

Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation on mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that builds on the strengths of the Translatotron framework while incorporating key modifications for streaming operation and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to a previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, demonstrating its potential for simultaneous on-device S2ST.
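The abstract describes streaming operation with an adjustable fixed delay: output is only emitted once a fixed amount of source audio has accumulated, then continues chunk by chunk. The following is a minimal sketch of that fixed-delay streaming loop; all names (ToyStreamingS2ST, translate_chunk, etc.) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of a fixed-delay streaming loop in the spirit of SimulTron.
# The model class is a placeholder for a causal streaming S2ST model.
from typing import Iterator, List


class ToyStreamingS2ST:
    """Stands in for a streaming encoder/decoder; here it just labels chunks."""

    def translate_chunk(self, context: List[str]) -> str:
        # A real model would run a causal encoder over `context` and emit
        # the next chunk of target speech; we return a placeholder token.
        return f"out[{len(context)}]"


def simultaneous_translate(frames: Iterator[str], delay: int) -> Iterator[str]:
    """Emit one output chunk per input frame, after an initial fixed delay."""
    model = ToyStreamingS2ST()
    context: List[str] = []
    for frame in frames:
        context.append(frame)
        if len(context) >= delay:          # adjustable fixed delay
            yield model.translate_chunk(context)
    # Flush: keep emitting until the output has caught up with the input.
    for _ in range(min(delay - 1, len(context))):
        yield model.translate_chunk(context)


if __name__ == "__main__":
    source = (f"src[{i}]" for i in range(6))
    for chunk in simultaneous_translate(source, delay=3):
        print(chunk)
```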




Figures: RNN-Transducer; ASR-D3ST based cascaded model (JGA)
Speech Aware Dialog System Technology Challenge (DSTC11)
  • Preprint
  • File available

December 2022 · 41 Reads

Most research on task-oriented dialog modeling is based on written text input. However, users often interact with practical dialog systems using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences between written and spoken language. Research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus and task which can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech (TTS) systems are a reasonable surrogate for the more labor-intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task -- (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state-of-the-art in this domain.
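The TTS-Verbatim variant amounts to rendering every written user turn to audio with a TTS engine. Below is a hedged sketch of that data-generation step; the `synthesize` function and the corpus layout are stand-ins, not the organizers' actual pipeline.

```python
# Hedged sketch of producing a "TTS-Verbatim" spoken version of a written
# dialog corpus: every user turn is rendered to audio with a TTS engine.
import json
import wave
from pathlib import Path


def synthesize(text: str, sample_rate: int = 16000) -> bytes:
    """Placeholder TTS: returns silence whose length scales with the text."""
    return b"\x00\x00" * sample_rate * max(1, len(text) // 20)


def build_tts_verbatim(dialogs_json: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    dialogs = json.loads(dialogs_json.read_text())
    for dialog_id, turns in dialogs.items():
        for i, turn in enumerate(turns):
            if turn["speaker"] != "USER":
                continue  # only user turns are spoken in this variant
            audio = synthesize(turn["text"])
            with wave.open(str(out_dir / f"{dialog_id}_{i}.wav"), "wb") as f:
                f.setnchannels(1)
                f.setsampwidth(2)       # 16-bit PCM
                f.setframerate(16000)
                f.writeframes(audio)
```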


Textless Direct Speech-to-Speech Translation with Discrete Speech Representation

October 2022 · 84 Reads

Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes them infeasible for languages without written forms. In this work, we propose a novel model, Textless Translatotron, which is based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task predicting target phonemes as in Translatotron 2, the proposed model uses an auxiliary task predicting discrete speech representations which are obtained from learned or random speech quantizers. When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2 on the multilingual CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.
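The auxiliary targets here are discrete units produced by a learned or random speech quantizer. A minimal sketch of a random-projection quantizer that maps continuous speech frames to discrete IDs is shown below; the dimensions and quantizer design are illustrative assumptions, not the exact configuration used for Textless Translatotron.

```python
# Minimal sketch of a random-projection quantizer that turns continuous
# speech features into discrete unit IDs, usable as auxiliary targets.
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 512
projection = rng.normal(size=(FEAT_DIM, PROJ_DIM))       # fixed, random
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))    # fixed, random


def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, FEAT_DIM) speech features to (T,) discrete unit IDs."""
    projected = frames @ projection                       # (T, PROJ_DIM)
    # Nearest codebook entry by Euclidean distance for every frame.
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)


if __name__ == "__main__":
    mel = rng.normal(size=(120, FEAT_DIM))   # fake log-mel features
    print(quantize(mel)[:10])                # e.g. first 10 unit IDs
```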





Figures: accent quality, style appropriateness, and naturalness MOS results for accent transfer voices trained on synthetic data; accent transfer quality and appropriateness of the intermediate (Tacotron) vs. the final (CHiVE-BERT) models.
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

August 2022 · 227 Reads

Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects - remain challenging. One of the challenges is that models with high-quality transfer capabilities can have stability issues, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that models trained on synthetic data in this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.
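The recipe is essentially two stages: run the less robust, transfer-capable model over a text corpus to build a synthetic parallel corpus, then train the robust model on those pairs. A hedged sketch of that data flow follows; both model classes are placeholders rather than the paper's actual systems.

```python
# Hedged sketch of the two-stage recipe: generate audio with a transfer-
# capable but less stable TTS model, then train a robust TTS model on the
# resulting synthetic (text, audio) pairs.
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    audio: bytes


class AccentTransferTTS:
    """Stand-in for the less robust, transfer-capable teacher (Tacotron-style)."""

    def synthesize(self, text: str, target_accent: str) -> bytes:
        return f"<audio:{target_accent}:{text}>".encode()


class RobustTTS:
    """Stand-in for the robust student model (CHiVE-BERT-style)."""

    def train(self, data: List[Utterance]) -> None:
        print(f"training on {len(data)} synthetic utterances")


def build_and_train(texts: List[str], accent: str) -> RobustTTS:
    teacher = AccentTransferTTS()
    synthetic = [Utterance(t, teacher.synthesize(t, accent)) for t in texts]
    student = RobustTTS()
    student.train(synthetic)      # stability comes from the robust student
    return student


if __name__ == "__main__":
    build_and_train(["hello there", "good morning"], accent="en-GB")
```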


Citations (29)


... The inner product between data points is used in generative machine intelligence to compute correlation, angle, direction, or similarity scores. A basic example with orthogonal directions is given by (1, 2, 3), (12, 3, −6), (1, −2, 1). Now, if one applies tanh to these vectors element by element, they become correlated; they are no longer orthogonal. ...
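The claim in this excerpt can be checked numerically: the three vectors are pairwise orthogonal, but after an element-wise tanh their inner products are no longer zero. A small NumPy check:

```python
# Numerical check: the three vectors are pairwise orthogonal, but after
# element-wise tanh the off-diagonal inner products become nonzero.
import numpy as np

v = np.array([[1.0, 2.0, 3.0],
              [12.0, 3.0, -6.0],
              [1.0, -2.0, 1.0]])

print(v @ v.T)                      # off-diagonal entries are 0: orthogonal
print(np.tanh(v) @ np.tanh(v).T)    # off-diagonal entries are nonzero
```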

Reference:

Breaking the Barriers of Text-Hungry and Audio-Deficient AI
SimulTron: On-Device Simultaneous Speech to Speech Translation
  • Citing Conference Paper
  • April 2025

... The authors in [27] examine textless Audio-to-Audio translation with limited parallel data. The authors in [70] examine textless direct Audio-to-Audio translation with discrete speech representation. The authors in [31] examine enhancing expressivity transfer in textless Audio-to-Audio translation. ...

Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
  • Citing Conference Paper
  • June 2023

... For instance, Hida et al. (2022) demonstrated that combining explicit morphological features with pretrained language models like BERT significantly enhances pitch-accent prediction (Hida et al., 2022). Similarly, pretrained models such as PnG BERT have shown promise in integrating grapheme and phoneme information for better pitch-accent rendering (Jia et al., 2021). Yasuda and Toda (2022) found that fine-tuning PnG BERT with tone prediction tasks led to notable improvements over baseline Tacotron models (Wang et al., 2017) in accent correctness (Yasuda and Toda, 2022). ...

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
  • Citing Conference Paper
  • August 2021

... Existing dubbing methods can be categorized into two groups, each focusing on learning different styles of key prior information to generate high-quality voices. The first group focuses on learning effective speaker style representations [1,2,3,4]. The second group aims to learn appropriate prosody by utilizing visual information from the given video input [5,6,7]. ...

More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
  • Citing Conference Paper
  • June 2022

... Importantly, the performance of these models is often evaluated using different experimental setups, which limits the extent to which their performance can be reliably compared. Several standardized evaluation setups and benchmarks have been proposed to evaluate the performance of pre-trained multilingual speech models [10][11][12]. ...

XTREME-S: Evaluating Cross-lingual Speech Representations
  • Citing Conference Paper
  • September 2022

... The Tacotron2 model has been used for text-to-spectrogram conversion, and the Waveglow model then converts the spectrogram into target audio samples. While there are multiple options available for the spectrogram prediction network and the audio synthesis network, we choose Tacotron2 + Waveglow as they are competitive with other architectures and still popular in the literature [10,1,12,11]. Moreover, there are single-stage end-to-end deep learning models available, but these are not considered in this work due to their high data requirements. ...
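The two-stage pipeline the excerpt describes (text to mel spectrogram, then spectrogram to waveform) can be sketched as below; the class names and methods are placeholders, not any specific library's API.

```python
# Sketch of the two-stage TTS pipeline: a spectrogram predictor
# (Tacotron2-style) followed by a neural vocoder (WaveGlow-style).
import numpy as np


class SpectrogramPredictor:
    """Stands in for Tacotron2: text -> mel spectrogram."""

    def infer(self, text: str) -> np.ndarray:
        n_frames = max(1, len(text)) * 5          # rough length heuristic
        return np.zeros((80, n_frames))           # (mel_bins, frames)


class Vocoder:
    """Stands in for WaveGlow: mel spectrogram -> waveform samples."""

    def infer(self, mel: np.ndarray, hop: int = 256) -> np.ndarray:
        return np.zeros(mel.shape[1] * hop)       # silence of matching length


def tts(text: str) -> np.ndarray:
    mel = SpectrogramPredictor().infer(text)
    return Vocoder().infer(mel)


if __name__ == "__main__":
    print(tts("hello world").shape)
```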

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

... ing data, some studies utilize unsupervised methods or data augmentation to enhance performance (Jia et al., 2022;Dong et al., 2022;Popuri et al., 2022). A challenge in textless S2ST is extracting acoustic and semantic features from noisy speech sequences, leading many studies to employ the VQ-VAE method to aid alignment learning between different language speeches (Tjandra et al., 2019;Zhang et al., 2020). ...

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
  • Citing Conference Paper
  • September 2022

... Acoustic echo cancellation (AEC) techniques [18,19,20] address this issue. Signal processing [18,19,21,22] and neural network [20,23,24,25] based solutions have both been proposed for AEC. What makes the task distinct from the others is that the reference signal of the device playback is usually available and can be used for noise suppression. ...
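Because the device-playback reference signal is available, a classical signal-processing echo canceller can adapt a filter that predicts the echo from the reference and subtracts it from the microphone signal. Below is a minimal NLMS (normalized least-mean-squares) sketch; the filter length and step size are illustrative, not tuned values from any of the cited systems.

```python
# Minimal NLMS echo canceller: estimate the echo path from the playback
# reference and subtract the predicted echo from the microphone signal.
import numpy as np


def nlms_aec(mic: np.ndarray, ref: np.ndarray, taps: int = 128,
             mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Return the residual e[n] = mic[n] - estimated_echo[n]."""
    w = np.zeros(taps)                     # adaptive filter (echo path estimate)
    padded = np.concatenate([np.zeros(taps - 1), ref])
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = padded[n:n + taps][::-1]       # most recent reference samples
        echo_hat = w @ x
        e = mic[n] - echo_hat              # residual after echo removal
        w += mu * e * x / (x @ x + eps)    # normalized LMS update
        out[n] = e
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=4000)                    # device playback
    echo = np.convolve(ref, [0.6, 0.3, 0.1])[:4000]
    mic = echo + 0.01 * rng.normal(size=4000)      # echo plus faint noise
    res = nlms_aec(mic, ref)
    print(float(np.mean(res[2000:] ** 2)))         # residual power shrinks
```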

Textual Echo Cancellation
  • Citing Conference Paper
  • December 2021

... The emergence of large language models (LLMs) has accelerated the evolution of text-to-speech systems from traditional pipeline architectures to end-to-end neural approaches, and more recently to LLM-powered zero-shot cloning systems [Naveed et al., 2023, Chang et al., 2024, Popov et al., 2021, Touvron et al., 2023a,b, Mehta et al., 2024]. Early neural models such as Tacotron 2 [Elias et al., 2021] and FastSpeech [Ren et al., 2020] introduced sequence-to-sequence frameworks with attention and duration prediction, significantly improving speech quality and synthesis speed. Later, VITS proposed a fully end-to-end probabilistic model that integrates text encoding, duration modeling, and waveform generation into a single variational framework, enabling faster inference and more expressive speech. ...

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
  • Citing Conference Paper
  • August 2021