James Betker’s scientific contributions


Publications (1)


[Figure 2: Training curves for the VQVAE. Y-axis: MSE loss (log-log scale); X-axis: training steps.]
[Figure 3: Early training curves (log-log scale). Y-axis: cross-entropy loss for MEL tokens; X-axis: training steps. A long tail of training and fine-tuning is omitted because online changes added non-reproducible noise to the curves.]
[Figure 4: Late training curves for CLVP (log-log scale). Y-axis: cross-entropy loss; X-axis: training samples. Early training curves were lost.]
[Figure 5: Diffusion model losses (log-log scale). Y-axis: MSE loss; X-axis: training samples.]
Better speech synthesis through scaling
  • Preprint
  • File available

May 2023 · 574 Reads · 1 Citation

James Betker

In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.
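The "step-wise probabilistic process" the abstract refers to can be illustrated with a minimal toy sketch of a DDPM reverse (denoising) loop. This is an assumption-laden illustration of the general technique, not code from the TorToise repository: the schedule constants, function names, and the scalar stand-in for a sample are all invented here, and the network's noise prediction is replaced by a placeholder.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear noise-variance schedule beta_t, t = 0..num_steps-1."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def ddpm_reverse_step(x_t, eps_pred, t, betas, alphas_cumprod, rng):
    """One reverse diffusion step: estimate x_{t-1} from the noisy sample x_t
    and a predicted noise eps_pred (normally output by a trained network)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = alphas_cumprod[t]
    # Posterior mean of x_{t-1} given x_t and the noise estimate.
    mean = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(alpha_t)
    if t == 0:
        return mean  # final step is deterministic
    return mean + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)

if __name__ == "__main__":
    num_steps = 50
    betas = linear_beta_schedule(num_steps)
    abar, acc = [], 1.0
    for b in betas:                      # cumulative product of (1 - beta_t)
        acc *= (1.0 - b)
        abar.append(acc)
    rng = random.Random(0)
    x = rng.gauss(0.0, 1.0)              # start from pure noise (scalar toy "sample")
    for t in reversed(range(num_steps)):
        eps_pred = 0.0                   # stand-in for the network's noise prediction
        x = ddpm_reverse_step(x, eps_pred, t, betas, abar, rng)
```

In a real system such as TorToise, `eps_pred` comes from a large trained model conditioned on text (and, for TTS, on speaker audio), and `x` is a tensor rather than a scalar; the step-wise structure of the sampler, however, is exactly this loop.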


Citations (1)


... SpeechT5 is a pre-trained encoder-decoder model for various spoken language processing tasks, including its TTS application, which we use as a baseline. Finally, XTTS is a zero-shot TTS model built on Tortoise [30], which incorporates a decoder-only Transformer with some modifications to improve voice cloning and enable faster training and inference. ...

Reference: Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement (citing Better speech synthesis through scaling)