January 2023 · 8 Reads · 4 Citations
October 2022 · 17 Reads
Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment (G-Augment), a technique to define the augmentation space as directed acyclic graphs (DAGs) and to search over this space to optimize the augmentation policy itself. We show that, given the same computational budget, policies produced by G-Augment perform better than SpecAugment policies obtained by random search on fine-tuning tasks on CHiME-6 and AMI. G-Augment also establishes a new state-of-the-art ASR performance on the CHiME-6 evaluation set (30.7% WER). We further demonstrate that G-Augment policies show better transfer properties across warm-start to cold-start training and model size compared to random-searched SpecAugment policies.
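As a rough illustration of the search setup described above, the sketch below represents a policy as a small DAG of masking-style augmentation operations and random-searches over candidate policies under a fixed trial budget. The operation names, parameter ranges, and the evaluate placeholder (which would fine-tune an ASR model with the policy and return dev-set WER) are illustrative assumptions, not the paper's actual search space.

import numpy as np

def time_mask(feats, max_width, rng):
    # Zero out a random block of time frames in a (time, freq) feature matrix.
    t = feats.shape[0]
    w = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, t - w)))
    out = feats.copy()
    out[start:start + w, :] = 0.0
    return out

def freq_mask(feats, max_width, rng):
    # Zero out a random block of frequency channels.
    f = feats.shape[1]
    w = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, f - w)))
    out = feats.copy()
    out[:, start:start + w] = 0.0
    return out

OPS = {"time_mask": time_mask, "freq_mask": freq_mask}  # toy operation set

def sample_policy(rng, n_nodes=4):
    # Sample a random policy: each node applies one operation to the output
    # of an earlier node (parent -1 reads the raw features), giving a DAG.
    policy = []
    for i in range(n_nodes):
        policy.append({
            "op": str(rng.choice(list(OPS))),
            "max_width": int(rng.integers(1, 30)),
            "parent": int(rng.integers(0, i)) if i > 0 else -1,
        })
    return policy

def apply_policy(policy, feats, rng):
    # Evaluate nodes in topological order; return the last node's output.
    outputs = []
    for node in policy:
        src = feats if node["parent"] < 0 else outputs[node["parent"]]
        outputs.append(OPS[node["op"]](src, node["max_width"], rng))
    return outputs[-1]

def random_search(evaluate, budget=20, seed=0):
    # evaluate(policy) is a placeholder that would fine-tune an ASR model
    # with the policy and return dev-set WER (lower is better).
    rng = np.random.default_rng(seed)
    best_policy, best_wer = None, float("inf")
    for _ in range(budget):
        policy = sample_policy(rng)
        wer = evaluate(policy)
        if wer < best_wer:
            best_policy, best_wer = policy, wer
    return best_policy, best_wer

In the paper the search covers the graph structure as well as the operations and their parameters; only the budgeted propose-evaluate-select loop is meant to carry over from this sketch.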
October 2021 · 15 Reads · 36 Citations
August 2021 · 13 Reads · 38 Citations
August 2021 · 121 Reads · 91 Citations
June 2021 · 82 Reads
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence and, through an iterative refinement process, generates an audio waveform. This contrasts with the original WaveGrad vocoder, which conditions on mel-spectrogram features generated by a separate model. The iterative refinement process starts from Gaussian noise and, through a series of refinement steps (e.g., 50 steps), progressively recovers the audio sequence. WaveGrad 2 offers a natural way to trade off inference speed against sample quality by adjusting the number of refinement steps. Experiments show that the model can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. We also report various ablation studies over different model configurations. Audio samples are available at https://wavegrad.github.io/v2.
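A minimal sketch of such an iterative refinement loop is given below, assuming a trained network model(y, phonemes, noise_level) that predicts the noise component of the current waveform estimate. The noise-schedule values are arbitrary examples, and the update follows a generic DDPM-style ancestral sampler rather than the exact procedure in the paper.

import numpy as np

def sample(model, phonemes, num_steps=50, wav_len=24000, seed=0):
    # Illustrative WaveGrad-style iterative refinement sampler.
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, num_steps)   # example noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    y = rng.standard_normal(wav_len)             # start from Gaussian noise
    for n in reversed(range(num_steps)):
        # Predict the noise in the current estimate, conditioned on phonemes.
        eps = model(y, phonemes, np.sqrt(alpha_bars[n]))
        # Remove the predicted noise (DDPM-style posterior mean update).
        y = (y - (1 - alphas[n]) / np.sqrt(1 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            # Re-inject a small amount of noise on all but the final step.
            y = y + np.sqrt(betas[n]) * rng.standard_normal(wav_len)
    return y

Lowering num_steps is the speed/quality trade-off mentioned in the abstract: fewer refinement steps mean faster inference at the cost of sample quality.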
June 2021 · 54 Reads · 101 Citations
June 2021 · 162 Reads · 115 Citations
June 2021 · 19 Reads
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. These problems interact: increasing the number of output sources exacerbates over-separation. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To extend MixIT to larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
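To make the computational argument concrete, the sketch below contrasts the exhaustive MixIT assignment search, whose cost grows as 2^M for M output sources, with a least-squares relaxation projected back onto the constraint set, and adds an example sparsity penalty over per-source energies. The loss functions and the projection rule are simplified assumptions for illustration, not the exact formulations used in the paper.

import itertools
import numpy as np

def snr_loss(ref, est, eps=1e-8):
    # Negative SNR in dB (lower is better); a stand-in for the paper's loss.
    return -10.0 * np.log10(np.sum(ref**2) / (np.sum((ref - est)**2) + eps) + eps)

def mixit_loss_exact(mixtures, sources):
    # Exhaustive MixIT: try every assignment of M sources to the 2 input
    # mixtures. mixtures: (2, T), sources: (M, T). Cost grows as 2**M.
    M = sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        A = np.zeros((2, M))
        A[list(assign), np.arange(M)] = 1.0
        remixed = A @ sources
        best = min(best, sum(snr_loss(m, r) for m, r in zip(mixtures, remixed)))
    return best

def mixit_loss_fast(mixtures, sources):
    # Least-squares relaxation: solve for a continuous mixing matrix, then
    # project onto the MixIT constraint set by assigning each source to the
    # mixture with the larger weight.
    M = sources.shape[0]
    A_ls, *_ = np.linalg.lstsq(sources.T, mixtures.T, rcond=None)  # (M, 2)
    assign = np.argmax(A_ls, axis=1)
    A = np.zeros((2, M))
    A[assign, np.arange(M)] = 1.0
    remixed = A @ sources
    return sum(snr_loss(m, r) for m, r in zip(mixtures, remixed))

def sparsity_loss(sources, eps=1e-8):
    # Example sparsity penalty favoring fewer active output sources: an
    # L1/L2-style ratio over per-source energies (illustrative only).
    energies = np.sqrt(np.sum(sources**2, axis=1) + eps)
    return np.sum(energies) / (np.sqrt(np.sum(energies**2)) + eps)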
November 2020 · 109 Reads
We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding frames. This model can be optimized directly with maximum likelihood, without intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
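The generation procedure described above can be sketched as follows, with text_encoder, decoder_step, and inverse_flow standing in for the learned modules (their names and signatures are assumptions). Only the control flow mirrors the description: one autoregressive decoder step per fixed-length frame, with the samples inside each frame drawn in parallel by inverting the conditioned flow on Gaussian noise.

import numpy as np

FRAME = 960  # example frame length: hundreds of samples, non-overlapping

def generate(text_encoder, decoder_step, inverse_flow, text, num_frames, seed=0):
    # Illustrative generation loop for a Tacotron-style decoder with a
    # normalizing flow over waveform frames; all modules are placeholders.
    rng = np.random.default_rng(seed)
    memory = text_encoder(text)              # encoded text for attention
    state = None
    prev_frame = np.zeros(FRAME)             # autoregressive conditioning
    frames = []
    for _ in range(num_frames):
        # One autoregressive decoder step, conditioned on the previous frame.
        cond, state = decoder_step(prev_frame, memory, state)
        # Sample the whole frame in parallel: draw noise, invert the flow.
        z = rng.standard_normal(FRAME)
        frame = inverse_flow(z, cond)
        frames.append(frame)
        prev_frame = frame
    return np.concatenate(frames)            # num_frames * FRAME samples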
... While traditional methods have their merits and have been effectively used in AMR for years, artificial intelligence and machine learning (AI/ML) models offer significant advantages in terms of accuracy, robustness, and adaptability [6], [7]. The automated feature extraction and ability to learn complex patterns make AI models a powerful tool for modern AMR problems, leading to better performance in real-world applications [8]. ...
September 2015
... In [16], a data purification approach is proposed to improve the training of a self-supervised personalized speech-enhancement model. Several DA methods have been proposed to improve ASR performance, such as SpecAugment [17], Multi-Condition Training (MCT) [18], patched Multi-Condition Training (pMCT) [19], and G(raph)-Augment [20]. However, none of them target ASR personalization tasks. ...
January 2023
... MixIT is one of the major unsupervised TSE methods, and several improvements have been proposed. For instance, one method mitigated the overseparation problem by introducing a penalty term for the number of active sources and the correlation between the output sources [42]. Other methods produced better separation by using a pre-trained classification model (e.g., an audio event classification or an ASR model) [42], [43] or by employing a loss function that relaxes the training difficulty [40], [41]. ...
October 2021
... Denoising Diffusion Probabilistic Models (DDPMs) belong to a class of generative networks that synthesize data by progressively reducing noise from Gaussian distributions through a reverse Markov process, transforming random noise into structured outputs similar to the training data [67]. Originally applied to images, this method was later extended to audio [68]-[70] and text-to-speech (TTS) synthesis [71]-[73]. DiffTTS [72] was a pioneer in applying DDPMs to the TTS field, followed by Grad-TTS [73], which modeled the diffusion process as a stochastic differential equation (SDE). Later models aim to improve efficiency: one combines denoising diffusion with GANs [74] to accelerate the decoding process, FastDiff [75] employs a noise scheduling algorithm [76] to reduce the number of steps, and MatchaTTS [77] utilizes optimal-transport conditional flow matching (OT-CFM) [78] to increase synthesis speed. ...
August 2021
... Speech-prefix finetuned (SP): This is the full configuration as described in §2.2, which uses an audio-dependent soft prompt. Speech prefix + Text Injection (ST): This configuration adds text injection [27,28] to SP. While finetuning on paired speech-text data, the trainable parameters are jointly trained on the text used to pretrain the respective LM, so that the LM does not overfit to the paired training data. ...
August 2021
... In recent years, text-to-speech (TTS) technology has undergone significant advancements, evolving from traditional end-to-end models [1][2][3][4][5] to more sophisticated cascade models [6][7][8][9][10], where an auto-regressive (AR) model and a decoder can be trained separately. A particularly notable development in this transition is the integration of large language models (LLMs) into the TTS pipeline [11][12][13][14][15][16][17][18]. ...
June 2021
... It is trained with UTF-8 byte-based input representations derived from a text encoder which allows for sharing representations across languages efficiently. The major components of the T2F model are based on Parallel Tacotron 2 [17], [18]. Inference begins with a text encoder that transforms the linguistic information into a sequence of hidden representations. ...
June 2021
... To combat the above shortcomings, [9] focuses on developing a completely unsupervised method called mixture invariant training (MixIT). In MixIT, existing mixtures are combined to construct training examples, and the model separates these examples into a variable number of latent sources, such that original mixtures can be approximated by remixing these separated sources. ...
June 2020
... While non-generative TTS models assume a one-to-one mapping between the input text and the speech [17], which does not match the intrinsic diversity of real-world speech, generative deep-learning models learn the distribution of the training data during training and sample from their latent space to generate diverse data [28]. Generative TTS models include TTS models based on Normalizing Flows [4], [19], [29], Diffusion models [30], and Variational Autoencoders (VAE) [31]. Flow-based generative models, in particular, can fit complex data distributions and generate diverse utterances with high quality. ...
May 2020
... Joint Audio-Text training (JAT) [11] is a recent approach for leveraging unpaired text-only data to improve ASR [10,11,26,27]. Unlike shallow fusion that considers token distributions from an external neural network language model (NNLM), JAT does not require additional model parameters or latency, making it suitable for on-device streaming ASR. ...
May 2020