WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Nanxin Chen1†, Yu Zhang2, Heiga Zen2, Ron J. Weiss2, Mohammad Norouzi2, Najim Dehak1,
William Chan2
1Center for Language and Speech Processing, Johns Hopkins University
2Brain Team, Google Research
{nchen14,ndehak3}@jhu.edu, {ngyuzh,heigazen,ronw,mnorouzi,williamchan}@google.com

†Work done during an internship at Google Brain.
Abstract
This paper introduces WaveGrad 2, a non-autoregressive gener-
ative model for text-to-speech synthesis. WaveGrad 2 is trained
to estimate the gradient of the log conditional density of the
waveform given a phoneme sequence. The model takes an input
phoneme sequence, and through an iterative refinement process,
generates an audio waveform. This contrasts with the original WaveGrad vocoder, which conditions on mel-spectrogram features generated by a separate model. The iterative refinement
process starts from Gaussian noise, and through a series of re-
finement steps (e.g., 50 steps), progressively recovers the audio
sequence. WaveGrad 2 offers a natural way to trade off inference speed against sample quality by adjusting the number of refinement steps. Experiments show that the model can
generate high fidelity audio, approaching the performance of a
state-of-the-art neural TTS system. We also report various abla-
tion studies over different model configurations. Audio samples
are available at https://wavegrad.github.io/v2.
Index Terms: neural TTS, audio synthesis, non-autoregressive,
score matching
1. Introduction
Deep learning has revolutionized text-to-speech (TTS) synthe-
sis [1–7]. Text-to-speech is a multimodal generation problem,
which maps an input text sequence to a speech sequence with
many possible variations in, for example, prosody, speaking
style, and phonation modes. Most neural TTS systems follow a
two-stage generation process. In the first step, a feature genera-
tion model generates intermediate representations, typically lin-
ear or mel-spectrograms, from a text or phoneme sequence. The
intermediate representations control the structure of the wave-
form and are usually generated by an autoregressive architec-
ture [4–6, 8, 9] to capture rich distributions. Next, a vocoder
takes the intermediate features as input and predicts the wave-
form [5, 10–23]. Because the vocoder takes predicted features from the feature generation model as input during inference, it is often trained using predicted features as input [5].
Even though this two-stage TTS pipeline can produce high-
fidelity audio, the deployment can be complicated since it uses
a cascade of learned modules. Another concern is that the intermediate features are often chosen largely by experience. For example, mel-spectrogram features usually work well but may not be the best choice for all applications. In
contrast, the benefits of data-driven end-to-end approaches have
been widely observed in various task domains across machine
learning. End-to-end approaches are able to learn the best in-
termediate features automatically from the training data, which
are usually task-specific. They are also easier to train since they do not require supervision or ground-truth signals at different stages.
There are two predominant families of end-to-end TTS models. Autoregressive models offer tractable likelihood computation, but
they require recurrent generation of waveform samples at in-
ference time [2], which can be slow. By contrast, non-
autoregressive models enable efficient parallel generation, but
they require token duration information. Reference token du-
rations for training are usually computed by an offline forced-
alignment model [7, 8]. For generation, an additional module is trained to predict these durations. More recent work [7, 24] has focused on applying
non-autoregressive models to end-to-end TTS. However, they
still rely on spectral losses and mel-spectrograms for alignment
and do not take full advantage of end-to-end training. Fast-
Speech 2 [7] requires additional conditioning signals such as
pitch and energy to reduce the number of candidate output se-
quences. EATS [24] uses adversarial training as well as spec-
trogram losses to handle the one-to-many mapping issue, which
makes the architecture more complicated.
In this work, we propose WaveGrad 2, a non-autoregressive
phoneme-to-waveform model that does not require intermediate
features or specialized loss functions. To make the architecture
more end-to-end, the WaveGrad [23] decoder is combined with a Tacotron 2-style non-autoregressive encoder. The WaveGrad
decoder iteratively refines the input signal, beginning from ran-
dom noise, and can produce high-fidelity audio with sufficient
steps. The one-to-many mapping problem is handled by the
score matching objective, which optimizes a weighted variational lower bound on the log-likelihood [25].
The non-autoregressive encoder follows the recently proposed Non-Attentive Tacotron [9], which combines a text encoder and a Gaussian resampling layer to incorporate the duration information. The ground-truth duration information is used during training, and a duration predictor is trained to estimate it. During inference, the duration predictor predicts the duration for each input token. Compared with attention-based models, such a duration predictor is significantly more resilient to attention failures, and monotonic alignment is guaranteed by the way positions are computed. The main contributions of this
paper are as follows:
• A fully differentiable and efficient architecture that produces
waveforms directly without generating intermediate features
like spectrograms explicitly;
• A model providing a natural trade-off between fidelity and
generation speed by changing the number of refinement steps;
• An end-to-end non-autoregressive model that reaches a mean opinion score (MOS) of 4.43, approaching the performance of state-of-the-art neural TTS systems.
2. Score matching
Similar to the original WaveGrad [23], WaveGrad 2 is built on
prior work on score matching [26, 27] and diffusion probabilis-
tic models [25,28]. In the case of TTS, the score function is de-
fined as the gradient of the log conditional distribution p(y|x)
with respect to the output y as
\[ s(y \mid x) = \nabla_y \log p(y \mid x), \tag{1} \]
where y is the waveform and x is the conditioning signal. To synthesize speech given the conditioning signal, one can draw a waveform iteratively via Langevin dynamics, starting from an initialization \tilde{y}_0, as
\[ \tilde{y}_{i+1} = \tilde{y}_i + \frac{\eta}{2}\, s(\tilde{y}_i \mid x) + \sqrt{\eta}\, z_i, \tag{2} \]
where \eta > 0 is the step size, z_i \sim \mathcal{N}(0, I), and I denotes the identity matrix.
Following previous work [23], we adopt a special parameterization known as the diffusion model [25, 28]. A score network \epsilon_\theta(\tilde{y}, x, \sqrt{\bar\alpha}) is trained to predict the scaled derivative by minimizing the distance between the model prediction and the ground-truth noise,
\[ \mathbb{E}_{\bar\alpha, \epsilon}\left[ \left\| \epsilon_\theta(\tilde{y}, x, \sqrt{\bar\alpha}) - \epsilon \right\|_1 \right], \tag{3} \]
where \epsilon \sim \mathcal{N}(0, I) is the noise term introduced by applying the reparameterization trick, \bar\alpha is the noise level, and \tilde{y} is sampled according to
\[ \tilde{y} = \sqrt{\bar\alpha}\, y_0 + \sqrt{1 - \bar\alpha}\, \epsilon. \tag{4} \]
During training, \bar\alpha is sampled from the interval [\bar\alpha_n, \bar\alpha_{n+1}] based on a pre-defined linear schedule of \beta's, according to
\[ \bar\alpha_n := \prod_{s=1}^{n} (1 - \beta_s). \tag{5} \]
At inference, the waveform estimate is updated in each iteration according to the stochastic process
\[ y_{n-1} = \frac{1}{\sqrt{\alpha_n}} \left( y_n - \frac{\beta_n}{\sqrt{1 - \bar\alpha_n}}\, \epsilon_\theta(y_n, x, \sqrt{\bar\alpha_n}) \right) + \sigma_n z, \tag{6} \]
where \alpha_n := 1 - \beta_n and z \sim \mathcal{N}(0, I).
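To make Eqs. (3)–(6) concrete, here is a minimal NumPy sketch of one training term and one refinement step. The helper names (make_noise_schedule, score_net), the schedule endpoints, and the choice of \sigma_n are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def make_noise_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.005):
    """Pre-defined linear schedule of betas and the cumulative alpha-bar of Eq. (5).
    The endpoint values are placeholders, not the paper's exact schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)                                 # Eq. (5)
    return betas, alphas, alpha_bars

def training_loss(score_net, y0, x, alpha_bars, rng=np.random):
    """One Monte-Carlo term of the L1 objective in Eq. (3)."""
    n = rng.randint(1, len(alpha_bars))
    # Sample a continuous noise level between adjacent points of the schedule.
    a_bar = rng.uniform(alpha_bars[n], alpha_bars[n - 1])
    eps = rng.standard_normal(y0.shape)
    y_tilde = np.sqrt(a_bar) * y0 + np.sqrt(1.0 - a_bar) * eps      # Eq. (4)
    return np.mean(np.abs(score_net(y_tilde, x, np.sqrt(a_bar)) - eps))

def sampling_step(score_net, y_n, x, n, betas, alphas, alpha_bars, rng=np.random):
    """One refinement iteration y_n -> y_{n-1} following Eq. (6)."""
    eps_hat = score_net(y_n, x, np.sqrt(alpha_bars[n]))
    mean = (y_n - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
    sigma_n = np.sqrt(betas[n])              # one common choice; an assumption here
    z = rng.standard_normal(y_n.shape) if n > 0 else 0.0
    return mean + sigma_n * z
```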
3. WaveGrad 2
The proposed model includes three modules illustrated in Fig-
ure 1:
• The encoder takes a phoneme sequence as input and extracts
abstract hidden representations from the input context.
• The resampling layer changes the resolution of the encoder
output to match the output waveform time scale, quantized
into 10ms segments (similar to typical mel-spectrogram fea-
tures). This is achieved by conditioning on the target duration
during training. Durations predicted by the duration predictor
module are utilized during inference.
• The WaveGrad decoder predicts the raw waveform by refin-
ing the noisy waveform iteratively. In each iteration, the de-
coder gradually refines the signal and adds fine-grained de-
tails.
Figure 1: WaveGrad 2 network architecture (phoneme encoder → resampling layer with a sampling window → WaveGrad decoder trained with an L1 loss; a duration predictor is trained with a duration loss). The input consists of the phoneme sequence. Dashed lines indicate computation performed only during training.
3.1. Encoder
The design of the encoder follows that of Tacotron 2 [5].
Phoneme tokens are used as inputs, with silence tokens inserted
at word boundaries. An end-of-sequence token is added after
each sentence. Tokens are first converted into learned embeddings, which are then passed through three convolutional layers with dropout [29] and batch normalization [30]. Finally, long-term contextual information is modeled by passing the output through a single bi-directional long short-term memory (LSTM) layer with zoneout regularization [31].
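As a rough illustration of this encoder, the following is a hypothetical PyTorch sketch; the layer sizes are assumptions, and the zoneout regularization on the LSTM is omitted for brevity.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Hypothetical sketch of the Tacotron 2-style encoder: embedding,
    three conv layers with batch norm and dropout, then a BiLSTM.
    Sizes are illustrative; zoneout regularization is omitted."""

    def __init__(self, num_phonemes, dim=512, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            )
            for _ in range(3)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phonemes):                   # phonemes: (batch, time) int64
        h = self.embed(phonemes).transpose(1, 2)   # (batch, dim, time)
        for conv in self.convs:
            h = conv(h)
        h = h.transpose(1, 2)                      # (batch, time, dim)
        out, _ = self.lstm(h)                      # (batch, time, dim)
        return out
```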
3.2. Resampling
The length of the output waveform sequence is very different
from the length of encoder representations. In Tacotron 2 [5],
this is resolved by the attention mechanism. To make the structure non-autoregressive and to speed up inference, we adopt the Gaussian upsampling introduced in Non-Attentive Tacotron [9]. Instead of repeating each token according to its duration, Gaussian upsampling predicts the duration and the influence range of each token simultaneously. These parameters are used to compute attention weights, which rely purely on the predicted positions. During training, the ground-truth durations are used instead, and an additional mean squared error loss is measured to train the duration predictor. This is labeled as Duration Loss in Figure 1. Ground-truth durations are not needed during inference; the predicted durations are used instead.
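As an informal sketch of Gaussian upsampling (under assumed tensor shapes; the function name gaussian_upsample and the half-frame offset are illustrative, and the duration predictor itself is omitted):

```python
import torch

def gaussian_upsample(h, durations, ranges):
    """Gaussian upsampling in the style of Non-Attentive Tacotron; a sketch.

    h:         (batch, tokens, dim) encoder outputs
    durations: (batch, tokens) per-token durations in frames (float)
    ranges:    (batch, tokens) per-token influence ranges (std dev), > 0
    returns    (batch, frames, dim) frame-rate representations
    """
    # Token centres: end of each token minus half its duration.
    ends = torch.cumsum(durations, dim=1)
    centres = ends - 0.5 * durations                              # (batch, tokens)
    num_frames = int(ends[:, -1].max().item())
    t = torch.arange(num_frames, device=h.device, dtype=h.dtype) + 0.5
    # Unnormalized Gaussian weight for every (frame, token) pair.
    dist = torch.distributions.Normal(centres.unsqueeze(1), ranges.unsqueeze(1))
    w = dist.log_prob(t.view(1, -1, 1)).exp()                     # (batch, frames, tokens)
    w = w / (w.sum(dim=2, keepdim=True) + 1e-8)                   # normalize over tokens
    return torch.bmm(w, h)                                        # (batch, frames, dim)
```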
3.3. Sampling Window
Since the waveform resolution is very high (24,000 samples per
second in our case), it is not feasible to compute the loss on
all waveform samples in an utterance because of the high com-
putation cost and memory constraints. Instead, after learning
the representations on the whole input sequence, we sample a
small segment to synthesize the waveform. Due to the resam-
pling layer, the encoder representations and waveform samples
are already aligned. Random segments are sampled individually within each minibatch, and the corresponding waveform segments are extracted based on the upsampling rate (300 in our setup). The full encoder sequence (after resampling) is used during inference, which introduces a small mismatch between training and inference. We conduct ablation studies on how the sampling window size influences fidelity in Section 4.1.
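A possible sketch of this segment sampling is shown below; the 256-frame window and the upsampling rate of 300 follow the text, while function and variable names are hypothetical.

```python
import torch

def sample_training_segment(frames, waveform, window=256, rate=300, generator=None):
    """Pick a random window of resampled encoder frames and the aligned waveform.

    frames:   (batch, num_frames, dim) resampled encoder output (10 ms frames)
    waveform: (batch, num_frames * rate) target waveform at 24 kHz
    Returns the frame segments and the corresponding waveform segments.
    """
    batch, num_frames, _ = frames.shape
    segs_f, segs_w = [], []
    for b in range(batch):
        start = torch.randint(0, max(num_frames - window, 1), (1,), generator=generator).item()
        segs_f.append(frames[b, start:start + window])
        segs_w.append(waveform[b, start * rate:(start + window) * rate])
    return torch.stack(segs_f), torch.stack(segs_w)
```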
Figure 2: WaveGrad decoder. The inputs consist of the conditioning representations x (the sampled frames), the noisy waveform y_n generated in the previous iteration, and the noise level √ᾱ. The sampled frames pass through a 3×1 Conv(768), a stack of UBlocks (512,×5; 512,×5; 256,×3; 128,×2; 128,×2), and a final 3×1 Conv(1); y_n passes through a 5×1 Conv(32) and DBlocks (128,/2; 128,/2; 256,/3; 512,/5), whose outputs, together with √ᾱ, condition the UBlocks via FiLM. The model produces the noise estimate ε_n at each iteration, which is used to update y_n. FiLM is Feature-wise Linear Modulation [32], which combines information from y_n and x.
3.4. Decoder
The decoder gradually upsamples the hidden representations to
match the waveform resolution. In our case, the waveform is sampled at 24 kHz and we need to upsample by a factor of 300. This
is achieved using the WaveGrad decoder [23], as shown in Fig-
ure 2. The architecture includes 5 upsampling blocks (UBlock)
and 4 downsampling blocks (DBlock). In each iteration of the
generation process, the network denoises the noisy input waveform estimate y_n by predicting the included noise term ε_n, conditioning on the hidden representations following Eq. (6). As described in Section 2, the generation process begins from a random noise estimate y_N and iteratively refines it over N steps (typically N = 1000) to generate a waveform sample. Following our previous work [23], the training objective is the L1 loss between the predicted and ground-truth noise terms. During training, this loss is computed using a single, randomly sampled iteration.
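For concreteness, the overall generation loop can be sketched as follows, reusing the hypothetical make_noise_schedule and sampling_step helpers from the Section 2 sketch; the decoder call score_net and the schedule are again assumptions.

```python
import numpy as np

# Reuses make_noise_schedule and sampling_step from the Section 2 sketch;
# score_net stands in for the WaveGrad 2 decoder and is an assumption.

def generate(score_net, cond_frames, num_steps=1000, rate=300, rng=np.random):
    """Iterative refinement: start from Gaussian noise y_N and apply Eq. (6) N times.

    cond_frames: (num_frames, dim) resampled encoder output for one utterance.
    """
    betas, alphas, alpha_bars = make_noise_schedule(num_steps)
    num_samples = cond_frames.shape[0] * rate        # 300 samples per 10 ms frame at 24 kHz
    y = rng.standard_normal(num_samples)             # y_N ~ N(0, I)
    for n in reversed(range(num_steps)):             # n = N-1, ..., 0
        y = sampling_step(score_net, y, cond_frames, n, betas, alphas, alpha_bars, rng)
    return y
```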
4. Experiments
We compare WaveGrad 2 with other neural TTS systems. Fol-
lowing [23], baseline systems were trained on a proprietary dataset consisting of 385 hours of high-quality English speech from 84 professional voice talents. A female speaker was cho-
sen from the training dataset for evaluation. 128-dimensional
mel-spectrogram features were extracted from 24 kHz wave-
form following the previous setup [23]: 50 ms Hanning win-
dow, 12.5 ms frame shift, 2048-point FFT, 20 Hz & 12 kHz
lower & upper frequency cutoffs. For Wave-Tacotron [33] and
the proposed WaveGrad 2 models, we used a subset of the training set that included all audio from the test speaker (39 hours of speech). Preliminary results suggested
that WaveGrad 2 trained on a single-speaker dataset gave better
performance, especially when the network size was small.
The following models were used for comparison:
Tacotron 2 + WaveRNN which was conditioned on mel-
spectrograms predicted by a Tacotron 2 model in teacher-
forcing mode following [5]. The WaveRNN used a single LSTM layer with 1,024 hidden units and 5 convolutional layers with 512 channels to process the mel-spectrogram features. A
10-component mixture of logistic distributions [34] was used
as its output layer, generating 16-bit quantized audio at 24 kHz
sample rate. Preliminary results indicated that further reducing
the number of LSTM units hurt performance.
Tacotron 2 + WaveGrad which was trained on ground truth
mel-spectrograms following [23]. Two different network sizes were included. The 15M-parameter WaveGrad Base model took
7,200 samples corresponding to 0.3 seconds of audio as in-
put during training. For the 23M-parameter WaveGrad Large
model, each training sample included 60 frames corresponding
to a 0.75 second audio segment.
Tacotron 2 + GAN-TTS/MelGAN which followed the first baseline, using non-autoregressive neural vocoders. We followed the setups and hyperparameters in the original work. MelGAN [17] included 3.22M parameters and was trained for 4M steps, while GAN-TTS included 21.4M parameters and was trained for 1M steps.
Wave-Tacotron [33] which used a Tacotron-like encoder-
decoder architecture, where the autoregressive decoder network
uses a normalizing flow to synthesize waveform samples directly as a sequence of non-overlapping 40 ms frames.
For all baseline systems which used separate vocoder
networks, we used predicted mel-spectrograms from the
Tacotron 2 model during inference. The same Tacotron 2 model
was used for all baselines.
The evaluation set included 1,000 sentences. A five-point
Likert scale score (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Ex-
cellent) was adopted, with rating increments of 0.5. Subjects
rated the naturalness of each stimulus after listening to it in a
quiet room. Each subject was allowed to evaluate up to six stim-
uli that were randomly chosen and presented in isolation. The
subjects were native speakers of English living in the United
States and were requested to use headphones.
Subjective evaluation results are summarized in Table 1. The WaveGrad 2 models almost matched the performance of the autoregressive Tacotron 2 + WaveRNN baseline and outperformed the other baselines with non-autoregressive vocoders.

Table 1: Mean opinion scores (MOS) of various models and their confidence intervals. MT: Multi-task learning.

Model                                                      Model size   MOS (↑)
Tacotron 2 + WaveRNN                                       38M + 18M    4.49 ± 0.04
Tacotron 2 + WaveGrad (Base, 1000)                         38M + 15M    4.47 ± 0.04
Tacotron 2 + WaveGrad (Large, 1000)                        38M + 23M    4.51 ± 0.04
Tacotron 2 + MelGAN                                        38M + 3M     3.95 ± 0.06
Tacotron 2 + GAN-TTS                                       38M + 21M    4.34 ± 0.04
Wave-Tacotron [33]                                         38M          4.08 ± 0.06
WaveGrad 2:
  Encoder(2048) + WaveGrad(Large, 1000)                    193M         4.37 ± 0.05
  Encoder(2048) + WaveGrad(Large, 1000) + MT               193M         4.39 ± 0.05
  Encoder(1024) + WaveGrad(Large, 1000) + MT + SpecAug     73M          4.43 ± 0.05
Ground Truth                                               –            4.58 ± 0.05
4.1. Sampling Window Size
Memory usage is a major concern for end-to-end training. Long
sequences corresponding to multi-second utterances may not
fit into memory since the main computation bottleneck comes
from the WaveGrad decoder which operates at the waveform
sample rate. To make training efficient, we sample a small seg-
ment from the resampled encoder representation and train the
decoder network using this segment instead of the full sequence.
Two different window sizes were explored: 64 and 256
frames, corresponding to 0.8 and 3.2 seconds of speech, respec-
tively. The results are shown in Table 2. The large window gave a better MOS than the small window. In all following experiments, we use the large window for training.

Table 2: Comparison between different sampling window sizes. All models use 1,000 iterations for inference.

Model                           Window size   MOS (↑)
Encoder(512) + WaveGrad(Base)   0.8 sec       3.80 ± 0.07
Encoder(512) + WaveGrad(Base)   3.2 sec       3.88 ± 0.07
4.2. Network Size
We carried out ablations using different network sizes. The en-
coder only needs to be computed once, thus increasing the hid-
den dimension has small impact to the inference speed. On the
other hand, the WaveGrad decoder needs to be executed multi-
ple times depending on the number of iterations.
Subjective evaluation results are presented in Table 3. The larger encoder increased the number of parameters by a large margin but led to only a small quality improvement. The improvement was smaller than that obtained by using a larger WaveGrad decoder, indicating that a larger decoder is crucial.

Table 3: Comparison between different network sizes. All models use 1,000 iterations for inference.

Model                            Model size   MOS (↑)
Encoder(512) + WaveGrad(Base)    37M          3.88 ± 0.07
Encoder(512) + WaveGrad(Large)   40M          4.19 ± 0.06
Encoder(2048) + WaveGrad(Base)   188M         4.05 ± 0.07
Encoder(2048) + WaveGrad(Large)  193M         4.37 ± 0.05
4.3. Hidden Features Augmentation
We explored applying a variant of SpecAugment [35] to the
conditioning input to the decoder (the resampled encoder out-
put). The augmentation is applied on the learned hidden repre-
sentations instead of the spectrograms. This can be viewed as a
form of correlated block dropout. A span of 32 consecutive frames was randomly selected and masked, and this masking was applied twice. The intuition is that the WaveGrad decoder can recover the masked part by conditioning on the contextual information, which encourages the encoder to learn robust representations that capture more context. Results are shown in Table 4. We did not
observe large improvements with this regularization.
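The masking described above could look roughly like the following sketch; the 32-frame width and two masks follow the text, while zero-filling as the mask value is an assumption.

```python
import torch

def mask_hidden_frames(frames, num_masks=2, width=32, generator=None):
    """Randomly zero out `num_masks` spans of `width` consecutive frames.

    frames: (batch, num_frames, dim) resampled encoder representations.
    Applied only during training, as a form of correlated block dropout.
    """
    batch, num_frames, _ = frames.shape
    masked = frames.clone()
    for b in range(batch):
        for _ in range(num_masks):
            start = torch.randint(0, max(num_frames - width, 1), (1,), generator=generator).item()
            masked[b, start:start + width] = 0.0
    return masked
```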
4.4. Multi-task Learning and Speed-Quality Tradeoff
Inspired by FastSpeech 2s [7], we explored leveraging mel-
spectrogram features to enhance encoder training. The en-
coder is encouraged to extract representations that can directly
predict the spectrogram features. We added a separate mel-
spectrogram decoder after the resampling layer to predict the
mel-spectrogram features. This decoder included one upsampling block [23], and the mean squared error (MSE) was measured as an additional loss over the whole sequence. During inference, we simply dropped this decoder, similar to FastSpeech 2s [7].

Table 4: Impact of augmentation on the learned representations. All models use 1,000 iterations for inference.

Model                             SpecAug   MOS (↑)
Encoder(2048) + WaveGrad(Large)   N         4.37 ± 0.05
Encoder(2048) + WaveGrad(Large)   Y         4.40 ± 0.05

Table 5: Impact of multi-task (MT) learning and the number of iterations.

Model                             MT   Iter   MOS (↑)
Encoder(2048) + WaveGrad(Large)   N    1000   4.37 ± 0.05
Encoder(2048) + WaveGrad(Large)   Y    1000   4.39 ± 0.05
Encoder(2048) + WaveGrad(Large)   Y    50     4.32 ± 0.05
As shown in Table 5, there was no significant performance
difference with multi-task training. This suggests that multi-task learning is not beneficial for end-to-end generation. We
also explored reducing the number of iterations from 1000 to 50
and found a small performance degradation (about 0.07 points).
5. Conclusions
In this paper, we presented WaveGrad 2, an end-to-end non-
autoregressive TTS model which takes a phoneme sequence
as input and synthesizes the waveform directly without using
hand-designed intermediate features (e.g., spectrograms) like
most TTS systems. Similar to prior work [23], the output wave-
form is generated through an iterative refinement process begin-
ning from random noise. The generation procedure provides a
tradeoff between fidelity and speed by varying the number of
refinement steps. Experiments demonstrate that WaveGrad 2 is
capable of generating high fidelity audio, comparable to strong
baselines. Ablation studies exploring different model configurations found that increased model size is the most important factor determining WaveGrad 2 synthesis quality. Future work includes improving performance with a limited number of refinement iterations.
6. References
[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu,
“WaveNet: A generative model for raw audio,” arXiv preprint
arXiv:1609.03499, 2016.
[2] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C.
Courville, and Y. Bengio, “Char2Wav: End-to-end speech syn-
thesis,” in Proc. ICLR, 2017.
[3] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,
K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg
et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in
Proc. ICML, 2018, pp. 3918–3926.
[4] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss,
N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le,
Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron:
Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017,
pp. 4006–4010.
[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang,
Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., “Natural TTS
synthesis by conditioning WaveNet on mel spectrogram predic-
tions,” in Proc. ICASSP, 2018, pp. 4779–4783.
[6] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with Transformer network,” in Proc. AAAI, vol. 33, 2019,
pp. 6706–6713.
[7] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fast-
Speech 2: Fast and high-quality end-to-end text-to-speech,” arXiv
preprint arXiv:2006.04558, 2020.
[8] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu,
D. Tuo, K. Kang, G. Lei, D. Su, and D. Yu, “DurIAN: Dura-
tion informed attention network for multimodal synthesis,” arXiv
preprint arXiv:1909.01700, 2019.
[9] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and
Y. Wu, “Non-attentive Tacotron: Robust and controllable neural
TTS synthesis including unsupervised duration modeling,” arXiv
preprint arXiv:2010.04301, 2020.
[10] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-
based generative network for speech synthesis,” in Proc. ICASSP,
2019, pp. 3617–3621.
[11] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave genera-
tion in end-to-end text-to-speech,” in Proc. ICLR, 2018.
[12] S. Kim, S.-G. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet:
A generative flow for raw audio,” in Proc. ICML, 2019, pp. 3370–
3378.
[13] H. Kim, H. Lee, W. H. Kang, S. J. Cheon, B. J. Choi, and N. S.
Kim, “WaveNODE: A continuous normalizing flow for speech
synthesis,” arXiv preprint arXiv:2006.04598, 2020.
[14] N.-Q. Wu and Z.-H. Ling, “WaveFFJORD: FFJORD-based
vocoder for statistical parametric speech synthesis,” in Proc.
ICASSP, 2020, pp. 7214–7218.
[15] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio
synthesis,” in Proc. ICLR, 2018.
[16] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and
A. Roberts, “GANSynth: Adversarial neural audio synthesis,” in
Proc. ICLR, 2018.
[17] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Zhen Teoh,
J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, “Mel-
GAN: Generative adversarial networks for conditional waveform
synthesis,” Proc. NeurIPS, 2019.
[18] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-
band MelGAN: Faster waveform generation for high-quality text-
to-speech,” arXiv preprint arXiv:2005.05106, 2020.
[19] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A
fast waveform generation model based on generative adversarial
networks with multi-resolution spectrogram,” in Proc. ICASSP,
2020, pp. 6199–6203.
[20] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen,
N. Casagrande, L. C. Cobo, and K. Simonyan, “High fidelity
speech synthesis with adversarial networks,” in Proc. ICLR, 2020.
[21] J. Yang, J. Lee, Y. Kim, H.-Y. Cho, and I. Kim, “VocGAN: A high-
fidelity real-time vocoder with a hierarchically-nested adversarial
network,” Proc. Interspeech, pp. 200–204, 2020.
[22] O. McCarthy and Z. Ahmed, “HooliGAN: Robust, high quality
neural vocoding,” arXiv preprint arXiv:2008.02493, 2020.
[23] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan,
“WaveGrad: Estimating gradients for waveform generation,” in
Proc. ICLR, 2021.
[24] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-end adversarial text-to-speech,” arXiv preprint
arXiv:2006.03575, 2020.
[25] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic
models,” arXiv preprint arXiv:2006.11239, 2020.
[26] A. Hyvärinen and P. Dayan, “Estimation of non-normalized sta-
tistical models by score matching,” JMLR, vol. 6, no. 4, 2005.
[27] P. Vincent, “A connection between score matching and denois-
ing autoencoders,” Neural Computation, vol. 23, no. 7, pp. 1661–
1674, 2011.
[28] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Gan-
guli, “Deep unsupervised learning using nonequilibrium thermo-
dynamics,” in Proc. ICML, 2015.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural net-
works from overfitting,” JMLR, vol. 15, no. 1, pp. 1929–1958,
2014.
[30] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Proc.
ICML, 2015, pp. 448–456.
[31] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville et al., “Zoneout: Regularizing RNNs by randomly preserving hidden activa-
tions,” arXiv preprint arXiv:1606.01305, 2016.
[32] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. d. Vries,
A. Courville, and Y. Bengio, “Feature-wise transformations,” Dis-
till, 2018, https://distill.pub/2018/feature-wise-transformations.
[33] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad,
and D. P. Kingma, “Wave-Tacotron: Spectrogram-free end-to-
end text-to-speech synthesis,” arXiv preprint arXiv:2011.03568,
2020.
[34] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixel-
CNN++: Improving the PixelCNN with discretized logistic mix-
ture likelihood and other modifications,” in Proc. ICLR, 2017.
[35] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D.
Cubuk, and Q. V. Le, “SpecAugment: A simple data augmen-
tation method for automatic speech recognition,” in Proc. Inter-
speech, 2019, pp. 2613–2617.