Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
Neeraj Kumar
Hike Private Limited
neerajku@hike.in
Srishti Goel
Hike Private Limited
srishtig@hike.in
Ankur Narang
Hike Private Limited
ankur@hike.in
Brejesh Lall
IIT Delhi
brejesh@ee.iitd.ac.in
Abstract
The style of speech varies from person to person, and every person exhibits his or her own style of speaking that is determined by language, geography, culture and other factors. Style is best captured by the prosody of a signal. High quality multi-speaker speech synthesis that accounts for prosody and works in a few shot manner is an area of active research with many real-world applications. While multiple efforts have been made in this direction, it remains an interesting and challenging problem.
In this paper, we present a novel few shot multi-speaker speech
synthesis approach (FSM-SS) that leverages adaptive normal-
ization architecture with a non-autoregressive multi-head at-
tention model. Given an input text and a reference speech
sample of an unseen person, FSM-SS can generate speech
in that person’s style in a few shot manner. Additionally, we
demonstrate how the affine parameters of normalization help
in capturing the prosodic features such as energy and funda-
mental frequency in a disentangled fashion and can be used
to generate morphed speech output. We demonstrate the ef-
ficacy of our proposed architecture on multi-speaker VCTK
and LibriTTS datasets, using multiple quantitative metrics
that measure generated speech distortion and MoS, along with
speaker embedding analysis of the generated speech vs the
actual speech samples.
1 Introduction
A lot of exciting developments have been made in speech syn-
thesis systems to synthesize natural sounding human speech.
The developments in this area have helped in a number of
applications including audiobook narration, news readers,
conversational assistants and engaging user experiences in
the virtual worlds.
To realise a natural speech synthesis system, the model has to capture the speaking style of every person. For this, the prosodic features of speech play an important role. Prosody is a confluence of a number of phenomena such as paralinguistic information, intonation, stress, and style. Such phenomena are best described by the duration, fundamental frequency and energy of the speech. Multiple efforts are being made to incorporate such features into the model and control them, in order to capture and synthesize speech in a person's speaking style.
High quality multi-speaker speech synthesis (with prosody consideration) in a few shot manner is an interesting and challenging research problem. Present approaches for state-of-the-art TTS (text-to-speech synthesis) such as Tacotron (Shen et al., 2018), Fast Speech (Ren et al., 2019) and Fast Speech 2 (Ren et al., 2020) have focused on generating the speaking style of a single speaker. These approaches do not generate audio for multiple speakers. Some of the current approaches (Jia et al., 2018; Chen et al., 2020; Arik et al., 2018; Ping et al., 2017) have used a speaker embedding to capture the identity and speaking style of the person in the speech. Such approaches fail to generate expressive speech as they do not take prosodic features and emotions into account, and hence the generated speech is of lower quality. While some of these approaches consider a zero-shot setting for multi-speaker speech synthesis, none of them consider few shot explicit prosody transfer. Other approaches (Skerry-Ryan et al., 2018) rely on prosodic features such as fundamental frequency, duration and energy to generate expressive speech. Such approaches can generate expressive speech for speakers that are already part of the training set, but cannot do so in a few shot manner for new speakers.
We propose a novel approach, FSM-SS (Few Shot Multi-speaker Speech Synthesis), that is capable of generating speech in an unseen person's speaking style in a few shot manner. Our model uses a non-autoregressive feed forward transformer based architecture (Ren et al., 2020) along with adaptive normalization to generate speech in an unseen person's style. The model takes as inputs: (a) an unseen text, and (b) a reference speech sample of an unseen speaker, and generates high quality speech for the given text in the given person's speaking style.
Our main contributions are as follows:
• We have proposed a novel few shot approach (FSM-SS) that uses adaptive normalization along with a non-autoregressive feed forward transformer based architecture. FSM-SS can generate multi-speaker speech output in a few shot manner, given an input unseen text and an unseen person's reference speech sample.
• For adaptive normalization, we have proposed two architectures, based on convolution and on multi-head attention, to capture the prosodic properties in the network through affine parameters. This helps to capture the various affine parameters based on speaker embedding, pitch and energy.
• We have proposed that the affine parameters of instance normalization are able to capture the information of speaker identity, pitch and energy. Conditioning on the pitch, energy and speaker embedding generates personalized and temporally smoother speech which captures the speaking style of a person much better than known state-of-the-art approaches.
• Using extensive experiments on multi-speaker VCTK and LibriTTS datasets, we show both qualitative and quantitative improvements over prior approaches along with high quality of output and the capability of our approach to generate speech for a wide variety of unseen speakers.
• FSM-SS can also be used as a voice morphing tool by varying the embedding, frequency and energy inputs to the adaptive normalization module.
2 Related Work
Prosody and speaking-style modeling have been studied since the era of HMM-based speech synthesis. In (Eyben et al., 2012), expressive clusters are generated
using hierarchical k-means clustering and then HMM-based
speech synthesis is used to provide a flexible framework to
model the varying expressions. In (Nose et al., 2007), multi-
ple emotional expressions and speaking styles of speech are
modeled in a single model by using a multiple-regression hid-
den semi-Markov model and the authors proposed estimating
the transformation matrix for a set of predefined style vectors.
Our approach uses non-autoregressive deep neural networks
based method instead of HMM-based speech generation.
Various efforts such as ToBI (Silverman et al., 1992), Au-
ToBI (Rosenberg, 2010), INTSINT (Hirst, 2004), SLAM
(Obin et al., 2014) have described methods for the annotation
and automatic labeling of prosody. Such methods often re-
quire domain experts, however, and inter-rater annotations
can differ substantially. Our approach uses deep learning tech-
niques to transfer the prosodic features on generated speech
instead of manual labeling of prosody.
After the advent of deep learning techniques, a lot of work
has been done in text to speech generation on multiple speak-
ers. VoiceLoop (Taigman et al., 2017) proposed a novel
architecture based on a fixed size memory buffer that can gen-
erate speech from voices unseen during training. However,
obtaining good results required tens of minutes of enroll-
ment speech and transcripts for a new speaker. (Nachmani
et al., 2018) extended VoiceLoop to utilize a target speaker
encoding network to predict a speaker embedding. This net-
work is trained jointly with the synthesis network using a
contrastive triplet loss to ensure that embeddings predicted
from utterances by the same speaker are closer than embed-
dings computed from different speakers. In addition, a cycle-
consistency loss is used to ensure that the synthesized speech
encodes to a similar embedding as the adaptation utterance.
Our proposed approach (FSM-SS) uses a pretrained speaker
embedding model (Wan et al., 2017) to feed speaker embed-
ding via adaptive normalization into a non-autoregressive
architecture to generate speech for an unseen speaker.
(Amodei et al., 2015) introduced a multispeaker variation of Tacotron which learned low-dimensional speaker embeddings for each training speaker; phoneme durations are predicted first and then used as inputs to the frequency model. The CNN-based multispeaker model of (Ping et al., 2017) develops many sophisticated mechanisms in the speaker embedding and attention blocks to ensure the synthesized quality. These systems learn a fixed set of speaker embeddings and therefore only support synthesis of voices already seen during training. (Amodei et al., 2015) and (Ping et al., 2017) use autoregressive generation with learned speaker embeddings, whereas our proposed approach (FSM-SS) uses adaptive normalization along with a non-autoregressive multi-head attention architecture (Ren et al., 2020) to generate speech, leading to faster training and inference and better quality as compared to these methods. Adaptive normalization in FSM-SS helps in few shot multi-speaker speech synthesis.
(Arik et al., 2018) used multi-head attention for generating the speaker embedding. To demonstrate its effectiveness, they used the DeepVoice 3 (Ping et al., 2017) TTS architecture to generate multi-speaker speech. For speaker adaptation, they showed a few shot approach to generate speech for unseen speakers. They used a speaker classification method with convolution and GRU layers to calculate a PLDA score, which is then passed to a sigmoid layer. FSM-SS leverages a 256-dimensional speaker embedding based on (Wan et al., 2017), rather than a 128-dimensional multi head attention based speaker embedding, to generate speech for unseen speakers. The addition of pitch and energy into our proposed normalization helps in the transfer of prosodic features from the reference speech sample to the generated speech.
A VAE-based method has been further leveraged (Hsu et al., 2018) to handle noisy multi-speaker speech data; it can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. Our proposed method uses the disentangled pitch and energy of the reference speech sample to synthesize speech, whereas Hsu et al. (2018) use a probabilistic hierarchical generative model to disentangle style attributes. Jia et al. (2018) used RNN-based
Tacotron 2 that enjoys the benefits of recurrent attention com-
putation and leverages the attention information in previous
steps to help the attention calculation in the current step.
They have utilized a network that is independently-trained
for a speaker verification task on a large dataset of untran-
scribed audio from tens of thousands of speakers, using a
state-of-the-art generalized end-to-end loss. (Chen et al.,
2020) introduced a diagonal constraint on the weight matrix
of the encoder-decoder attention during training and infer-
ence and employed a bottleneck structure in the decoder
pre-net which encourages the decoder to generalize on the
representation of speech frame instead of memorization, and
forces the decoder to attend to text/phoneme inputs. Our
proposed method uses adaptive normalization architecture
along with non-autoregressive multi head attention network
to generate high quality speech on unseen speakers.
The methods discussed above use either CNN- or Transformer-based TTS (Li et al., 2018; Ping et al., 2017), which can speed up training over RNN-based models (Shen et al., 2018). However, all of these models generate each melspectrogram frame conditioned on the previously generated ones and therefore suffer from slow inference speed: these autoregressive models generate melspectrograms one by one, without explicitly leveraging the alignments between text and speech. Fast Speech (Ren et al., 2019) speeds up synthesis for a single speaker through parallel generation of the melspectrogram; it relies on an autoregressive teacher model to predict the phoneme durations and generate melspectrograms for knowledge distillation. Fast Speech 2 (Ren et al., 2020) uses the ground truth for phoneme duration prediction and incorporates other features such as pitch and energy in a variance predictor for single speaker speech synthesis.
Our proposed method (FSM-SS) leverages the feed-forward
transformer based non-autoregressive approach along with
variance adapter (Ren et al., 2020) but uses a novel adaptive
normalization architecture to capture the reference style of
an unseen speaker. This technique helps FSM-SS to deliver
high quality multi-speaker output personalization in a few
shot manner.
Many previous works have explicitly focused on generating style based text to speech. (Skerry-Ryan et al., 2018) incorporated architectures to generate a prosody embedding and a speaker embedding, which are combined with the text encoder representation and passed to a Tacotron based decoder to generate the speech. Conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail, even when the reference and synthesis speakers are different. This method uses a Tacotron based autoregressive approach, whereas our proposed method employs a multi head attention based non-autoregressive method along with adaptive normalization to capture the prosodic features. (Wang et al., 2018) proposed "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron and learn to model a large range of acoustic expressiveness. The architecture consists of a reference encoder, style attention, style embedding, and a sequence-to-sequence (Tacotron) model. We use adaptive normalization to capture prosodic features rather than an attention network to capture the style. (Zhang et al., 2019) introduced a Variational Autoencoder (VAE) into an end-to-end speech synthesis model to learn latent representations of speaking styles in an unsupervised manner within a Tacotron 2 based framework, with KL annealing used to mitigate KL collapse during training. We use an adaptive normalization based architecture to capture the style features rather than a variational autoencoder.
(Sun et al., 2020a) introduced the vector-quantized VAE (VQ-VAE) and a two-stage training approach to generate high fidelity speech samples. Our proposed method (FSM-SS) uses a normalization based architecture along with multi head attention instead of a VQ-VAE based Tacotron architecture. (Sun et al., 2020b) aims to achieve disentangled control of each prosody attribute at different levels (utterance, word and phone levels) and proposes a multilevel model based on Tacotron 2 integrated with a hierarchical latent variable model. Our proposed method uses adaptive normalization instead of a hierarchical approach to capture the prosody.
3 FSM-SS Design
In this section, we present the overall design and architec-
ture of FSM-SS including adaptive normalization and non-
autoregressive multi-head attention based feed forward trans-
former for few shot multi-speaker speech synthesis.
3.1 Model Overview
The speech synthesis model uses non-autoregressive multi-
head attention feed forward transformer (Ren et al., 2020)
which is state of the art in speech synthesis for a single
speaker. It helps in parallel melspectrogram generation and
speeds up the speech synthesis compared to Fast Speech and
autoregressive models such as Transformer based TTS (Li
et al., 2018; Ping et al., 2017) and Tactotron based TTS (Shen
et al., 2018). It uses multi-head attention-based encoder-
decoder architecture along with the variance adapter method.
We have designed two architectures for adaptive normalization: one based on a multi-head attention network (Vaswani et al., 2017) and another on a convolution network, to learn the affine parameters in normalization. The inputs to the normalization module are the speaker embedding and the per-frame pitch and energy values extracted from the given reference speech sample of an unseen person. For few shot inference, the trained architecture is fine-tuned on a few audio-text pairs of the unseen speaker.
This architecture is used to generate the melspectrogram, and the final audio is generated by using Griffin-Lim spectrogram inversion (Griffin and Jae Lim, 1984) and the Wave Glow architecture (Prenger et al., 2019).
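As an illustration of the Griffin-Lim path, a minimal sketch using librosa's mel inversion is shown below. It assumes a linear-power mel spectrogram and is only a fallback; the pretrained Wave Glow vocoder is used for the reported results.

```python
import librosa

def mel_to_waveform(mel, sr=22050, n_fft=1024, hop_length=256):
    # Griffin-Lim based inversion of an (80, frames) mel spectrogram back to audio.
    # Illustrative only; WaveGlow is the vocoder used for the final outputs.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```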
3.2 Architecture
Fig. 1 illustrates the architecture used in FSM-SS. During training, it takes as input the text-audio pairs of a person along with his (her) reference speech samples. During inference, it takes a few unseen text-audio pairs along with one reference speech sample of an unseen speaker, to generate speech in that person's speaking style. The adaptive normalization is applied during both the encoder and decoder stages (Fig. 1) and hence helps in prosody transfer in a few shot manner.
Figure 1: FSM-SS : Few Shot style based text to Speech
Generation Architecture
Feed-Forward Transformer
The architecture of the Feed-Forward Transformer (Fig. 2) is based on a multi-head self-attention network and a position-wise feed-forward network which consists of two Conv1D layers and normalization stages. The proposed method stacks multiple FFT blocks, with phoneme embedding and positional encoding as input, on the phoneme side, and multiple FFT blocks for the melspectrogram generation, with the variance adapter in between.
Figure 2: Left: FFT block, Centre: Variance Adapter, Right: Variance Predictor
Adaptive Normalization Stage
This stage consists of adaptive normalization with learnable parameters γ and β, which are computed through two proposed approaches: one based on a convolution network and one on a multi-head attention network. This helps in adjusting the bias and scale of the normalized features to learn the required properties of the speech signal, including prosody. This module enables adaptive instance normalization of the feature map coming as output from the prior FFT block (Fig. 2).
Convolution based Normalization
We have taken three audio-related features: the speaker embedding, fundamental frequency and energy of the reference speech sample, which are important for capturing the prosody of the reference speech. These three features are passed into convolution layers to generate the affine parameters (Fig. 3). The parameter ρ is used to combine these parameters (Equation (1)). The value of ρ is constrained to the range [0, 1] simply by imposing bounds at the parameter update step. We employ a residual connection around each of the two sub-layers, x = z + Sublayer(z), followed by layer normalization, where Sublayer(z) is the convolution function implemented by the sub-layer itself. The other part of this equation is instance normalization with γ_SE and β_SE coming from the speaker embedding. The second equation (Equation (2)) generates the affine parameters from the energy and pitch values for each frame of the reference speech sample.

y = \rho(\gamma_{LN} x_{LN} + \beta_{LN}) + (1 - \rho)(\gamma_{SE} x_{IN} + \beta_{SE})   (1)

\gamma_{energy} \cdot (\gamma_{pitch} \cdot y + \beta_{pitch}) + \beta_{energy}   (2)
Figure 3: Convolution based normalization in proposed FSM-
SS architecture
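To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of the convolution based normalization. The module layout and tensor shapes are illustrative assumptions; in particular, the exact channel plumbing (e.g. reducing the frame axis to 512 as described in the Model Details) is simplified here.

```python
import torch
import torch.nn as nn

class ConvAdaptiveNorm(nn.Module):
    """Sketch of Equations (1)-(2): affine parameters from the speaker
    embedding, pitch and energy modulate the normalized FFT features."""
    def __init__(self, channels=256, spk_dim=256):
        super().__init__()
        self.rho = nn.Parameter(torch.tensor(0.7))          # mixing weight, bounded to [0, 1]
        self.layer_norm = nn.LayerNorm(channels)            # provides gamma_LN, beta_LN
        # 1D convolutions predicting gamma/beta from speaker embedding, pitch and energy
        self.spk_conv = nn.Conv1d(spk_dim, 2 * channels, kernel_size=1)
        self.pitch_conv = nn.Conv1d(1, 2 * channels, kernel_size=9, padding=4)
        self.energy_conv = nn.Conv1d(1, 2 * channels, kernel_size=9, padding=4)

    def instance_norm(self, x):
        # x: (batch, frames, channels) -> normalize over frames, per channel
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True) + 1e-5
        return (x - mean) / std

    def forward(self, x, spk_emb, pitch, energy):
        # x: (B, T, C), spk_emb: (B, spk_dim), pitch/energy: (B, T)
        rho = self.rho.clamp(0.0, 1.0)
        g_se, b_se = self.spk_conv(spk_emb.unsqueeze(-1)).chunk(2, dim=1)       # (B, C, 1)
        y = rho * self.layer_norm(x) + (1 - rho) * (
            g_se.transpose(1, 2) * self.instance_norm(x) + b_se.transpose(1, 2))  # Eq. (1)
        g_p, b_p = self.pitch_conv(pitch.unsqueeze(1)).chunk(2, dim=1)           # (B, C, T)
        g_e, b_e = self.energy_conv(energy.unsqueeze(1)).chunk(2, dim=1)
        y = g_e.transpose(1, 2) * (g_p.transpose(1, 2) * y + b_p.transpose(1, 2)) \
            + b_e.transpose(1, 2)                                                # Eq. (2)
        return y
```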
Multi Head Attention Based Normalization
In this architecture (Fig. 4), we concatenate the speaker embedding (a 256-dimensional vector) with the frequency and energy of the reference speech sample to generate a tensor of size (audio frames × 258 × batches). This is then fed into the multi-head attention network to generate the affine parameters. These affine parameters (Equation (3)) are used to bias (β_attention) and scale (γ_attention) the output feature map coming from the previous FFT block (Fig. 2).

\rho(\gamma_{LN} x_{LN} + \beta_{LN}) + (1 - \rho)(\gamma_{attention} x_{IN} + \beta_{attention})   (3)

The concatenated speaker embedding, frequency and energy features are passed through independent linear layers to become the query, key and values of the multi-head attention layer. The multi-head attention equation is given by:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V   (4)
Figure 4: Multi Head Attention Based Normalization in pro-
posed FSM-SS architecture
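A hedged PyTorch sketch of Equations (3) and (4) follows; the module names, projection layers and head count are illustrative assumptions built from the description above, not the exact released implementation.

```python
import torch
import torch.nn as nn

class AttentionAdaptiveNorm(nn.Module):
    """Sketch of Equation (3): gamma/beta predicted by multi-head attention
    over the concatenated speaker embedding, pitch and energy (258 dims)."""
    def __init__(self, channels=256, cond_dim=258, heads=6):
        super().__init__()
        self.rho = nn.Parameter(torch.tensor(0.7))
        self.layer_norm = nn.LayerNorm(channels)
        self.q_proj = nn.Linear(cond_dim, cond_dim)
        self.k_proj = nn.Linear(cond_dim, cond_dim)
        self.v_proj = nn.Linear(cond_dim, cond_dim)
        self.attn = nn.MultiheadAttention(embed_dim=cond_dim, num_heads=heads, batch_first=True)
        self.to_affine = nn.Conv1d(cond_dim, 2 * channels, kernel_size=1)

    def forward(self, x, spk_emb, pitch, energy):
        # x: (B, T, C); spk_emb: (B, 256); pitch/energy: (B, T)
        T = x.size(1)
        cond = torch.cat([spk_emb.unsqueeze(1).expand(-1, T, -1),
                          pitch.unsqueeze(-1), energy.unsqueeze(-1)], dim=-1)   # (B, T, 258)
        q, k, v = self.q_proj(cond), self.k_proj(cond), self.v_proj(cond)
        attn_out, _ = self.attn(q, k, v)                                        # Eq. (4)
        gamma, beta = self.to_affine(attn_out.transpose(1, 2)).chunk(2, dim=1)
        gamma, beta = gamma.transpose(1, 2), beta.transpose(1, 2)               # (B, T, C)
        x_in = (x - x.mean(1, keepdim=True)) / (x.std(1, keepdim=True) + 1e-5)  # instance norm
        rho = self.rho.clamp(0.0, 1.0)
        return rho * self.layer_norm(x) + (1 - rho) * (gamma * x_in + beta)     # Eq. (3)
```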
Variance Adapter
The variance adapter is used to predict the prosodic features of speech such as duration, fundamental frequency and energy. It consists of three predictors, namely the duration predictor, the pitch predictor and the energy predictor. During the training phase, the three predictors are trained independently with the ground truth duration, pitch and energy and optimized with mean square error.
Variance Predictor
The variance predictor consists of a 2-layer 1D-convolution network with ReLU activation, each layer followed by layer normalization and dropout, and an extra linear layer to project the hidden states into the output sequence (Ren et al., 2020). For the duration predictor, the output is the length of each phoneme on a logarithmic scale. For the pitch and energy predictors, the output is the frame-level fundamental frequency and energy of the melspectrogram, respectively.
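The variance predictor layout described above can be sketched as follows in PyTorch; this is a simplified illustration using the filter size, kernel size and dropout reported in the Model Details, not the authors' exact code.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """2-layer Conv1D predictor: conv -> ReLU -> layer norm -> dropout, twice,
    then a linear projection to a scalar per position (duration / F0 / energy)."""
    def __init__(self, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (B, T, hidden)
        y = self.drop(self.ln1(torch.relu(self.conv1(x.transpose(1, 2)).transpose(1, 2))))
        y = self.drop(self.ln2(torch.relu(self.conv2(y.transpose(1, 2)).transpose(1, 2))))
        return self.linear(y).squeeze(-1)        # (B, T)
```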
3.3 Few shot approach for style adaptation
We use a few shot approach for speaker adaptation (at inference time) using the reference speech sample of an unseen person and text. At inference, we update the whole model on a few samples of unseen speech and text pairs, while the reference speech sample remains the same since it provides prosody information via adaptive normalization. Training the whole model with all the losses gives more degrees of freedom. Early stopping is used to avoid overfitting.
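The adaptation procedure can be sketched as a short fine-tuning loop. The function below is illustrative: adaptation_pairs, ref_features and loss_fn are placeholders for the speaker's phoneme/mel pairs, the (speaker embedding, pitch, energy) of the reference sample, and the training losses, and the model call signature is an assumption.

```python
import torch

def few_shot_adapt(model, adaptation_pairs, ref_features, loss_fn,
                   steps=200, lr=1e-4, patience=20):
    """Fine-tune the whole trained model on a handful of (text, audio) pairs
    of the unseen speaker, keeping the reference speech features fixed,
    with early stopping to avoid overfitting."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-9)
    best, stale = float("inf"), 0
    for step in range(steps):
        total = 0.0
        for phonemes, mel_target in adaptation_pairs:       # typically 1-5 pairs
            optimizer.zero_grad()
            mel_pred = model(phonemes, *ref_features)       # reference sample drives adaptive norm
            loss = loss_fn(mel_pred, mel_target)
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total < best - 1e-4:
            best, stale = total, 0
        else:
            stale += 1
            if stale >= patience:                           # early stopping
                break
    return model
```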
4 Experiments
4.1 Implementation Details
Datasets
We train and evaluate the model on two datasets
namely VCTK (Veaux et al., 2017) and LibriTTS multi-
speaker dataset (Panayotov et al., 2015). We have used 44
hours of speech with 108 speakers of the VCTK dataset and
586 hours of speech with 2456 speakers of the LibriTTS
dataset.
Preprocessing Steps
To alleviate the mispronunciation problem, we convert the text sequence into a phoneme sequence (Amodei et al., 2015; Shen et al., 2018) using an open-source grapheme-to-phoneme tool (g2p). We extract the phoneme durations with MFA (McAuliffe et al., 2017), an open-source system for speech-text alignment, to improve the alignment accuracy.
We transform the raw waveform into melspectrograms by setting the frame size and hop size to 1024 and 256 with respect to the sample rate of 22050 Hz. We extract the fundamental frequency, F0, from the raw waveform with the same hop size to obtain the pitch of each frame, and compute the L2-norm of the amplitude of each STFT frame as the energy. The pitch and energy values are fed into the proposed normalization method.
In the training process, we quantize the F0 and energy of each frame to 256 possible values and encode them into sequences of one-hot vectors, p and e respectively. We feed the pitch and energy embeddings of p and e at the variance adapter stage. The outputs of the pitch and energy predictors are the values of F0 and energy, which are optimized with mean square error.
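A rough Python sketch of this preprocessing is given below. librosa and pyworld are assumed as the mel and F0 extractors (the paper does not name its tools), and the linear bin spacing used for quantization is likewise an assumption.

```python
import numpy as np
import librosa
import pyworld as pw  # assumed F0 extractor; not specified in the paper

SR, N_FFT, HOP = 22050, 1024, 256

def extract_features(wav):
    """Mel spectrogram, per-frame F0 and per-frame energy (L2 norm of STFT amplitude)."""
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=80)
    # Frame-level fundamental frequency with the same hop size
    x = wav.astype(np.float64)
    f0, t = pw.dio(x, SR, frame_period=HOP / SR * 1000)
    f0 = pw.stonemask(x, f0, t, SR)
    # Energy: L2 norm of the amplitude of each STFT frame
    energy = np.linalg.norm(np.abs(librosa.stft(wav, n_fft=N_FFT, hop_length=HOP)), axis=0)
    return mel, f0, energy

def quantize(values, n_bins=256):
    """Quantize F0/energy to 256 ids for the variance adapter embeddings
    (linear bins here; the bin spacing is not specified in the paper)."""
    positive = values[values > 0]
    bins = np.linspace(positive.min(), positive.max(), n_bins - 1) if positive.size else np.array([0.0])
    return np.digitize(values, bins)   # integer ids in [0, 255]
```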
We generate the speaker embedding from the pretrained model of (Wan et al., 2017), Generalized end-to-end loss for speaker verification, which is trained on: (1) LibriSpeech Other (Panayotov et al., 2015), which contains 461 hours of speech from a set of 1,166 speakers disjoint from those in the clean subsets, (2) VoxCeleb (Nagrani et al., 2017), and (3) VoxCeleb2 (Chung et al., 2018), which contain 139K utterances from 1,211 speakers and 1.09M utterances from 5,994 speakers, respectively.
Model Details
We use 4 feed forward transformer blocks at the phoneme encoding stage and at the output melspectrogram decoder stage. The dimension of the phoneme embedding and of the self-attention hidden layer is set to 256 in every FFT block. The number of attention heads is set to 2. The output linear layer converts the 256-dimensional hidden states into 80-dimensional mel-spectrograms. The size of the phoneme vocabulary is 76, including punctuation.
The convolution based normalization architecture feeds the 256-dimensional speaker embedding into a 1D convolution layer to generate affine parameters. The fundamental frequency and energy of the reference speech are each fed to a 1D convolution layer to reduce the channel length from the maximum number of speech frames in the dataset to 512. The affine parameters are then calculated by adding a 1D convolution layer that generates a 256-channel output.
In the multi head attention based normalization architecture, the 256-dimensional speaker embedding is replicated along the time frames of the mel spectrogram and then concatenated with the frequency and energy features to generate 258-dimensional feature vectors for all time steps (audio frames). It is then fed to multi head attention with the number of heads set to 6. The generated feature map is then fed to a 1D convolution to generate a 256-channel output, which is added to the output of layer normalization using the learnable parameter ρ.
The variance predictor consists of 2 blocks of Conv1D, ReLU, layer normalization and dropout layers. The kernel size of the 1D-convolutions is set to 3, with input/output sizes of 256/256 for both layers, and the dropout rate is set to 0.5.
The pretrained Wave Glow architecture (Prenger et al., 2019) is used as a vocoder to generate the speech at 22050 Hz. It is trained on the LibriSpeech dataset at a sampling frequency of 22050 Hz.
Training and Inference
We use a batch size of 96 for the convolution-based normalization method and 64 for the multi-head attention based normalization technique in the proposed architecture, with the initial value of ρ set to 0.7. The Adam optimizer is used with β1 = 0.9, β2 = 0.98, ε = 10e-9. Training takes around 120K steps for the convolution-based normalization method on the VCTK and LibriTTS datasets. The multi-head attention normalization based model takes 470K steps and 800K steps to converge on the VCTK (Veaux et al., 2017) and LibriTTS (Panayotov et al., 2015) datasets, respectively. We trained the model on a machine with 4 V100 GPUs. Note that the length of the reference speech sample and the speech generated from an unrelated text input can differ.
During inference, we use a few shot approach with 0 to 5 samples to generate speech in the speaking style of the reference person. We use the Wave Glow vocoder to generate the final speech from the melspectrogram.
4.2 Implementation Results
Speaker Embedding Space
Speech samples are generated for the test speakers to visualize how well different samples are spread in the embedding space. We generate the 256-dimensional embedding of every speech sample and perform a t-SNE visualization, which shows that synthesized utterances of the same speaker tend to lie very close in the embedding space, demonstrating the consistency of generation. The visualization is done on speech synthesized in a zero-shot manner on unseen speakers. Figure 5 shows that the generated embeddings of male and female speakers form distinct clusters. In Figure 6 we show a t-SNE visualization demonstrating that we are able to correctly generate samples for speakers that are far away from the clusters, which demonstrates the variety of unseen speakers that can be handled by our approach.
Figure 5: t-SNE visualization of speaker embeddings of gen-
erated samples of VCTK dataset. Cluster id 0 to 4 refers to
male speakers and 5 to 9 refers to female speakers
Speaker Similarity
We expect utterances from the same speaker to have high similarity values and those from distinct speakers to have lower ones. We evaluate cosine similarity as the similarity metric on the speaker embeddings of generated samples versus actual samples. The speech is generated in a zero-shot manner on unseen speakers. Figure 7 shows the higher similarity of the embeddings of actual and generated speech for the same speaker. Figure 8 shows that the median values of the cosine similarities are higher for the same speaker and lower for different speakers.
Figure 6: t-SNE visualization of speaker embeddings of male actual and generated samples of the VCTK dataset. The left side shows the actual embedding space of all male speakers in the dataset. The right side shows the embedding space of train (green), val (red: cluster ids 7, 26, 42) and test (black: cluster ids 0, 6, 9, 12, 13) speakers.
Figure 7: Cross similarity between utterances of actual speakers (x-axis) and generated speakers (y-axis) of the VCTK dataset.
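The similarity metric used here is plain cosine similarity between the 256-dimensional speaker embeddings; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between the speaker embedding of an actual
    utterance and that of a generated utterance."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))
```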
Speaker Classification with the few shot approach
We use the few shot approach for speaker adaptation by providing different audio and text pairs of the unseen speaker. We use a Gaussian naive Bayes multilabel classifier (Vikramkumar et al., 2014) whose accuracy is around 92%. Figure 9 shows that with an increase in the number of samples in the few shot approach, the probability of speaker identification increases from 0.39 to 0.59. (Arik et al., 2018) has shown an improvement to a probability of 0.55 when 5 samples are used for the few shot approach.
Audio Quality
Twenty samples of speakers with different accents are taken from the VCTK test set and twenty samples of English speaking speakers from LibriTTS are used to compute the mean opinion score. The text content is kept consistent among different systems so that all testers only examine the audio quality without other interfering factors. Table 1 shows a better MOS score than NVS (Arik et al., 2018), as they use a 128-dimensional speaker embedding based on a multi head attention network with a transformer based TTS architecture to generate samples, whereas FSM-SS uses a 256-dimensional speaker embedding based on (Wan et al., 2017), pretrained on 3 datasets, along with pitch and energy values for speech synthesis.
Figure 8: Normalized histogram of similarity values between utterances of actual speakers and generated speakers of the VCTK dataset.
Figure 9: Probability of belonging to a speaker class in the few shot approach with different numbers of unseen speaker sample and text pairs.
Method VCTK↑ LibriTTS↑
GT 4.05 ±0.05 4.10 ±0.24
GTmel+waveglow 3.84 ±0.14 3.92 ±0.46
Conv+waveglow 3.75 ±0.56 3.45 ±0.68
Attention+waveglow 3.72 ±0.24 3.38 ±0.08
NVS 3.13 ±0.42 -
Table 1: MOS scores of FSM-SS with 95% confidence intervals for the VCTK and LibriTTS datasets. GT - ground truth, GTmel+waveglow - ground truth mel spectrogram with waveglow as vocoder, Conv+waveglow - convolution based normalization in FSM-SS with the waveglow vocoder, Attention+waveglow - multi head attention based normalization in FSM-SS with the waveglow vocoder, NVS - Neural Voice Cloning with few samples method.
Apart from the subjective evaluation, we use the metrics Gross Pitch Error (Nakashika et al., 2016), Voicing Decision Error (Nakashika et al., 2016), F0 Frame Error (Wei Chu and Alwan, 2009) and Mel Cepstral Distortion (Kubichek, 1993), which are used in audio signal processing to measure the prosody of a signal. The qualitative and quantitative metrics are computed on the speech generated in the zero-shot approach on unseen speakers. The generated outputs of the FSM-SS architecture are available at the link in footnote 1.
Table 2 shows that the convolution-based normalization has lower errors compared to the multi-head attention based normalization. Table 3 shows that the few-shot approach achieves lower errors than the zero-shot approach. (Skerry-Ryan et al., 2018) has a higher MCD (10.87) due to the use of Tacotron based encoders to capture the pitch and speaker embedding, whereas FSM-SS uses the pitch, energy and speaker embedding of the reference speech through the proposed normalization methods. (Sun et al., 2020b) have shown a lower MCD (8.8) on the LibriTTS dataset on seen speakers due to their multi-resolution prosody architecture, while FSM-SS has an MCD value of 9.78 on unseen speakers.
Method MCD↓ GPE↓ VDE↓ FFE↓
Conv-1 13.65 28.70 18.04 35.46
Attention-1 14.63 30.45 19.60 37.64
Conv-2 14.15 30.50 19.98 38.26
Attention-2 15.17 32.51 21.06 41.64
Table 2: Quantitative metrics for FSM-SS with the zero shot approach. Conv: convolution based normalization in FSM-SS, Attention: multi head attention based normalization in FSM-SS, 1 - VCTK dataset, 2 - LibriTTS dataset.
Method MCD↓ GPE↓ VDE↓ FFE↓
Conv-1 09.78 24.45 14.47 28.90
Attention-1 10.56 26.45 16.12 30.36
Conv-2 11.05 26.76 16.67 30.58
Attention-2 12.71 27.52 17.23 31.62
Table 3: Quantitative metrics for FSM-SS with the few shot approach (5 samples). Conv: convolution based normalization, Attention: multi head attention based normalization, 1 - VCTK dataset, 2 - LibriTTS dataset.
4.3 Ablation Study
We have done an ablation study on the convolution based normalization architecture in FSM-SS with the zero-shot approach on the VCTK dataset. The base model with pitch and energy in the normalization stage, but without the speaker embedding of the reference speech, does not show very good results, as the information of the speaker identity is missing in the architecture. We then use the speaker embedding in the normalization steps with the base model and do not incorporate the pitch and energy values of the reference unseen speaker. The quality of the output degrades as the variance predictor is not able to predict the required duration, frequency and energy values. Adding pitch values along with the speaker embedding of the reference speech helps in improving the speech quality. Table 4 shows the decreasing values of the different errors when pitch and energy are added in the normalization architecture.
4.4 Extension of Proposed Method
Voice morphing
We can independently tune the speaker embedding, fundamental frequency and energy of the reference speech, which are fed into the normalization steps, to generate morphed speech. Figure 10 shows that independently modulating the pitch and energy values leads to voice morphing. This has many applications in the virtual world, the gaming industry, voice modulation, etc.

1Generated audios: https://sites.google.com/view/fsmss/home

Method MCD↓ GPE↓ VDE↓ FFE↓ MoS↑
BM+P+E 21.68 40.45 45.76 65.87 2.64±0.14
BM+SE 20.65 38.71 39.56 56.64 2.92±0.26
BM+SE+P 16.38 33.65 29.76 45.76 3.32±0.08
FSM-SS 13.65 28.70 18.04 35.46 3.72±0.24
Table 4: Ablation study of FSM-SS. BM is the base model without the normalization method, SE is the speaker embedding in normalization, and P and E are the pitch and energy values in the normalization network.

Figure 10: Left: synthesized samples for a reference speaker and text. Centre: the pitch is modulated by increasing F0 to 1.25F0 while keeping the energy values constant for the same reference speaker and text. Right: the energy values are reduced from E to 0.5E while keeping the pitch values constant for the same reference speaker and text.
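Conceptually, morphing amounts to rescaling the pitch and energy contours that feed the adaptive normalization module; the sketch below assumes a hypothetical call signature for the trained FSM-SS model.

```python
def morph(model, phonemes, spk_emb, f0, energy, f0_scale=1.25, energy_scale=0.5):
    """Scale the pitch and/or energy contours fed to adaptive normalization
    (e.g. F0 -> 1.25*F0, E -> 0.5*E as in Figure 10) while keeping the
    other inputs fixed. The model call signature is an assumption."""
    return model(phonemes, spk_emb, f0 * f0_scale, energy * energy_scale)
```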
5 Conclusions
In this paper, we have proposed a novel few shot approach
(FSM-SS) that uses adaptive normalization along with non-
autoregressive feed forward transformer based architecture.
FSM-SS can generate multi-speaker speech output in a few
shot manner, given an input unseen text and an unseen per-
son’s reference speech sample. For adaptive normalization,
we have proposed two architectures based on convolution
and on multi-head attention to capture the prosodic prop-
erties in the network through affine parameters. This helps
to capture the various affine parameters based on speaker
embedding, pitch and energy. Using extensive experiments
on multi-speaker VCTK and LibriTTS datasets, we show
both qualitative and quantitative improvements over prior ap-
proaches along with high quality of output and the capability
of our approach to generate speech for a wide variety of un-
seen speakers. FSM-SS can also be used as a voice morphing
tool by varying the embedding, frequency and energy inputs
to the adaptive normalization module.
References
D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Bat-
tenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng,
G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski,
A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, and
Z. Zhu. Deep speech 2: End-to-end speech recognition in
english and mandarin. 12 2015.
S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou. Neural
voice cloning with a few samples. 02 2018.
M. Chen, X. Tan, Y. Ren, J. Xu, H. Sun, S. Zhao, and
T. Qin. Multispeech: Multi-speaker text to speech with
transformer, 06 2020.
J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep
speaker recognition. pages 1086–1090, 09 2018. doi:
10.21437/Interspeech.2018-1929.
F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan,
M. Gales, and K. Knill. Unsupervised clustering of emo-
tion and voice styles for expressive tts. 03 2012. doi:
10.1109/ICASSP.2012.6288797.
D. Griffin and Jae Lim. Signal estimation from modified
short-time fourier transform. IEEE Transactions on Acous-
tics, Speech, and Signal Processing, 32(2):236–243, 1984.
D. Hirst. Lexical and non-lexical tone and prosodic typology.
03 2004.
W.-N. Hsu, Y. Zhang, R. Weiss, H. Zen, Y. Wu, Y. Wang,
Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang.
Hierarchical generative modeling for controllable speech
synthesis, 10 2018.
Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen,
P. Nguyen, R. Pang, I. Moreno, and Y. Wu. Transfer
learning from speaker verification to multispeaker text-
to-speech synthesis, 06 2018.
R. Kubichek. Mel-cepstral distance measure for objective
speech quality assessment. In Proceedings of IEEE Pa-
cific Rim Conference on Communications Computers and
Signal Processing, volume 1, pages 125–128 vol.1, 1993.
N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou. Close
to human quality tts with transformer. 09 2018.
M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Son-
deregger. Montreal forced aligner: Trainable text-speech
alignment using kaldi. In INTERSPEECH, 2017.
E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf. Fitting
new speakers based on a short untranscribed sample. 02
2018.
A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb:
A large-scale speaker identification dataset. ArXiv,
abs/1706.08612, 2017.
T. Nakashika, T. Takiguchi, and Y. Minami. Non-parallel
training in voice conversion using an adaptive restricted
boltzmann machine. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing, 24(11):2032–2045,
2016.
T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A
style control technique for hmm-based expressive speech
synthesis. IEICE Transactions, 90-D:1406–1413, 09 2007.
doi: 10.1093/ietisy/e90-d.9.1406.
N. Obin, J. Beliao, C. Veaux, and A. Lacheret. Slam: Auto-
matic stylization and labelling of speech melody. Speech
Prosody, 05 2014.
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Lib-
rispeech: An asr corpus based on public domain audio
books. In 2015 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pages 5206–
5210, 2015.
W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan,
S. Narang, J. Raiman, and J. Miller. Deep voice 3: 2000-
speaker neural text-to-speech. 10 2017.
R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-
based generative network for speech synthesis. In ICASSP
2019 - 2019 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pages 3617–
3621, 2019.
Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-
Y. Liu. Fastspeech: Fast, robust and controllable text to
speech. 05 2019.
Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu.
Fastspeech 2: Fast and high-quality end-to-end text-to-
speech. 06 2020.
A. Rosenberg. Autobi - a tool for automatic tobi annotation.
pages 146–149, 01 2010.
J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, Z. Yang,
Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. Saurous,
Y. Agiomvrgiannakis, and Y. Wu. Natural tts synthesis
by conditioning wavenet on mel spectrogram predictions.
pages 4779–4783, 04 2018. doi: 10.1109/ICASSP.2018.
8461368.
K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf,
C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg.
Tobi: A standard for labeling english prosody. 01 1992.
R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton,
J. Shor, R. Weiss, R. Clark, and R. Saurous. Towards end-
to-end prosody transfer for expressive speech synthesis
with tacotron. 03 2018.
G. Sun, Y. Zhang, R. Weiss, Y. Cao, H. Zen, A. Rosenberg,
B. Ramabhadran, and Y. Wu. Generating diverse and natu-
ral text-to-speech samples using a quantized fine-grained
vae and auto-regressive prosody prior, 02 2020a.
G. Sun, Y. Zhang, R. Weiss, Y. Cao, H. Zen, and Y. Wu.
Fully-hierarchical fine-grained prosody modeling for inter-
pretable speech synthesis, 02 2020b.
Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani. Voice
synthesis for in-the-wild speakers via a phonological loop.
07 2017.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all
you need. 06 2017.
C. Veaux, J. Yamagishi, and K. Macdonald. Cstr vctk corpus:
English multi-speaker corpus for cstr voice cloning toolkit.
2017.
Vikramkumar, V. B, and T. Tripathy. Bayes and naive bayes
classifier. 04 2014.
L. Wan, Q. Wang, A. Papir, and I. Moreno. Generalized
end-to-end loss for speaker verification. 10 2017.
Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Batten-
berg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. Saurous. Style
tokens: Unsupervised style modeling, control and transfer
in end-to-end speech synthesis. 03 2018.
Wei Chu and A. Alwan. Reducing f0 frame error of f0
tracking algorithms under noisy conditions with an un-
voiced/voiced classification frontend. In 2009 IEEE In-
ternational Conference on Acoustics, Speech and Signal
Processing, pages 3969–3972, 2009.
Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling. Learning latent
representations for style control and transfer in end-to-
end speech synthesis. pages 6945–6949, 05 2019. doi:
10.1109/ICASSP.2019.8683623.
Supplementary material of Few Shot Adaptive Normalization Driven
Multi-Speaker Speech Synthesis
Neeraj Kumar,
Hike Private Limited
neerajku@hike.in,
Srishti Goel,
Hike Private Limited
srishtig@hike.in,
Ankur Narang,
Hike Private Limited
ankur@hike.in,
Brejesh Lall ,
IIT Delhi
brejesh@ee.iitd.ac.in
1 Evaluation Metric
We use the metrics Gross Pitch Error (Nakashika et al., 2016), Voicing Decision Error (Nakashika et al., 2016), F0 Frame Error (Wei Chu and Alwan, 2009) and Mel Cepstral Distortion (Kubichek, 1993), which are used in audio signal processing to measure the prosody of a signal. All pitch and voicing metrics are computed using the output of the YIN (Cheveigné and Kawahara, 2002) pitch tracking algorithm. For all comparisons of predicted signals to target signals, we extend the shorter signal to the length of the longer signal using a domain-appropriate padding (0 for a time domain waveform and, for a log magnitude spectrogram, a 1e-6 stabilizing offset).
• Gross Pitch Error (Nakashika et al., 2016) - the percentage of frames for which the absolute pitch error is higher than a certain threshold; for speech, this threshold is usually 20%. Here p_t, p'_t are the pitch signals from the reference and predicted audio, v_t, v'_t are the voicing decisions from the reference and predicted audio, and [·] is the indicator function.

GPE = \frac{\sum_t \left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\,[v_t]\,[v'_t]}{\sum_t [v_t]\,[v'_t]}
• Voicing Decision Error (Nakashika et al., 2016) - the percentage of frames for which an incorrect voiced/unvoiced decision is made, where v_t, v'_t are the voicing decisions from the reference and predicted audio, [·] is the indicator function and T is the total number of frames.

VDE = \frac{\sum_{t=0}^{T-1} \left[\,v_t \neq v'_t\,\right]}{T}
• F0 Frame Error (Wei Chu and Alwan, 2009) - the percentage of frames where either a GPE or a VDE is observed.

FFE = \frac{\sum_t \left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\,[v_t]\,[v'_t] + \left[\,v_t \neq v'_t\,\right]}{T}
• Mel Cepstral Distortion (Kubichek, 1993) - a measure of how different two sequences of mel cepstra are. It is used in assessing the quality of parametric speech synthesis systems. Here c_{t,k}, c'_{t,k} are the k-th mel frequency cepstral coefficients (MFCCs) of the t-th frame from the reference and predicted audio. We sum the squared differences over the first K MFCCs.

MCD_K = \frac{1}{T} \sum_{t=0}^{T-1} \sqrt{\sum_{k=1}^{K} \left(c'_{t,k} - c_{t,k}\right)^2}   (1)
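A NumPy sketch of these four metrics is given below. Boolean voicing arrays and MFCC matrices with frames along the first axis are assumed, and whether the 0-th cepstral coefficient is excluded, as well as the scaling constant some MCD definitions use, are left open as in the text above.

```python
import numpy as np

def gpe(p_ref, p_pred, v_ref, v_pred):
    """Gross Pitch Error: fraction of mutually voiced frames whose pitch error exceeds 20%."""
    voiced = v_ref & v_pred
    errors = (np.abs(p_ref - p_pred) > 0.2 * p_ref) & voiced
    return errors.sum() / max(voiced.sum(), 1)

def vde(v_ref, v_pred):
    """Voicing Decision Error: fraction of frames with a wrong voiced/unvoiced decision."""
    return np.mean(v_ref != v_pred)

def ffe(p_ref, p_pred, v_ref, v_pred):
    """F0 Frame Error: fraction of frames with either a GPE or a VDE error."""
    gpe_frames = (np.abs(p_ref - p_pred) > 0.2 * p_ref) & v_ref & v_pred
    vde_frames = v_ref != v_pred
    return np.mean(gpe_frames | vde_frames)

def mcd(c_ref, c_pred, K=13):
    """Mel Cepstral Distortion over the first K MFCCs (frames along axis 0),
    in the per-frame Euclidean form given above."""
    diff = c_pred[:, :K] - c_ref[:, :K]
    return np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```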
The generated outputs of the FSM-SS architecture are available at the link in footnote 1.
2 Model Hyperparameters
Hyperparameter FSM-SS
Phoneme Embedding Dimension 256
encoder layer 4
encoder attention head 2
encoder hidden 256
decoder layer 4
decoder attention head 2
decoder hidden 2
Encoder/Decoder conv1d filter size 1024
Encoder/Decoder Conv1D Kernel size 9
Encoder/Decoder Dropout 0.2
variance predictor filter size 256
variance predictor kernel size 3
variance predictor dropout 0.5
Table 1: Hyperparameters of FSM-SS
Hyperparameter Conv
Pitch embedding 1
frequency embedding 1
speaker embedding (SE) 256
conv1d kernel size 9
Pitch/frequency initial conv1d filter size 512
Pitch/frequency gamma and beta conv1d filter size 256
Speaker embedding initial conv1d filter size 512
Speaker embedding gamma and beta conv1d filter size 256
Table 2: Hyperparameters of the convolution based normalization architecture of FSM-SS
1Generated audios : https://sites.google.com/view/fsmss/home
Hyperparameter Attention
Pitch embedding 1
frequency embedding 1
speaker embedding(SE) 256
Concatenated pitch/frequency/SE embedding 258
conv1d kernel size 9
attention head 6
gamma and beta conv1d filter size 256
Table 3: Hyperparameters Multi head attention based Nor-
malization architecture of FSM-SS
3 Experimental Results
We have computed the evaluation metrics for the few shot approach on the VCTK dataset. Table 4 shows that the metrics decrease with an increasing number of shots for the convolution based normalization in FSM-SS.
Shots MCD↓ GPE↓ VDE↓ FFE↓
1-shot 13.42 27.56 17.52 34.44
2-shot 11.78 27.08 17.18 33.38
3-shot 10.91 26.04 16.13 32.41
4-shot 10.12 25.34 15.71 30.56
5-shot 09.78 24.45 14.47 28.90
Table 4: Metrics for the few shot approach with convolution based normalization in FSM-SS on the VCTK dataset
Table 5 shows that the few shot approach for attention
based normalization in FSM-SS has improved the evaluation
metrics. The errors have decreased with the increase in the
number of shots.
Shots MCD↓ GPE↓ VDE↓ FFE↓
1-shot 14.12 29.44 19.39 37.11
2-shot 13.39 28.66 18.06 35.38
3-shot 12.09 26.89 18.22 33.74
4-shot 11.77 26.03 17.21 32.91
5-shot 11.18 25.91 16.12 31.23
Table 5: Metrics for the few shot approach with attention based normalization in FSM-SS on the VCTK dataset
We have computed the evaluation metrics to see the effect of the ratio of the reference speech length to the input speech length in the zero shot approach with convolution based normalization in FSM-SS. Table 6 shows that the ratio range of 0.8 to 1.3 has the lowest metrics compared to the other ratio ranges. The errors increase for higher and lower ratio ranges, but all ranges have reasonably low metrics and the generated speech is of good quality.
4 Ablation Study
We have done the ablation study for the few shot approach on the VCTK dataset with the convolution architecture of FSM-SS. The errors decrease with an increase in the number of shots.
Ratio-range MCD↓ GPE↓ VDE↓ FFE↓
0.3-0.8 14.54 29.45 19.82 37.17
0.8-1.3 11.53 26.73 16.14 33.91
1.3-1.7 13.97 29.31 18.18 36.71
1.7-2.3 15.31 30.34 20.59 39.47
2.3-5 16.51 30.96 17.16 38.79
5-8 16.71 30.51 17.14 38.09
Table 6: Metrics for the zero shot approach with convolution based normalization in FSM-SS on the VCTK dataset. Ratio-range denotes the range of the ratio of the reference speech length to the input speech length.
Method MCD↓GPE↓VDE↓FFE↓
BM+P+E 20.91 40.04 44.56 65.46
BM+SE 19.78 37.98 39.90 56.05
BM+SE+P 16.20 33.68 29.94 45.72
FSM-SS 13.42 27.56 17.52 34.44
Table 7: Ablation Study of FSM-SS on 1 shot approach
Method MCD↓GPE↓VDE↓FFE↓
BM+P+E 19.51 39.84 43.52 62.34
BM+SE 19.20 36.34 39.42 54.42
BM+SE+P 14.38 31.22 20.86 40.24
FSM-SS 11.78 27.08 17.18 33.38
Table 8: Ablation Study of FSM-SS on 2 shot approach
Method MCD↓GPE↓VDE↓FFE↓
BM+P+E 18.78 39.12 42.88 61.31
BM+SE 18.04 35.45 37.89 51.45
BM+SE+P 13.77 29.43 18.91 37.65
FSM-SS 10.91 26.04 16.13 32.92
Table 9: Ablation Study of FSM-SS on 3 shot approach
Method MCD↓GPE↓VDE↓FFE↓
BM+P+E 17.11 37.37 40.85 56.43
BM+SE 16.49 33.86 36.43 46.32
BM+SE+P 13.01 28.65 18.76 35.76
FSM-SS 10.12 25.34 15.71 30.56
Table 10: Ablation Study of FSM-SS on 4 shot approach
Method MCD↓GPE↓VDE↓FFE↓
BM+P+E 16.43 37.67 39.42 53.87
BM+SE 15.85 32.69 34.61 44.59
BM+SE+P 12.38 26.79 16.02 31.45
FSM-SS 09.78 24.45 14.47 28.90
Table 11: Ablation Study of FSM-SS on 5 shot approach
5 Evaluation Speakers
5.1 Test speakers for the VCTK dataset
Table 12 shows the test speakers of the VCTK dataset (Veaux et al., 2017).
Speaker ID Gender Nationality
229 Female English
238 Female Northern Irish
266 Female Irish
228 Female English
231 Female English
288 Female Irish
243 Male English
245 Male Irish
251 Male Indian
275 Male Scottish
273 Male English
Table 12: Test speakers from the VCTK dataset
5.2 Test speakers for the LibriTTS dataset
Table 13 shows the test speakers of the LibriTTS dataset (Panayotov et al., 2015).
Speaker ID Gender
6829 Female
9026 Female
8975 Female
6696 Female
192 Female
557 Male
1355 Male
176 Male
3144 Male
4345 Male
1065 Male
Table 13: Test speakers from the LibriTTS dataset
6 MOS Interface
Figure 1 shows the MOS interface used to rate the quality of speech generated by the proposed FSM-SS.
Figure 1: MOS interface for calculating the MOS score of the proposed method.
References
A. Cheveigné and H. Kawahara. Yin, a fundamental fre-
quency estimator for speech and music. The Journal of the
Acoustical Society of America, 111:1917–30, 05 2002. doi:
10.1121/1.1458024.
R. Kubichek. Mel-cepstral distance measure for objective
speech quality assessment. In Proceedings of IEEE Pa-
cific Rim Conference on Communications Computers and
Signal Processing, volume 1, pages 125–128 vol.1, 1993.
T. Nakashika, T. Takiguchi, and Y. Minami. Non-parallel
training in voice conversion using an adaptive restricted
boltzmann machine. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing, 24(11):2032–2045,
2016.
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Lib-
rispeech: An asr corpus based on public domain audio
books. In 2015 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pages 5206–
5210, 2015.
C. Veaux, J. Yamagishi, and K. Macdonald. Cstr vctk corpus:
English multi-speaker corpus for cstr voice cloning toolkit.
2017.
Wei Chu and A. Alwan. Reducing f0 frame error of f0
tracking algorithms under noisy conditions with an un-
voiced/voiced classification frontend. In 2009 IEEE In-
ternational Conference on Acoustics, Speech and Signal
Processing, pages 3969–3972, 2009.