Few Shot Speaker Recognition using Deep Neural Networks
Prashant Anand1, Ajeet Kumar Singh1, Siddharth Srivastava1,2, Brejesh Lall1
1Indian Institute of Technology Delhi, India
2Centre for Development of Advanced Computing, Noida, India

Abstract
The recent advances in deep learning are mostly driven by avail-
ability of large amount of training data. However, availability
of such data is not always possible for specific tasks such as
speaker recognition where collection of large amount of data is
not possible in practical scenarios. Therefore, in this paper, we
propose to identify speakers by learning from only a few train-
ing examples. To achieve this, we use a deep neural network
with prototypical loss where the input to the network is a spec-
trogram. For output, we project the class feature vectors into a
common embedding space, followed by classification. Further,
we show the effectiveness of capsule net in a few shot learning
setting. To this end, we utilize an auto-encoder to learn gener-
alized feature embeddings from class-specific embeddings ob-
tained from capsule network. We provide exhaustive experi-
ments on publicly available datasets and competitive baselines,
demonstrating the superiority and generalization ability of the
proposed few shot learning pipelines.
1. Introduction
Speaker recognition with only a few and short utterances is a
challenging problem. In this paper, we assume that for each
speaker, only a few very short utterances are available. Specifically, we learn from 1 to 5 utterances of 3 seconds each per speaker, unlike earlier work where recordings of up to 10 seconds are considered short utterances. Apart from the computational and technical gains obtained by solving the problem under such constraints, this setting also makes enrolment of speakers easier and more practical.
Recently many approaches have been proposed for speaker
identification using deep neural networks [1,2,3] but they as-
sume availability of large amount of training data. Moreover,
the deep learning pipelines attempting to learn from short utter-
ances [4] are, in general, based on i-vectors or Mel-frequency
cepstral coefficients (MFCCs). While MFCCs are known for
susceptibility to noise [5], the performance of i-vectors tend to
suffer in case of short utterances [6]. Moreover, it has been
shown that convolutional neural networks (CNNs) are able to
mitigate the noise susceptibility of i-vectors, MFCCs [7] and
have been successfully used for speaker recognition [8]. Since
convolutional neural networks are data-hungry, and are able to
exploit structured information such as images very effectively,
recently large scale speaker recognition datasets have been
made publicly available where benchmarks setup on CNNs
with spectrogram as input have been shown to perform very
well [9,10]. However, their effectiveness to learn or generalize
with limited amount of data (few-shots) and short utterances is
not established very well.
Few shot learning paradigms have recently been effectively
applied for audio processing [11,12]. However, their effec-
tiveness to speech processing especially speaker recognition is
still unknown. To this end, we utilize CNN as base network
with spectrogram derived directly from raw audio files as input
and evaluate the effectiveness of these networks in case of con-
strained setting of few shot learning for speaker identification.
We choose VGGNet [13] and ResNet [14] as the base architec-
tures and evaluate them under various settings. For generalizing
them under unseen speakers, we use prototypical loss [15].
While CNNs have been shown to perform very well, they are unable to exploit the spatial relationships within an image. Bae et al. [16] argued that CNNs cannot leverage the spatial information within spectrograms, such as that between pitch and formant. Therefore, they utilized capsule networks [17] for speech command recognition. Based on their work, we argue that exploiting spatial relationships in spectrogram images can lead to better speaker recognition as well. However, there are two problems in applying capsule networks to speaker recognition. First, their applicability to complex data, and hence their generalization ability, is yet to be established [18]. Second, they are extremely computationally intensive. We reduce
the computational complexity by dropping the reconstruction
loss from the default capsule network. Next, we add an auto-
encoder to map the class feature vectors from capsule network
to a common embedding space. The projected feature vector
from the embedding space are then subjected to prototypical
loss to learn from the constrained data. The entire pipeline is
trained end-to-end.
In view of the above, the following are the contributions of this paper:
• To the best of our knowledge, this is the first work that poses
speaker recognition as a few-shot learning problem and ap-
plies convolutional neural networks and capsule network for
speaker recognition under the constraints of short and limited
number of utterances.
• We show that using convolutional neural network having
spectrogram as input and prototypical loss, a speaker can be
identified with high confidence with only a few training samples and very short utterances (3 seconds).
• We propose a novel network based on the Capsule Network by significantly reducing the number of parameters and learning a class embedding using an auto-encoder with a prototypical loss for generalizing the capsule network to unseen speakers. We show that the proposed method performs better than VGG with an equivalent number of parameters, while it lags behind ResNet, which has a significantly higher number of parameters.
• We perform exhaustive experiments on publicly available
datasets and analyze the performance of the considered net-
works under various settings.
The rest of the paper is organized as follows. Section 2 discusses the various steps and the proposed pipeline. In Section 3, results are discussed, while in Section 4, the conclusion is provided.

arXiv:1904.08775v1 [eess.AS] 17 Apr 2019

Figure 1: Flow diagram for few shot learning with deep neural networks. (Input audio → spectrogram → deep convolutional network with convolutional and pooling layers → prototypical loss.)
2. Methodology
Figure 1 shows the pipeline for using spectrograms with deep neural networks. The input of the network is a spectrogram obtained from a raw audio file. For few shot learning, the network may be pre-trained with a large dataset. The task is then to classify new speakers for whom we have a limited number of samples. Here we additionally pose the constraint that, along with the limited number of samples, the duration of each sample is limited to 3 seconds only. For classification, the existing network is fine-tuned with the training samples of the new speakers. However, as demonstrated in the experiments, directly using the embeddings obtained from a pre-trained network causes a significant drop in performance. Therefore, we propose to use the prototypical loss to optimize the embeddings by forming representative prototypes (Section 2.3).
We evaluate two types of networks, viz. Convolutional Neural Networks (VGG, ResNet) and Capsule Networks. Extending CNNs for the prototypical loss is straightforward, as they provide class-agnostic feature embeddings, and hence the prototypes can be learned directly from the feature vectors. However, this is not the case with Capsule Networks, as they learn class-specific embeddings. Therefore, we learn a generalized embedding using an auto-encoder prior to applying the prototypical loss (Section 2.4). We now describe each component in detail.
2.1. Spectrogram Construction
First we convert all audio to single-channel, 16-bit streams at a 16 kHz sampling rate for consistency. The spectrograms are then generated by a sliding window protocol using a Hamming window. The width of the Hamming window is 25 ms with a step size of 10 ms. This gives spectrograms of size 128 (number of FFT features) x 300 for 3 seconds of randomly sampled speech for each audio. Subsequently, each frequency bin is normalized (mean, variance). The spectrograms are constructed from the raw audio input, i.e. no pre-processing such as silence removal is performed.
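The construction above can be sketched with scipy. The paper does not release code, so the window and hop sizes follow the values stated here, while the reduction to 128 frequency bins is our assumption (we simply keep the lowest 128 FFT bins and apply a log before normalization):

```python
import numpy as np
from scipy import signal

def make_spectrogram(audio, sr=16000, win_ms=25, hop_ms=10, n_bins=128):
    """Sliding-window Hamming spectrogram, normalized per frequency bin."""
    nperseg = int(sr * win_ms / 1000)              # 400-sample (25 ms) window
    noverlap = nperseg - int(sr * hop_ms / 1000)   # 10 ms hop
    _, _, spec = signal.spectrogram(audio, fs=sr, window="hamming",
                                    nperseg=nperseg, noverlap=noverlap)
    spec = np.log(spec + 1e-10)
    spec = spec[:n_bins]                           # assumption: keep lowest 128 bins
    # mean/variance normalization of each frequency bin across time
    return (spec - spec.mean(axis=1, keepdims=True)) / \
           (spec.std(axis=1, keepdims=True) + 1e-8)

clip = np.random.randn(3 * 16000)  # 3 s of dummy audio
spec = make_spectrogram(clip)
print(spec.shape)                  # roughly (128, 300) time frames
```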
2.2. Model
2.2.1. Capsule Network
The capsule network proposed by Sabour et al. [17] replaces the scalar output neuron present in Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) with a capsule (a vector output neuron). This enables the capsule network to capture pose information and contain it in a vector. The network is trained by the "dynamic routing between capsules" algorithm. The Capsule Network has a convolutional layer with stride 1, followed by a primary capsule layer and a dense capsule layer. The convolution operation in the primary capsule layer has stride 2. Our modified capsule network, CapsuleNet-M, has stride 6 in both the first convolution layer and the primary capsule layer. The Capsule Network is trained with the margin loss shown below.
L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2    (1)

Here, L_c is the margin loss of class c, v_c is the final output capsule of class c, T_c = 1 iff the target class is c, m^+ = 0.9, m^- = 0.1, and λ = 0.5.
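For concreteness, Eq. (1) can be written as a small numpy function over the capsule lengths ||v_c|| of one sample (a sketch; the paper gives no reference implementation):

```python
import numpy as np

def margin_loss(v_norms, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of Eq. (1): v_norms holds the lengths ||v_c|| of the
    final capsules for one sample; target is the index of the true class."""
    T = np.zeros_like(v_norms)
    T[target] = 1.0  # T_c = 1 iff the target class is c
    loss = (T * np.maximum(0.0, m_pos - v_norms) ** 2
            + lam * (1.0 - T) * np.maximum(0.0, v_norms - m_neg) ** 2)
    return loss.sum()

# a confident, correct prediction incurs zero loss
print(margin_loss(np.array([0.95, 0.05, 0.02]), target=0))  # 0.0
```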
2.2.2. VGG-M and Resnet-34
VGG and Resnet are deep CNN architectures with very good classification performance on image data, so we chose to use these models with slight modifications to adapt them to spectrogram input. The modified VGG and Resnet-34 models were introduced in [9] and [10] respectively. The architectures of VGG-M and Resnet-34 used are specified in Tables 1 and 2.
2.3. Few shot learning using Prototypical Loss
In few shot learning, at test time we have to classify test samples among K new classes given very few labeled examples of each class. We are provided with D-dimensional embeddings x_i ∈ R^D for each input x_i and its corresponding label y_i, where y_i ∈ {1, ..., K}. The objective is to compute an M-dimensional representation, i.e. the prototype of each class, a_k ∈ R^M. The embedding is computed via a function f_φ : R^D → R^M, where f indicates a deep neural network and φ indicates its parameters. The prototype a_k is the mean of the support points for a class.
By using a distance function d, a distribution over classes is learned for a query point q, and is given as

p_φ(y = k | q) = exp(−d(f_φ(q), a_k)) / Σ_{k'} exp(−d(f_φ(q), a_{k'}))    (2)
At training time, we minimize the negative log-probability of the
positive class. The training data is generated by randomly se-
lecting a smaller subset from the training set. Then from each
class, we choose a subset of samples which are considered as
support points while the rest are considered as query points. The
flow diagram for few shot learning is shown in Fig. 1.
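A minimal numpy sketch of the classification rule in Eq. (2), using squared Euclidean distance as d (function and variable names are ours, not from the paper):

```python
import numpy as np

def prototypical_probs(support, support_labels, query, n_classes):
    """Class distribution for a query embedding (Eq. 2): softmax over
    negative squared Euclidean distance to each class prototype."""
    protos = np.stack([support[support_labels == k].mean(axis=0)
                       for k in range(n_classes)])   # a_k: mean of support points
    d = ((protos - query) ** 2).sum(axis=1)          # squared distances to prototypes
    logits = -d
    e = np.exp(logits - logits.max())                # numerically stable softmax
    return e / e.sum()

# toy 2-way 2-shot episode in a 2-D embedding space
support = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
probs = prototypical_probs(support, labels, np.array([0.1, 0.1]), 2)
print(probs.argmax())  # 0
```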
2.4. Projection of Capsule Net class vectors into embedding
space for few shot learning
A problem in extending CapsuleNet to few shot recognition is
that the final layer learns class specific embeddings. This pre-
vents using pre-trained capsule networks or fine tuning them for
Figure 2: Flow diagram for few shot learning with Capsule Networks. (Input audio → capsule network → autoencoder (encoder/decoder) → prototypical loss.)
Table 1: VGG-M architecture.
layer support filt dim #filts stride
conv1 7x7 1 96 2x2
mpool1 3x3 - - 2x2
conv2 5x5 96 256 2x2
mpool2 3x3 - - 2x2
conv3 3x3 256 384 1x1
conv4 3x3 384 256 1x1
conv5 3x3 256 256 1x1
mpool5 5x3 - - 3x2
fc6 9x1 256 4096 1x1
apool6 1xn - - 1x1
fc7 1x1 4096 1024 1x1
fc8 1x1 1024 1251 1x1
Table 2: Modified Resnet-34 architecture.
Layer name Resnet-34
conv1 7x7, 64, stride 2; 3x3, max pool, stride 2
conv2_x [3x3, 64; 3x3, 64] x 3
conv3_x [3x3, 128; 3x3, 128] x 4
conv4_x [3x3, 256; 3x3, 256] x 6
conv5_x [3x3, 512; 3x3, 512] x 3
pool time 4x1, 512, stride 1; 1x10, avg pool, stride 1
fc 1x1, 50
a different number of classes. In order to overcome this, we append an autoencoder to the capsule network which takes as input
the concatenated class vectors from capsule net and learns an
embedding of these class vectors. To adapt it to few shot recog-
nition, we apply the prototypical loss to these embeddings. The
block diagram is shown in Figure 2.
The intuition behind using an auto-encoder is that a concatenation of class vectors represents a distribution of the input over a feature vector. However, these class vectors are obtained for specific classes, so we need an embedding which can generalize over unseen classes as well. Hence, we choose the contractive auto-encoder [19], as we want the embeddings to be similar for similar inputs yet discriminative enough for similar audios not belonging to the same speaker. This is especially useful for the prototypical loss, as the embeddings from the auto-encoder can be compared with a distance function, assisting the formation of prototypes for classes with limited training samples.
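As an illustration of the contractive objective [19] (not the paper's exact architecture), a one-hidden-layer sigmoid encoder admits a closed-form penalty for the Frobenius norm of the encoder Jacobian; all shapes below are toy values chosen for the sketch:

```python
import numpy as np

def contractive_ae_loss(x, W, b, W_dec, b_dec, lam=1e-3):
    """Reconstruction error plus contractive penalty for a one-layer
    sigmoid encoder with a linear decoder."""
    h = 1.0 / (1.0 + np.exp(-(x @ W.T + b)))   # encoder: h = sigmoid(Wx + b)
    x_rec = h @ W_dec.T + b_dec                # linear decoder
    recon = ((x - x_rec) ** 2).sum()
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2  (closed form for sigmoid)
    jac = ((h * (1 - h)) ** 2 * (W ** 2).sum(axis=1)).sum()
    return recon + lam * jac

rng = np.random.default_rng(0)
x = rng.normal(size=8)                         # e.g. concatenated capsule class vectors
W = rng.normal(scale=0.1, size=(4, 8)); b = np.zeros(4)
W_dec = rng.normal(scale=0.1, size=(8, 4)); b_dec = np.zeros(8)
loss = contractive_ae_loss(x, W, b, W_dec, b_dec)
print(loss > 0)  # True
```

The penalty term encourages the encoder to be locally insensitive to input perturbations, which is why similar inputs map to nearby embeddings.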
3. Experiments
3.1. Datasets
We use a subset of Voxceleb1 dataset [9] and VCTK Cor-
pus [20] for few shot speaker recognition.
Voxceleb1: VoxCeleb1 contains over 100,000 utterances for
1,251 speakers extracted from YouTube videos.
Since the dataset does not provide a standard split for few shot recognition, we adopt the following methodology. From the training set, we randomly sample 5 instances of 3-second audio for each speaker. To avoid any overlap in the training data, the sampling is performed on the separate training files provided in the dataset. At test time, we randomly sample a 3-second audio clip from the test files.
VCTK: The VCTK Corpus includes speech data uttered by
109 native speakers of English with various accents. VCTK
corpus contains clearly read speech, while VoxCeleb has
more background noise and overlapping speech. The dataset does not provide a standard train and test split. Therefore, we use a 70:30 train and test split. We use only a randomly sampled 3-second audio clip from each split for training and evaluation purposes.
For future benchmarking and reproducibility, we will make
the train and test split of the above datasets publicly available.
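The N-way K-shot episodes used for few shot training and evaluation can be formed as sketched below; `files_by_speaker` and the clip counts are illustrative, not the released split:

```python
import random

def sample_episode(files_by_speaker, n_way=5, k_shot=1, n_query=1, seed=None):
    """Sample one N-way K-shot episode from a dict {speaker: [audio files]}."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(files_by_speaker), n_way)
    support, query = [], []
    for label, spk in enumerate(speakers):
        clips = rng.sample(files_by_speaker[spk], k_shot + n_query)
        support += [(c, label) for c in clips[:k_shot]]   # support points
        query += [(c, label) for c in clips[k_shot:]]     # query points
    return support, query

# toy directory of 20 speakers with 10 clips each
data = {f"spk{i}": [f"spk{i}_{j}.wav" for j in range(10)] for i in range(20)}
s, q = sample_episode(data, n_way=5, k_shot=1, n_query=2, seed=0)
print(len(s), len(q))  # 5 10
```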
3.2. Results
3.2.1. Speaker Identification with Deep Neural Network
In the first experiment, we analyze the relative performance of
various deep networks when large amount of training data is
available. In Table 3, we show results using vanilla networks
with standard training and test data provided with VoxCeleb1
dataset (not the few-shot setting). We indicate the capsule net architecture without the reconstruction loss as CapsuleNet-M. As capsule networks require significant computational resources for large datasets [21], we test on a subset of VoxCeleb1 when comparing against VGG and ResNet. We use the first 50 and 200 classes of the Voxceleb1 dataset and the standard train, val and test splits given for these classes. We generate the spectrogram for each audio file as explained in Section 2.1 and feed it into the network. It can be seen that ResNet significantly outperforms the other methods while having nearly 3 times the number of parameters of VGG-M and CapsuleNet-M. VGG and CapsuleNet have a comparable number of parameters; however, VGG performs 6% better on average. In addition to the input size of 1x128x300, we also experimented with varying the number of FFT features to 256 and 512, and found that while this significantly increases the number of parameters, it does not lead to any significant boost in performance.
Table 3: Performance of deep networks on the first 50 and 200 classes of the Voxceleb1 dataset with the standard train and test split. Input size is 1x128x300 in each case. NP is the number of parameters.
Architecture 50 Classes 200 Classes
NP Top-1 Acc Top-5 Acc NP Top-1 Acc Top-5 Acc
CapsuleNet-M 8,196,864 67.70 90.99 16,798,464 47.01 71.71
VGG-M 8,291,634 76.70 95.34 8,445,384 58.63 84.17
Resnet-34 22,354,162 90.37 98.13 22,431,112 71.48 88.45
Table 4: Performance of the few shot learning approach with deep neural networks on Voxceleb1.
Architecture 5-way 20-way
1 shot 5 shot 1 shot 5 shot
CapsuleNet-MA 53.62 82.93 42.08 64.72
VGG-M 52.42 82.10 20.75 51.82
Resnet-34 79.97 91.50 48.09 72.77
3.2.2. Effect of Number of Training Samples per Class
We now study the impact of reducing the training samples on
the vanilla networks. As the previous experiment was based on
training with entire training data, this setting allows us to eval-
uate the performance of these networks when the number of
samples for each speaker reduces drastically. We vary the num-
ber of training samples per class and the results are shown in
Figure 3. It can be observed that with 10 shots, the performance of all the networks decreases drastically. However, CapsuleNet-M performs better than both ResNet and VGGNet. Moreover, as expected, the accuracy of all three networks increases with the number of samples, with the Capsule Network consistently performing better than VGG up to 70 samples. This is an interesting observation and indicates that with fewer samples, the capsule network is better able to exploit the structural composition of the input spectrogram. Moreover, with nearly one-third the number of parameters of ResNet, it performs close to ResNet at 10 and 20 shots.
3.2.3. Few Shot Learning with Prototypical Loss
The results on few shot speaker identification are shown in Table 4. CapsuleNet-MA denotes the architecture discussed in Section 2.4. It can be observed that with the prototypical loss, the performance of all the networks increases significantly. Interestingly, all the networks provide significantly better performance than merely reducing the number of training samples (Section 3.2.2). Moreover, the capsule network with the autoencoder outperforms VGG while performing close to ResNet. ResNet provides close to 80% accuracy with 1-shot classification for 5 speakers and nearly 70% with 5-shot classification for 20 speakers. This is important as it indicates that even in heavily constrained settings one can identify a speaker with high confidence.
Table 5: Performance of speaker identification on the VCTK corpus (non few-shot). NP is the number of parameters.
Architecture NP Top-1 Acc Top-5 Acc
CapsuleNet-M 8,196,864 91.95 98.13
VGG-M 8,291,634 95.25 99.45
Resnet-34 22,354,162 96.91 99.91
Table 6: Performance of few shot speaker identification on
VCTK dataset.
Architecture 5-way 20-way
1 shot 5 shot 1 shot 5 shot
CapsuleNet-MA 65.26 91.28 32.45 68.75
VGG-M 54.08 84.29 19.66 54.21
Resnet-34 80.96 96.46 44.95 77.11
Figure 3: Test accuracy on 50 classes of Voxceleb1 for different
networks trained with limited samples per class
3.2.4. Generalization Ability of Networks
To show the generalization ability of the networks, we first take the models trained on 50 classes of Voxceleb1 and fine-tune them on the first 50 classes of the VCTK corpus for the speaker identification task. We report the accuracy in Table 5. We also take models pre-trained on Voxceleb1 for the few shot learning task and use them to perform few shot classification on the VCTK Corpus without fine-tuning, and report these results in Table 6. It can be observed that, with pretrained networks, the method is able to generalize to entirely unseen classes with samples collected using entirely different criteria. It can be observed from Table 5 that, by just fine-tuning the networks, ResNet achieves a top-5 accuracy of 99%, while VGG and CapsuleNet follow similar trends with their accuracies slightly behind ResNet. However, in the case of few shot recognition (Table 6), we notice a significant drop in the performance of VGG, while CapsuleNet-MA performs 10% better than VGG on average.
4. Conclusion
We have shown the effectiveness of few shot learning approaches for speaker identification. We also demonstrated that with deep neural networks, one can identify speakers with high confidence, with ResNet outperforming other techniques by a significant margin. However, the number of parameters for ResNet is comparatively high. On the other hand, CapsuleNet performed better with less data, but applying it to few shot recognition is not trivial. Therefore, we proposed an extension of the Capsule Network that learns a generalized embedding using a contractive auto-encoder. The computed embeddings are used for learning a prototype embedding using the prototypical loss. We believe that this work will accelerate the use of CNNs in practical implementations as well as catalyze research on making the Capsule Network more efficient on large-scale data.
5. References
[1] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan-
pur, “X-vectors: Robust dnn embeddings for speaker recognition,”
in 2018 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur,
“Deep neural network embeddings for text-independent speaker
verification.” in Interspeech, 2017, pp. 999–1003.
[3] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme
for speaker recognition using a phonetically-aware deep neural
network,” in 2014 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1695–
[4] J. Guo, N. Xu, K. Qian, Y. Shi, K. Xu, Y. Wu, and A. Alwan,
“Deep neural network based i-vector mapping for speaker verifi-
cation using short utterances,” Speech Communication, vol. 105,
pp. 92–102, 2018.
[5] X. Zhao and D. Wang, “Analyzing noise robustness of mfcc and
gfcc features in speaker identification,” in 2013 IEEE interna-
tional conference on acoustics, speech and signal processing.
IEEE, 2013, pp. 7204–7208.
[6] A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, and
M. W. Mason, “I-vector based speaker recognition on short ut-
terances,” in Proceedings of the 12th Annual Conference of the
International Speech Communication Association. International
Speech Communication Association (ISCA), 2011, pp. 2341–
[7] M. McLaren, Y. Lei, N. Scheffer, and L. Ferrer, “Application
of convolutional neural networks to speaker recognition in noisy
conditions,” in Fifteenth Annual Conference of the International
Speech Communication Association, 2014.
[8] Z. Liu, Z. Wu, T. Li, J. Li, and C. Shen, “Gmm and cnn hybrid
method for short utterance speaker recognition,” IEEE Transac-
tions on Industrial Informatics, vol. 14, no. 7, pp. 3244–3252,
[9] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-
scale speaker identification dataset,” in INTERSPEECH, 2017.
[10] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep
speaker recognition,” in INTERSPEECH, 2018.
[11] S.-Y. Chou, K.-H. Cheng, J.-S. R. Jang, and Y.-H. Yang, “Learning to match transient sound events using attentional similarity for few-shot sound recognition,” arXiv preprint arXiv:1812.01269.
[12] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice
cloning with a few samples,” in Advances in Neural Information
Processing Systems, 2018, pp. 10 019–10 029.
[13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” CoRR, vol. abs/1512.03385, 2015.
[15] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for
few-shot learning,” in Advances in Neural Information Process-
ing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Asso-
ciates, Inc., 2017, pp. 4077–4087.
[16] J. Bae and D.-S. Kim, “End-to-end speech command recogni-
tion with capsule network,” Proc. Interspeech 2018, pp. 776–780,
[17] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between
capsules,” in Advances in neural information processing systems,
2017, pp. 3856–3866.
[18] E. Xi, S. Bing, and Y. Jin, “Capsule network performance on com-
plex data,” arXiv preprint arXiv:1712.03480, 2017.
[19] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Con-
tractive auto-encoders: Explicit invariance during feature extrac-
tion,” in Proceedings of the 28th International Conference on In-
ternational Conference on Machine Learning. Omnipress, 2011,
pp. 833–840.
[20] C. Veaux, J. Yamagishi, and K. MacDonald, “Superseded - cstr
vctk corpus: English multi-speaker corpus for cstr voice cloning
toolkit,” 2016.
[21] R. Mukhometzianov and J. Carrillo, “Capsnet comparative performance evaluation for image classification,” arXiv preprint arXiv:1805.11195, 2018.