Lightweight Embeddings for Speaker Verification
Maxim Tkachenko1, Alexander Yamshinin1, Mikhail Kotov1, and Marina
Nastasenko2
1ASM Solutions LLC, Moscow, Russia
2Master synthesis LLC, Moscow, Russia
{m.tkachenko,a.yamshinin,kotov}@asmsolutions.ru, marina.nastasenko@gmail.com
Abstract. This paper presents a speaker verification (SV) system using deep neural networks with hash representations (binarization) of embeddings. The training procedure is performed on the NIST SRE training set, and verification is performed on the test set of the same corpus. The system architecture is based on deep recurrent layers with an attention mechanism. Semi-hard triplet selection is used during training. The final layer of the neural network is a tanh function, which makes end-to-end training of the hash representation possible. As a consequence, such a system decreases the embedding memory size by a factor of 32 and increases the evaluation speed. The equal error rate (EER) is on par with that of embeddings without binarization.
Keywords: Hash · Embeddings · Binarization · Neural networks · Speaker verification.
1 Introduction
SV is the process of verifying a speaker's identity using their speech and voice characteristics captured by a sound recording device. There are two main subclasses of SV: text-dependent and text-independent tasks. In this paper we consider the latter.
An SV system consists of three steps: background model creation (development), obtaining embeddings for new speakers using the background model (enrollment), and the verification step, in which each speaker embedding is matched against a test audio signal. Formally, the embedding is a one-dimensional vector which represents the speaker.
Although many approaches have reached state-of-the-art results over the past years, SV is still an actively developing task. The traditional and classic realization of SV entails using i-vectors as embeddings [7] and probabilistic linear discriminant analysis as the comparison step between enrollment and test utterances [8, 9]. An i-vector is a high-dimensional vector that encodes speaker identity together with other utterance-level variability. The sufficient statistics from which i-vectors are computed come from a Gaussian Mixture Model-Universal Background Model (GMM-UBM), which takes a sequence of feature vectors as input (e.g., mel-frequency cepstral coefficients, MFCC).
The next evolutionary step for SV systems is end-to-end neural networks that combine all three steps during training [1, 2]. In this paper, we extend end-to-end speaker embedding systems with hash-embeddings. First, a deep neural network is used to extract frame-level features from utterances. Then an attention mechanism over recurrent neural network (RNN) frames generates speaker embeddings. The model is trained using the triplet loss [3], which minimizes the distance between embedding pairs of the same speaker and maximizes the distance between pairs of different speakers. Pre-training a network on the l2 triplet loss without a hash layer improved the performance of the hash-embeddings.
2 General Embeddings
2.1 Model
Audio is converted to logarithmic FBanks [5], normalized to zero mean and unit variance for each file, and passed as input to the neural network. Right after the input layer we use a 2D convolution. It reduces dimensionality in both the frequency and time domains, which accelerates the subsequent computations. The convolutional layer is followed by a bidirectional gated recurrent unit (GRU) [11] layer, recurrent in the time dimension.
We use recurrent networks because they work well for speech recognition [10]. The GRU is comparable to an LSTM [4] with a properly initialized forget gate bias [12]. We conducted several experiments and discovered that the GRU works slightly better than the LSTM for the same number of cell units, and the GRUs were also faster to train.
After the GRU layer, we apply the attention mechanism [13]. Such an approach is widely used in machine translation. We use a simplified version because we do not need an encoder-decoder scheme as in [14]. Our technique applies self-attention to the most important moments of the GRU outputs, and thereby it can be used instead of averaging or pooling to obtain the vector that summarizes all the information about the frames in the utterance:
u_t = \tanh(W h_t + b)    (1)

\alpha_t = \frac{\exp(u_t^\top u)}{\sum_t \exp(u_t^\top u)}    (2)

v = \sum_t \alpha_t h_t    (3)

where $h_t \in \mathbb{R}^D$ is the GRU hidden state at step $t$, $W \in \mathbb{R}^{A \times D}$ is the attention matrix, $b \in \mathbb{R}^A$ is the attention bias, $u \in \mathbb{R}^A$ is the context vector, and $\alpha_t \in [0, 1]$ is the weight for the GRU hidden state at step $t$.
After unit-length normalization is applied to the attention output, we pass the obtained embeddings into the l2 triplet loss.
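To make the pooling step concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); the function name and explicit shape conventions are ours and are not part of the published implementation (the actual model expresses the same computation as differentiable TensorFlow operations so it can be trained jointly with the GRU).

import numpy as np

def attention_pool(h, W, b, u):
    """Self-attention pooling over GRU outputs, following Eqs. (1)-(3).

    h : (T, D) GRU hidden states for T time steps
    W : (A, D) attention matrix, b : (A,) attention bias, u : (A,) context vector
    Returns v : (D,) utterance-level summary vector.
    """
    ut = np.tanh(h @ W.T + b)            # Eq. (1): per-frame projections u_t
    scores = ut @ u                      # u_t^T u for every frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # Eq. (2): softmax over time
    return alpha @ h                     # Eq. (3): weighted sum of hidden states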
Fig. 1. General and lightweight embeddings architectures of proposed systems.
Fig. 2. Triplet loss minimizes distance between an anchor and a positive and maximizes
it between an anchor and a negative. Before (left) and after (right) training.
2.2 Loss Function
We model the probability that embeddings $x_i$ and $x_j$ belong to the same speaker by the distance between them, which allows us to use the triplet loss function as in [3, 15]:

\mathrm{distance}(x_i, x_j) = \| x_i - x_j \|_2^2    (4)

where $x_i$ and $x_j$ are embeddings.
The triplet loss operates on three input samples: an anchor (an embedding of a target speaker), a positive example (another embedding of the same speaker), and a negative example (an embedding of a different speaker). The loss function must drive the neural network so that the distance between the anchor and the negative example becomes greater than the distance between the anchor and the positive example:
\| x_i^a - x_i^p \|_2^2 + \alpha < \| x_i^a - x_i^n \|_2^2    (5)

where $x_i^a$, $x_i^p$, $x_i^n$ are the anchor, positive, and negative embeddings respectively, and $\alpha$ is a constant margin between positives and negatives. Thus, the cost function can be written as:
\mathrm{cost} = \sum_{i=0}^{N} \max(\| x_i^a - x_i^p \|_2^2 - \| x_i^a - x_i^n \|_2^2 + \alpha,\ 0)    (6)

where $N$ is the number of all possible triplets.
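A minimal NumPy sketch of the cost in Eq. (6); the batch layout and the function name are our assumptions, and the margin default follows the l2 setting mentioned later in Section 3.2.

import numpy as np

def l2_triplet_cost(anchor, positive, negative, alpha=1.0):
    """Triplet cost of Eq. (6) over a batch of N triplets.

    anchor, positive, negative : (N, D) unit-length embeddings
    alpha : margin between positive and negative pairs
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # ||x_a - x_p||_2^2
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # ||x_a - x_n||_2^2
    return np.sum(np.maximum(d_pos - d_neg + alpha, 0.0))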
A separate task is to find hard negatives corresponding to each (anchor, positive) pair. At the beginning of training, almost all negatives are hard for the network. But step by step it distinguishes anchors from negatives better and better, and randomly selected negatives yield no benefit. To solve this, we preselect negatives using the latest network model after each update step during training.
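The paper does not spell out the exact selection rule, so the sketch below uses a semi-hard rule in the spirit of [3]: a negative farther than the positive but still inside the margin, falling back to the closest candidate when no semi-hard negative exists. The function name and fallback behavior are our assumptions.

import numpy as np

def preselect_negatives(anchors, positives, candidates, alpha=1.0):
    """Pick one semi-hard negative per (anchor, positive) pair.

    anchors, positives : (N, D) embeddings from the current model
    candidates : (M, D) embeddings of other speakers' utterances
    """
    d_pos = np.sum((anchors - positives) ** 2, axis=1)                            # (N,)
    d_neg = np.sum((anchors[:, None, :] - candidates[None, :, :]) ** 2, axis=2)   # (N, M)
    chosen = []
    for i in range(len(anchors)):
        semi_hard = np.where((d_neg[i] > d_pos[i]) & (d_neg[i] < d_pos[i] + alpha))[0]
        # closest semi-hard negative, or the overall closest one if none qualifies
        j = semi_hard[np.argmin(d_neg[i, semi_hard])] if semi_hard.size else np.argmin(d_neg[i])
        chosen.append(candidates[j])
    return np.stack(chosen)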
2.3 Configuration
The input signal is 8 kHz mono. Audio frames are logarithmic 64-dimensional FBanks [5] computed from a 256-point STFT (short-time Fourier transform) with a hop size of 0.01 s and a window size of 0.032 s. The convolution layer consists of 16 filters with a 10 x 10 kernel and a 3 x 3 stride. The forward and backward GRU layers include 256 units each. The attention has 256 units.
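As an illustration of this front end, here is a sketch of the feature extraction using librosa; librosa is our choice for the example and the paper does not state which library was used.

import numpy as np
import librosa

def log_fbanks(wav_path):
    """64-dim log-FBank features with the configuration above: 8 kHz audio,
    256-point STFT, 0.032 s window (256 samples), 0.01 s hop (80 samples)."""
    signal, sr = librosa.load(wav_path, sr=8000)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=256, win_length=256, hop_length=80, n_mels=64)
    feats = np.log(mel + 1e-6).T                      # (frames, 64)
    # per-file normalization to zero mean and unit variance (Section 2.1)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)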
3 Lightweight Hash-Embeddings
3.1 Model
The goal of the lightweight hash-embeddings model is to learn a nonlinear hash function $f$ from the input space $\mathbb{R}^{T \times F}$ ($T$ is time, $F$ is the feature dimension) to the Hamming space $\{-1, 1\}^K$ using recurrent neural networks, so that each feature sequence is encoded into a compact $K$-bit binary hash code $y = f(x)$ and the distances between embeddings in the given triplets are preserved in the compact hash codes.
Based on this, the neural network should output values from $\{-1, 1\}^K$. However, discrete values and conditional operators cannot be used inside the neural network graph because they are not differentiable. The solution is an approximation of $\{-1, 1\}$ by the tanh function (see Fig. 3).
Fig. 3. Sign approximation using tanh function.
The lightweight architecture is based on the general embeddings model, but it replaces the unit-length normalization with a dense (fully connected) layer and a tanh activation (see Fig. 1). Experiments with the number of neurons in this dense layer lead to the significant outcomes reported in the "Results" section. As mentioned in [16], the tanh activation should be $y = \tanh(\beta z)$. Due to the l1 norm in the triplet loss we avoided problems with $\beta$ optimization and discarded it in favor of $y = \tanh(z)$. Once the embedding $y$ is obtained, it is binarized and passed to the scoring procedure:

b_i = \begin{cases} +1, & y_i > 0 \\ -1, & y_i < 0 \end{cases}

where $b$ is the binarized hash-embedding, $i \in [0, K]$, and $K$ is the embedding dimensionality.
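A sketch of the head swap and the binarization step; the layer size follows Section 3.3, while the variable names and the mapping of exactly-zero activations to +1 are our assumptions.

import numpy as np
import tensorflow as tf

# Dense + tanh head that replaces the unit-length normalization of the
# general model; 1024 units is the largest configuration from Section 3.3.
hash_head = tf.keras.layers.Dense(1024, activation="tanh", name="hash_head")

def binarize(y):
    """Map tanh outputs y (shape (N, K) or (K,)) to {-1, +1} hash codes.

    The paper leaves y_i == 0 undefined; we map it to +1 here.
    """
    return np.where(y > 0, 1, -1).astype(np.int8)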
We tried to train such a model from scratch, but the best results were obtained with pretraining as described in the previous section, "General Embeddings".
3.2 Loss Function
The Hamming distance between $x, y \in \{-1, 1\}^K$ is defined as $\sum_{i=0}^{K} |x_i - y_i|$. Therefore, it is expedient to use the l1 distance in the loss function. We ran many experiments with different distances and confirmed that this one performs best for this task.
\mathrm{distance}(x_i, x_j) = | x_i - x_j |    (7)

\mathrm{cost} = \sum_{i=0}^{N} \max(| x_i^a - x_i^p | - | x_i^a - x_i^n | + \alpha,\ 0)    (8)
Note that the range of $\alpha$ differs for l1 compared to l2 with unit-length normalization. For l2 it ranges from 0 to 2 (e.g., 1.0), but for l1 it ranges from 0 to half the hash-embedding dimensionality (e.g., 256).
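The l1 variant of the cost mirrors Eq. (8); a sketch follows, with the margin set per the 512-dimensional configuration from Section 3.3. The function name is ours.

import numpy as np

def l1_triplet_cost(anchor, positive, negative, alpha=128.0):
    """l1 triplet cost of Eq. (8); alpha = 128 matches the 512-dim hash setup."""
    d_pos = np.sum(np.abs(anchor - positive), axis=1)   # |x_a - x_p|
    d_neg = np.sum(np.abs(anchor - negative), axis=1)   # |x_a - x_n|
    return np.sum(np.maximum(d_pos - d_neg + alpha, 0.0))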
3.3 Configuration
The configuration is the same as described in Section 2.3, except for the last layer. We tried 256, 512, and 1024 neurons for the final dense layer and 64.0, 128.0, and 256.0 for $\alpha$, respectively. The results of these runs are described below.
4 Experiment Setup
4.1 Dataset
The power of neural networks is revealed when they are fed large amounts of data. Several NIST corpora were mixed: NIST Speaker Recognition Evaluation (SRE) 2004, 2005, 2006, 2008, and 2010 [17]. In total this amounts to more than 250 GB of 8 kHz mono audio from different channels (GSM, microphone, etc.). Our training set consists of 2826 speakers, the validation set includes 149, and the final test uses 330 speakers.
4.2 Data Treatment
On the one hand, current GPUs do not have enough memory to store all the audio features, so the data must be split into equally-sized pieces. On the other hand, GPUs use batch processing, which requires inputs of the same length to reach the best performance. During parameter optimization the optimal audio duration was found to be 10 seconds. Thus, long files are divided into 10-second pieces of equal length using a VAD (Voice Activity Detector), so the training, validation, and test sets consist of truncated lengths.
We also found that processing the whole file can improve performance by up to 5-10% relative. There are two common ways to evaluate full-length signals: (a) split the signal into 10-second frames and average the obtained embeddings; (b) sort files by length and feed them as small batches (to prevent out-of-memory errors and speed up calculation) with an advanced model architecture in which the durations are used to stop the RNN with the attention mechanism at the trailing zero-frames. We present results from (b) in this paper.
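One plausible reading of the splitting step is sketched below, under the assumption that an external VAD supplies per-frame speech decisions; the paper does not specify the VAD or the exact procedure, and dropping the trailing remainder is our simplification.

import numpy as np

def split_utterance(feats, vad_mask, chunk_frames=1000):
    """Keep speech frames and cut them into fixed-length chunks.

    feats : (T, F) frame-level features; vad_mask : (T,) boolean speech decisions.
    With the 0.01 s hop from Section 2.3, 1000 frames correspond to 10 seconds.
    """
    speech = feats[vad_mask]
    n_chunks = len(speech) // chunk_frames
    return [speech[i * chunk_frames:(i + 1) * chunk_frames] for i in range(n_chunks)]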
4.3 Training
The Adam optimizer is applied to train the neural models [6]. It uses first-order stochastic gradient descent updates with adaptive estimates of lower-order moments. The learning rate is 0.002, decreasing to 0.00002 by the end of training. The batch size depends on the file sampling procedure for the triplet loss function: we select N files for each of M speakers to make the number of positive and negative samples in the batch representative [15]. Our GPU memory capacity is not enough to hold many speakers, so we stopped at the maximum available values N=5 and M=90, which means the batch size is 450.
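A sketch of this batch sampling; the handling of speakers with fewer than N recordings (sampling with replacement) is our assumption.

import random

def sample_batch(files_by_speaker, n_files=5, m_speakers=90):
    """Sample N=5 files for each of M=90 speakers, i.e. a batch of 450 utterances.

    files_by_speaker : dict mapping speaker id -> list of utterance files.
    """
    speakers = random.sample(list(files_by_speaker), m_speakers)
    batch = []
    for spk in speakers:
        files = files_by_speaker[spk]
        # sample with replacement if a speaker has fewer than n_files recordings
        picks = random.sample(files, n_files) if len(files) >= n_files \
            else random.choices(files, k=n_files)
        batch.extend((spk, f) for f in picks)
    return batch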
Our hardware is an Intel Core i7, Nvidia GeForce 1080 Ti GPUs, an SSD, and 128 GB of RAM. Each epoch takes about two hours. The training procedure takes about 10-20 epochs until convergence is reached.
All code is written in Python using TensorFlow. As an experiment repository we use Testarium [18]. To simplify the TensorFlow pipeline we developed a tiny framework, TfMicro [19].
5 Results
Our most significant experimental results are presented in Fig. 4. As can be seen from Table 1, the hash-embedding system outperforms the general embeddings when it uses a final 1024-unit dense layer, due to the expanded number of network parameters. Note that the threshold values depend on the $\alpha$ used in the triplet loss.
Table 1. Results for hash-embeddings and general embeddings.
System                                        EER, %   minDCF   Threshold
General embeddings
  Bidirectional GRU 256 (model for pretrain)  6.00     0.474    0.28
Hash-embeddings
  Pretrain, dense 256                         6.41     0.493    101
  Pretrain, dense 512                         6.10     0.442    196
  Pretrain, dense 1024                        5.89     0.413    374
  No pretrain, dense 1024                     5.93     0.423    376
As for the attention, it keeps only the necessary pieces in time, as can be seen in Fig. 6.
6 Conclusions
In this paper we present a novel end-to-end speaker hash-embedding system. The proposed system demonstrates excellent performance with hash-embeddings. It allows such embeddings to be used to reduce memory usage and to speed up scoring by up to 32 times.
Fig. 4. False alarm and false reject plots.
Fig. 5. Training losses for 256, 512 and 1024 hash-embedding models.
Fig. 6. Attention mechanism in action. Note that the attention mechanism assigns weights similar to a VAD. The green rectangle across all the images shows how the attention discards unwanted chunks.
For example, one 256-dimensional embedding vector in the general system takes 256 x 4 bytes per float = 1024 bytes. Storing bits instead of floats reduces this to 32 bytes in total. As a consequence, hash-embeddings reduce the number of scoring operations by the same factor. In addition, the processor operates with integer values rather than floating-point math, which is a good starting point for running such systems on reduced instruction set computers (RISC).
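The storage and scoring arithmetic above can be sketched as follows; np.packbits and XOR/popcount scoring are our implementation choices for illustration, not a description of the paper's code.

import numpy as np

def pack_code(b):
    """Pack a {-1, +1} hash code into bits: a 256-dim code fits in 32 bytes
    instead of the 1024 bytes needed for 32-bit floats."""
    return np.packbits(b > 0)

def hamming_score(packed_a, packed_b):
    """Number of differing bits between two packed codes (XOR + popcount)."""
    return int(np.unpackbits(np.bitwise_xor(packed_a, packed_b)).sum())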
The experiments show that end-to-end speaker hash-embeddings are comparable in accuracy to the general embeddings system. On the mix of all NIST SRE datasets we achieved 6.00% EER with general embeddings and 5.89% EER with hash-embeddings. To build the hash models we initialized from the pretrained general embedding model.
In our future work we will focus on advanced attention mechanisms, stacked deep RNN architectures with residual blocks, larger and more varied input utterance lengths during training, and more effective techniques for preselecting negatives for the triplet loss.
References
1. Heigold, G., Moreno, I., Bengio, S., Shazeer, N.: End-to-end text-dependent speaker
verification. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP). Shanghai, China (2016)
2. Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., Khudanpur, S.: Neural Network-Based Speaker Embeddings for End-To-End Speaker Verification. In: IEEE Spoken Language Technology Workshop (SLT). San Diego, California (2016)
3. Schroff, F., Philbin, J.: FaceNet: A Unified Embedding for Face Recognition and
Clustering. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 815-823. Boston, MA (2015)
4. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. In: Neural Computation
November 15, vol. 9, no. 8, pp. 1735-1780 (1997)
5. Hinton, G., Deng, L., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke,
V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic mod-
eling in speech recognition. In: IEEE Signal Processing Magazine, vol. 29, no. 6, pp.
82-97. IEEE Press, (2012)
6. Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. In: 3rd Interna-
tional Conference for Learning Representations. San Diego (2015)
7. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor anal-
ysis for speaker verification. In: IEEE Transactions on Audio, Speech, and Language
Processing, vol. 19, pp. 788-798. IEEE Press (2010)
8. Prince, S.J., Elder, J.H.: Probabilistic linear discriminant analysis for inferences
about identity. In: 11th International Conference on Computer Vision (ICCV), pp.
1-8. Rio de Janeiro, Brazil (2007)
9. Cumani, S., Laface, P., Torino, P.: Probabilistic Linear Discriminant Analysis Of
Ivector Posterior Distributions. In: IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP). Vancouver, Canada (2013)
10. Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., Vinyals, O.: Learning the speech front-end with raw waveform CLDNNs. In: 16th Annual Conference of the International Speech Communication Association (INTERSPEECH). Dresden, Germany (2015)
11. Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statis-
tical machine translation. In: Conference on Empirical Methods in Natural Language
Processing (EMNLP). Doha, Qatar (2014)
12. Jozefowicz, R., Zaremba W., Sutskever, I.: An empirical exploration of recurrent
network architectures. In: International Conference on Machine Learning (ICML).
Lille, France (2015)
13. Yang, Z., Yang, D., Dyer Chr., He, X., Smola, A., Hovy, E.: Hierarchical attention
networks for document classification. In: Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies. San Diego, California (2016)
14. Luong, M., Pham, H., Manning, C.D.: Effective Approaches to Attention-based
Neural Machine Translation. In: Empirical Methods in Natural Language Processing
(EMNLP). Lisbon, Portugal (2015)
15. Li, Ch., Ma, X., Jiang, B., Li, X., Zhang, X., Liu, X., Cao, Y., Kannan, A., Zhu,
Zh.: Deep Speaker: an End-to-End Neural Speaker Embedding System. In: IEEE
Spoken Language Technology Workshop (SLT). San Diego, California (2016)
16. Cao, Z., Long, M., Wang, J., Yu, P.: HashNet: Deep Learning to Hash by Con-
tinuation. In: IEEE International Conference on Computer Vision (ICCV). Venice,
Italy (2017)
17. NIST SRE. https://www.nist.gov/itl/iad/mig/speaker-recognition
18. Testarium. Research tool. http://testarium.makseq.com
19. TfMicro. Tensorflow binding. http://github.com/makseq/tfmicro