Lightweight Embeddings for Speaker Verification
Maxim Tkachenko 1, Alexander Yamshinin 1, Mikhail Kotov 1, and Marina Nastasenko 2
1 ASM Solutions LLC, Moscow, Russia
2 Master synthesis LLC, Moscow, Russia
{m.tkachenko,a.yamshinin,kotov}@asmsolutions.ru, marina.nastasenko@gmail.com
Abstract. This paper presents a speaker verification (SV) system that uses deep neural networks with hash representations (binarization) of embeddings. Training is performed on the NIST SRE train set, and verification is performed on the test set of the same corpus. The system architecture is based on deep recurrent layers with an attention mechanism. Semi-hard triplet selection is used for the training procedure. The final layer of the neural network is a tanh function, which makes end-to-end training of the hash representation possible. As a consequence, such a system reduces the embedding memory size by a factor of 32 and speeds up system evaluation. An equal error rate (EER) comparable to that of embeddings without binarization is achieved.
Keywords: Hash · Embeddings · Binarization · Neural networks · Speaker verification.
1 Introduction
SV is the process of confirming a speaker's identity using the speech and voice characteristics captured by a sound recording device. There are two main subclasses of SV: text-dependent and text-independent tasks. In this paper we consider the latter.
An SV system consists of three steps: background model creation (development), obtaining embeddings from the background model to represent new speakers (enrollment), and the verification step, in which each speaker embedding is matched against a test audio signal. Formally, the embedding is a one-dimensional vector that represents the speaker.
Although many approaches have reached state-of-the-art performance over the past years, SV is still an actively developing task. The traditional and classic realization of SV entails using i-vectors as embeddings [7] and probabilistic linear discriminant analysis as the comparison step between enrollment and different utterances [8, 9]. An i-vector is a high-dimensional vector that encodes speaker identity together with other utterance-level variability. The sufficient statistics that produce i-vectors are computed from a Gaussian Mixture Model-Universal Background Model (GMM-UBM), which takes a sequence of feature vectors as input (e.g., mel-frequency cepstral coefficients, MFCC).
The next evolutionary step for SV systems is end-to-end neural networks that combine all three steps during training [1, 2]. In this paper, we extend end-to-end speaker embedding systems with hash-embeddings. First, a deep neural network is used to extract frame-level features from utterances. Then an attention mechanism over the recurrent neural network (RNN) frames generates speaker embeddings. The model is trained using the triplet loss [3], which minimizes the distance between embedding pairs of the same speaker and maximizes the distance between pairs of different speakers. Pre-training the network with the l2 triplet loss and without a hash layer improved the performance of the hash-embeddings.
2 General Embeddings
2.1 Model
Audio is converted to logarithmic FBanks [5], normalized to zero mean and unit variance for each file, and passed as input to the neural network. Right after the input layer we use a 2D convolution. It reduces dimensionality in both the frequency and time domains, accelerating subsequent computations. Following the convolutional layer is a bidirectional gated recurrent unit (GRU) [11] layer, recurrent in the time dimension.
We use recurrent networks because they work well for speech recognition [10]. The GRU is comparable to an LSTM [4] with a properly initialized forget gate bias [12]. We conducted several experiments and discovered that the GRU works slightly better than the LSTM for the same number of cell units, and the GRUs were faster to train.
After the GRU layer, we apply the attention mechanism [13]. This approach is widely used in machine translation. We made a simplified version because we do not need an encoder-decoder scheme as in [14]. Our technique applies self-attention to the most important moments of the GRU outputs, so it can be used instead of averaging or pooling to obtain a vector that summarizes all the information about the frames in the utterance:
u_t = tanh(W h_t + b)    (1)

α_t = exp(u_t^T u) / Σ_t exp(u_t^T u)    (2)

v = Σ_t α_t h_t    (3)

where h_t ∈ R^D is the GRU hidden state at step t, W ∈ R^(A×D) is the attention matrix, b ∈ R^A is the attention bias, u ∈ R^A is the context vector, and α_t ∈ [0, 1] is the weight for the GRU hidden state at step t.
After unit-length normalization is applied to the attention output, we pass the obtained embeddings into the l2 triplet loss.
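To make equations (1)-(3) concrete, here is a minimal NumPy sketch of the attention pooling; the shapes and random values are illustrative assumptions rather than trained parameters.

```python
import numpy as np

def attention_pool(h, W, b, u):
    """Self-attention pooling over GRU outputs, eqs. (1)-(3).
    h: (T, D) GRU hidden states, W: (A, D), b: (A,), u: (A,)."""
    ut = np.tanh(h @ W.T + b)                 # eq. (1), shape (T, A)
    scores = ut @ u                           # unnormalized attention scores, shape (T,)
    alpha = np.exp(scores - scores.max())     # eq. (2), numerically stabilized softmax
    alpha /= alpha.sum()
    return (alpha[:, None] * h).sum(axis=0)   # eq. (3), utterance summary vector v

# Illustrative shapes: T=100 frames, D=512 (bi-GRU output), A=256 attention units.
rng = np.random.default_rng(0)
h = rng.standard_normal((100, 512))
W, b, u = rng.standard_normal((256, 512)), np.zeros(256), rng.standard_normal(256)
v = attention_pool(h, W, b, u)
v /= np.linalg.norm(v)   # unit-length normalization before the l2 triplet loss
```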
Fig. 1. General and lightweight embeddings architectures of proposed systems.
Fig. 2. Triplet loss minimizes distance between an anchor and a positive and maximizes
it between an anchor and a negative. Before (left) and after (right) training.
2.2 Loss Function
We model the probability that embeddings x_i and x_j belong to the same speaker by their distance, which allows us to use a triplet loss function as in [3, 15]:

distance(x_i, x_j) = ||x_i − x_j||_2^2    (4)

where x_i and x_j are embeddings.
The triplet loss operates on three input samples: an anchor (an embedding for a target speaker), a positive example (another embedding for the same target speaker), and a negative example (an embedding for another speaker). The loss function must drive the neural network so that the distance between the anchor and the negative example becomes larger than the distance between the anchor and the positive example:
||x_i^a − x_i^p||_2^2 + α < ||x_i^a − x_i^n||_2^2    (5)

where x_i^a, x_i^p, x_i^n are the anchor, positive, and negative embeddings respectively, and α is a constant margin between positives and negatives. Thus, the cost function can be the following:

cost = Σ_{i=0}^{N} max(||x_i^a − x_i^p||_2^2 − ||x_i^a − x_i^n||_2^2 + α, 0)    (6)

where N is the number of all possible triplets.
A separate task is to find hard negatives for each (anchor, positive) pair. At the beginning of training almost all negatives are hard for the network, but step by step it distinguishes anchors from negatives better and better, and randomly selected negatives give no benefit. To solve this, we preselect negatives using the latest network model after each update step during training.
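The following NumPy sketch illustrates the cost (6) together with a semi-hard negative preselection on the current model's embeddings; it shows the idea only, not the authors' exact selection code, and the margin value is an assumption.

```python
import numpy as np

def triplet_cost(anchor, positive, negative, alpha=1.0):
    """Eq. (6) for batches of unit-length embeddings of shape (B, D)."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(d_ap - d_an + alpha, 0.0).sum()

def preselect_negative(anchor, d_ap, candidates, cand_speakers, speaker, alpha=1.0):
    """Semi-hard selection: prefer a different-speaker embedding that is farther
    than the positive but still inside the margin; otherwise take the hardest one."""
    cands = candidates[cand_speakers != speaker]
    d_an = np.sum((cands - anchor) ** 2, axis=1)
    semi = np.where((d_an > d_ap) & (d_an < d_ap + alpha))[0]
    idx = semi[np.argmin(d_an[semi])] if semi.size else np.argmin(d_an)
    return cands[idx]
```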
2.3 Configuration
The input signal is 8 kHz mono. Audio frames are logarithmic 64-dimensional FBanks [5] computed with a 256-point STFT (short-time Fourier transform), a hop size of 0.01 s, and a window size of 0.032 s. The convolutional layer consists of 16 filters with a 10 x 10 kernel and a 3 x 3 stride. The forward and backward GRU layers each have 256 units. The attention has 256 units.
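Putting this configuration together, a rough tf.keras sketch of the general-embedding network (log-FBank input, 2D convolution, bidirectional GRU, attention, unit-length normalization) is shown below; the layer sizes follow the text, while details such as padding, activations, and the exact attention implementation are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_general_embedding_model(n_frames=1000, n_fbanks=64, attention_units=256):
    feats = layers.Input(shape=(n_frames, n_fbanks, 1))       # log-FBanks, ~10 s of audio
    x = layers.Conv2D(16, kernel_size=10, strides=3,
                      padding="same", activation="relu")(feats)  # 16 filters, 10x10 kernel, 3x3 stride
    t, f, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((t, f * c))(x)                          # back to a (time, features) sequence
    h = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)  # 256 units per direction

    # Attention pooling, eqs. (1)-(3)
    u_t = layers.Dense(attention_units, activation="tanh")(h)  # tanh(W h_t + b)
    scores = layers.Dense(1, use_bias=False)(u_t)              # u^T u_t
    alpha = layers.Softmax(axis=1)(scores)                     # weights over time
    v = layers.Lambda(lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([h, alpha])

    emb = layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=1))(v)  # unit-length embedding
    return tf.keras.Model(feats, emb)

model = build_general_embedding_model()
model.summary()
```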
3 Lightweight Hash-Embeddings
3.1 Model
The goal of the lightweight hash-embeddings model is to learn a nonlinear hash function f: x → y ∈ {−1, 1}^K from the input space R^(T×F) (T is time, F is the feature dimension) to the Hamming space {−1, 1}^K using recurrent neural networks, encoding each feature sequence into a compact K-bit binary hash code y = f(x) such that the distances between embeddings in the given triplets are preserved in the compact hash codes.

Based on this, the neural network should output values from {−1, 1}^K. However, discrete values and conditional operators are not allowed inside the neural network graph because they are not differentiable. The solution is to approximate {−1, 1} with the tanh function (see Fig. 3).
Fig. 3. Sign approximation using tanh function.
The lightweight architecture is based on the general embeddings model, but it replaces the unit-length normalization with a dense (fully connected) layer and a tanh activation (see Fig. 1). Experiments with the number of neurons in this dense layer lead to the significant outcomes reported in the "Results" section. As mentioned in [16], the tanh activation should be y = tanh(βz). Due to the l1 norm in the triplet loss we avoided problems with β optimization and discarded it in favor of y = tanh(z). Once the embedding y is obtained, it is binarized and passed to the scoring procedure:

b_i = +1 if y_i > 0, and b_i = −1 if y_i < 0,

where b is the binarized hash-embedding, i ∈ [0, K], and K is the embedding dimensionality.
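A minimal sketch of this binarization step; mapping an exactly zero value to +1 is an implementation choice of ours, not something specified in the text.

```python
import numpy as np

def binarize(y):
    """Map tanh outputs y in (-1, 1)^K to a hash-embedding b in {-1, +1}^K."""
    return np.where(y > 0, 1, -1).astype(np.int8)

y = np.tanh(np.random.default_rng(0).standard_normal(256))  # illustrative 256-dim tanh output
b = binarize(y)                                              # e.g. array([ 1, -1, ...], dtype=int8)
```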
We tried to train this model from scratch, but the best results were obtained with the pretraining described in the previous section, "General Embeddings".
3.2 Loss Function
The Hamming distance between x, y ∈ {−1, 1}^K is defined as Σ_{i=0}^{K} |x_i − y_i|. Therefore, it is expedient to use the l1 distance in the loss function. We ran many experiments with different distances and confirmed that this one performs best for this task:

distance(x_i, x_j) = |x_i − x_j|    (7)
cost = Σ_{i=0}^{N} max(|x_i^a − x_i^p| − |x_i^a − x_i^n| + α, 0)    (8)

Note that the range of α differs for l1 in contrast to l2 with unit-length normalization. For l2 it is from 0 to 2 (e.g., 1.0), but for l1 it is from 0 to half the hash-embedding dimensionality (e.g., 256).
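For intuition, each differing coordinate of two ±1 vectors contributes 2 to the l1 distance, so ranking by l1 is equivalent to ranking by the number of disagreeing bits. A small NumPy sketch of the l1 triplet cost (8) follows; the margin value is illustrative (the text uses 64, 128, and 256 for the 256-, 512-, and 1024-dimensional hashes).

```python
import numpy as np

def l1_distance(x, y):
    return np.abs(x - y).sum(axis=-1)           # eq. (7)

def l1_triplet_cost(a, p, n, alpha=128.0):      # alpha within [0, K/2], per the text
    return np.maximum(l1_distance(a, p) - l1_distance(a, n) + alpha, 0.0).sum()

# On binarized +-1 vectors the l1 distance is twice the number of disagreeing bits:
rng = np.random.default_rng(1)
x = np.where(rng.standard_normal(512) > 0, 1, -1)
y = np.where(rng.standard_normal(512) > 0, 1, -1)
assert l1_distance(x, y) == 2 * np.sum(x != y)
```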
3.3 Configuration
The configuration is the same as described in Section 2.3, except for the last layer. We tried 256, 512, and 1024 for the number of dense neurons at the end and 64.0, 128.0, and 256.0 for α, respectively. The results of these runs are described below.
4 Experiment Setup
4.1 Dataset
The power of neural networks is revealed when they are fed with large amounts of data. Several NIST corpora were mixed: NIST Speaker Recognition Evaluation (SRE) 2004, 2005, 2006, 2008, and 2010 [17]. In total this amounts to more than 250 GB of 8 kHz mono audio from different channels (GSM, microphone, etc.). Our training set consists of 2826 speakers, the validation set includes 149, and the final test uses 330 speakers.
4.2 Data Treatment
On the one hand, current GPUs do not have enough memory to store all audio features, so the data must be split into equal-sized pieces. On the other hand, GPUs use batch processing, which needs data of the same length to reach the best performance. During parameter optimization the optimal audio duration was found to be 10 seconds. Thus, long files are divided into 10-second pieces of equal length using a VAD (Voice Activity Detector). So the training, validation, and test sets contain truncated lengths.
We also found that processing whole files can boost performance by up to 5-10% relative. There are two common ways to evaluate full-length signals: (a) split the signal into 10-second frames and average the obtained embeddings, or (b) sort files by length and feed them as small batches (to prevent out-of-memory errors and speed up calculation) with an advanced model architecture in which the durations are used to stop the RNN with the attention mechanism at the trailing zero-frames. We present results from (b) in this paper.
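For reference, a minimal sketch of option (a): splitting a long utterance into 10-second chunks and averaging the chunk embeddings. Here embed_fn is a placeholder for any model that maps a fixed-length chunk of frames to an embedding; it is not a function from the paper's code.

```python
import numpy as np

def full_length_embedding(frames, embed_fn, chunk_frames=1000):
    """frames: (T, F) VAD-filtered frames; 1000 frames ~ 10 s with a 0.01 s hop.
    Assumes the utterance contains at least one full chunk."""
    chunks = [frames[i:i + chunk_frames]
              for i in range(0, len(frames) - chunk_frames + 1, chunk_frames)]
    embs = np.stack([embed_fn(c) for c in chunks])   # one embedding per 10 s chunk
    emb = embs.mean(axis=0)                          # average over chunks
    return emb / np.linalg.norm(emb)                 # re-normalize to unit length
```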
4.3 Training
The Adam optimizer is applied to train the neural models [6]. It uses first-order stochastic gradient descent updates with adaptive estimates of lower-order moments. The learning rate is 0.002, decreasing to 0.00002 by the end of training. The batch size depends on the file sampling procedure for the triplet loss function: we select N files for each of M speakers to make the number of positive and negative samples in the batch representative [15]. Our GPU memory capacity is not enough to keep many speakers, so we stopped at the maximum available values N = 5 and M = 90, which means the batch size is 450.
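A sketch of the file sampling that builds one 450-element batch (N = 5 files for each of M = 90 speakers); the per-speaker file dictionary is a placeholder data structure of ours.

```python
import random

def sample_batch(files_by_speaker, n_files=5, m_speakers=90, seed=None):
    """files_by_speaker: dict speaker_id -> list of utterance paths.
    Assumes every speaker has at least n_files utterances."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(files_by_speaker), m_speakers)
    batch = [(spk, path)
             for spk in speakers
             for path in rng.sample(files_by_speaker[spk], n_files)]
    rng.shuffle(batch)
    return batch                # 5 * 90 = 450 (speaker, file) pairs
```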
Our hardware is an Intel Core i7, Nvidia GeForce 1080 Ti GPUs, an SSD, and 128 GB RAM. Each epoch takes about two hours. The training procedure takes about 10-20 epochs until convergence is reached.
All code is written in Python using TensorFlow. As the experiment repository we use Testarium [18]. To simplify the TensorFlow pipeline we developed a tiny framework, TfMicro [19].
5 Results
Our most significant experimental results are presented in Fig. 4. As can be seen from Table 1, the hash-embedding system outperforms the general embeddings, due to the expanded network parameters, when it runs with the final 1024-unit dense layer. Note that the threshold values depend on the α used in the triplet loss.
Table 1. Results for hash-embeddings and general embeddings.

System                                        EER, %  minDCF  Threshold
General embeddings
  Bidirectional GRU 256 (model for pretrain)   6.00   0.474     0.28
Hash-embeddings
  Pretrain, dense 256                          6.41   0.493      101
  Pretrain, dense 512                          6.10   0.442      196
  Pretrain, dense 1024                         5.89   0.413      374
  No pretrain, dense 1024                      5.93   0.423      376
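The EER values in Table 1 are the operating points where the false-reject rate equals the false-alarm rate. As a side note, a minimal sketch (our own, not the authors' scoring code) of estimating the EER and its distance threshold from lists of genuine and impostor trial scores:

```python
import numpy as np

def compute_eer(genuine_dist, impostor_dist):
    """EER for distance scores: a trial is accepted when distance < threshold."""
    thresholds = np.sort(np.concatenate([genuine_dist, impostor_dist]))
    frr = np.array([(genuine_dist >= t).mean() for t in thresholds])  # false rejects
    far = np.array([(impostor_dist < t).mean() for t in thresholds])  # false alarms
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2, thresholds[i]
```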
As for the attention mechanism, it keeps only the necessary pieces in time, as can be seen in Fig. 6.

Fig. 4. False alarm and false reject plots.

Fig. 5. Training losses for the 256, 512, and 1024 hash-embedding models.

Fig. 6. The attention mechanism in action. Notably, the attention weights are similar to a VAD output. The green rectangle across the images shows how the attention discards unwanted chunks.

6 Conclusions

In this paper we present novel end-to-end speaker hash-embeddings. The proposed system demonstrates excellent performance with hash-embeddings. Such embeddings reduce memory usage and speed up scoring by up to
32 times. For example, one embedding vector with 256 dimensions in the general system takes 256 x 4 bytes/float = 1024 bytes. With hash-embeddings it is possible to store bits instead of floats, which amounts to 32 bytes in total. As a consequence, hash-embeddings reduce the number of operations by the same factor. Besides that, the processor operates on integer values rather than floating-point math, which makes such systems a good starting point for reduced instruction set computers (RISC).
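To illustrate the arithmetic above (256 floats x 4 bytes = 1024 bytes versus 256 bits = 32 bytes), here is a NumPy sketch of packing a ±1 hash-embedding into bytes and scoring with XOR plus a popcount table; it is our illustration, not the deployed code.

```python
import numpy as np

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_hash(b):
    """b: (K,) array of +-1 values; packs to K/8 bytes (256 bits -> 32 bytes)."""
    return np.packbits(b > 0)

def hamming_score(packed_a, packed_b):
    """Number of differing bits, computed with XOR + popcount on packed bytes."""
    return int(_POPCOUNT[np.bitwise_xor(packed_a, packed_b)].sum())

rng = np.random.default_rng(0)
b1 = np.where(rng.standard_normal(256) > 0, 1, -1)
b2 = np.where(rng.standard_normal(256) > 0, 1, -1)
p1, p2 = pack_hash(b1), pack_hash(b2)
print(p1.nbytes)              # 32 bytes instead of 256 * 4 = 1024
print(hamming_score(p1, p2))  # equals np.sum(b1 != b2)
```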
The experiments show that the end-to-end speaker hash-embeddings are comparable in accuracy to the general embeddings system. On the mix of all NIST SRE datasets we achieved 6.00% EER with general embeddings and 5.89% EER with hash-embeddings. To build the hash models we used pretraining from the general embedding model.
In our future work we will focus on advanced attention mechanisms, stacked deep RNN architectures with residual blocks, large and varied input utterance lengths during training, and more effective techniques of negative preselection for the triplet loss.
References
1. Heigold, G., Moreno, I., Bengio, S., Shazeer, N.: End-to-end text-dependent speaker
verification. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP). Shanghai, China (2016)
2. Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., Khudanpur, S.: Neural Network-Based Speaker Embeddings for End-To-End Speaker Verification. In: IEEE Spoken Language Technology Workshop (SLT). San Diego, California (2016)
3. Schroff, F., Philbin, J.: FaceNet: A Unified Embedding for Face Recognition and
Clustering. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 815-823. Boston, MA (2015)
4. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. In: Neural Computation
November 15, vol. 9, no. 8, pp. 1735-1780 (1997)
5. Hinton, G., Deng, L., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke,
V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic mod-
eling in speech recognition. In: IEEE Signal Processing Magazine, vol. 29, no. 6, pp.
82-97. IEEE Press, (2012)
6. Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. In: 3rd Interna-
tional Conference for Learning Representations. San Diego (2015)
7. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor anal-
ysis for speaker verification. In: IEEE Transactions on Audio, Speech, and Language
Processing, vol. 19, pp. 788-798. IEEE Press (2010)
8. Prince, S.J., Elder, J.H.: Probabilistic linear discriminant analysis for inferences
about identity. In: 11th International Conference on Computer Vision (ICCV), pp.
1-8. Rio de Janeiro, Brazil (2007)
9. Cumani, S., Laface, P., Torino, P.: Probabilistic Linear Discriminant Analysis Of
Ivector Posterior Distributions. In: IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP). Vancouver, Canada (2013)
10. Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., Vinyals, O.: Learning the speech front-end with raw waveform CLDNNs. In: 16th Annual Conference of the International Speech Communication Association (INTERSPEECH). Dresden, Germany (2015)
11. Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar (2014)
12. Jozefowicz, R., Zaremba W., Sutskever, I.: An empirical exploration of recurrent
network architectures. In: International Conference on Machine Learning (ICML).
Lille, France (2015)
13. Yang, Z., Yang, D., Dyer Chr., He, X., Smola, A., Hovy, E.: Hierarchical attention
networks for document classification. In: Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies. San Diego, California (2016)
14. Luong, M., Pham, H., Christopher, M.: Effective Approaches to Attention-based
Neural Machine Translation. In: Empirical Methods in Natural Language Processing
(EMNLP). Lisbon, Portugal (2015)
15. Li., Ch., Ma, X., Jiang, B., Li, X., Zhang, X., Liu, X., Cao, Y., Kannan, A., Zhu,
Zh.: Deep Speaker: an End-to-End Neural Speaker Embedding System. In: IEEE
Spoken Language Technology Workshop (SLT). San Diego, California (2016)
16. Cao, Z., Long, M., Wang, J., Yu, P.: HashNet: Deep Learning to Hash by Con-
tinuation. In: IEEE International Conference on Computer Vision (ICCV). Venice,
Italy (2017)
17. NIST SRE. https://www.nist.gov/itl/iad/mig/speaker-recognition
18. Testarium. Research tool. http://testarium.makseq.com
19. TfMicro. Tensorflow binding. http://github.com/makseq/tfmicro