
# Front-End Factor Analysis for Speaker Verification

## Abstract and Figures

This paper extends our previous work, which proposed a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space: within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained a 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
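The cosine scoring with WCCN described in the abstract can be illustrated with a minimal sketch. It assumes the total variability vectors have already been extracted (extraction and the LDA step are omitted here), and the function names `wccn_projection` and `cosine_score`, as well as the small ridge term, are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def wccn_projection(vectors, labels, ridge=1e-6):
    """Return B such that B @ B.T equals the inverse of the average
    within-class (per-speaker) covariance matrix W."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    dim = vectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((dim, dim))
    for c in classes:
        x = vectors[labels == c]
        centered = x - x.mean(axis=0)
        W += centered.T @ centered / len(x)
    W /= len(classes)
    # Assumption: a small ridge keeps W invertible on limited data.
    W += ridge * np.eye(dim)
    # Cholesky factor L of inv(W) satisfies L @ L.T == inv(W).
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(w_target, w_test, B=None):
    """Cosine similarity between two vectors, optionally after
    applying the WCCN projection B.T to both."""
    w_target = np.asarray(w_target, dtype=float)
    w_test = np.asarray(w_test, dtype=float)
    if B is not None:
        w_target = B.T @ w_target
        w_test = B.T @ w_test
    denom = np.linalg.norm(w_target) * np.linalg.norm(w_test)
    return float(w_target @ w_test / denom)
```

In such a setup, a trial would be accepted when `cosine_score` exceeds a decision threshold tuned on development data; no speaker-specific model training is needed at test time, which is what makes this kind of scoring fast.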
... Recently, deep learning has been applied in various fields with tremendous success, and in speaker verification (SV) tasks, deep neural network (DNN)-based embedding learning methods [1,2] exhibit satisfactory results compared to traditional schemes [3,4]. Most embedding extractors perform reliably in clean scenarios but suffer from performance degradation in noisy environments. ...
Preprint
Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by blurring speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noises. However, the denoising process of independent enhancement models not tailored to the SV task can also distort the speaker information included in utterances. We argue that the enhancement network and speaker embedding extractor should be fully jointly trained for SV tasks under noisy conditions to alleviate this issue. Therefore, we proposed a U-Net-based integrated framework that simultaneously optimizes speaker identification and feature enhancement losses. Moreover, we analyzed the structural limitations of using U-Net directly for noise SV tasks and further proposed Extended U-Net to reduce these drawbacks. We evaluated the models on the noise-synthesized VoxCeleb1 test set and VOiCES development set recorded in various noisy scenarios. The experimental results demonstrate that the U-Net-based fully joint training framework is more effective than the baseline, and the extended U-Net exhibited state-of-the-art performance versus the recently proposed compensation systems.
... It is about the prediction of whether the speaker truly made the said utterance or not. This area of speaker verification is elaborately analyzed by cepstrogram and vocal cord excitation [7,16,38]. Furthermore, false rejection is an important issue in the mass data analysis of live-stream voice signals to identify the target speaker to whom a captured voice belongs. If the false accept rate grows, there is a high chance of the convicted speaker being missed from the target list. ...
Article
False accepts and false rejects are the most vulnerable points of the speaker recognition and speaker authentication process. Speaker verification checks whether a specific speaker's speech signal is present in a collected set of utterances. The target set is prepared in advance with the claimed speaker's voice signal as well as the voice signals of other speakers, who act as impostors. Speaker verification follows a one-to-one comparison procedure and concludes, true or false, whether the claimed speaker is present in the target list. Speaker recognition, by contrast, estimates the association of the claimed speaker's utterance with a sub-list of target speakers. In both speaker verification and speaker identification, the decision threshold plays a crucial role. In a real-world scenario, the predicted score needs further adjustment relative to the decision threshold to minimize false accepts and false rejects. This paper shows that environment-based thresholds, namely a lower-level threshold and an upper-level threshold, effectively reduce the equal error rate (EER) when testing environment-specific voice signals. To simulate robust threshold selection for speaker verification, the system was tested on a large set of language-independent voice samples collected from different environments. Performance is analyzed using Detection Error Tradeoff (DET) plots based on the lower-level and upper-level thresholds predicted for each specific environment. These thresholds affect both false acceptance and false rejection in real-time testing of environment-centric utterances.
This work proposes a novel methodology to reduce the equal error rate for environment-specific voice signals. It focuses on environment-based thresholds, namely a lower-level threshold and an upper-level threshold, to reduce the EER (equal error rate) during the testing of environment-specific voice signals. This work helps to enhance the use of audio-visual systems in the forensic domain.
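To make the relationship between the decision threshold, false accepts, false rejects, and the EER concrete, here is a minimal sketch. It assumes plain lists of target and impostor scores and a single global threshold; the function names are hypothetical, and the environment-specific lower-level and upper-level thresholds of the paper above are not modeled.

```python
import numpy as np

def far_frr(target_scores, impostor_scores, threshold):
    """False accept rate and false reject rate at a given threshold.
    A trial is accepted when its score is >= threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    far = float(np.mean(impostor_scores >= threshold))
    frr = float(np.mean(target_scores < threshold))
    return far, frr

def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by scanning every observed score as a
    candidate threshold and keeping the one where FAR and FRR are
    closest. Returns (eer, threshold)."""
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    best = None
    for t in thresholds:
        far, frr = far_frr(target_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, float(t))
    return best[1], best[2]
```

Raising the threshold lowers the false accept rate at the cost of more false rejects, and the EER is the operating point where the two rates meet.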
... Pre-training For PBERT, we adapt the LibriSpeech recipe from Kaldi [27] and ensure that only the train-clean-100 portion is used in the training pipeline, including HMM-GMM training, i-vector extractor training [28,29], and neural acoustic model training. The acoustic model is a TDNN-F network [30] trained with LF-MMI criterion [23]. ...
Preprint
Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance and also the pre-training efficiency, either through decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or performing clustering on the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate significant superiority of our methods to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our pre-trained models also show good transferability in a non-ASR speech task.
... As a type of biometric technology, automatic speaker verification (ASV) [1] has made significant progress in recent years. From early algorithms based on statistical machine learning [2,3,4] to current deep learning based models [5,6,7], ASV systems with increasingly lower equal error rates (EERs) are gradually being used in practice. However, there are various spoofing attacks against ASV systems, including impersonation, replay, text-to-speech (TTS) and voice conversion (VC). ...
... To further understand the effect of these two types of strategies, we conducted analysis on the speaker discriminability of each system. Table 3 shows the speaker discriminability evaluated with between-class within-class variance ratio of speaker embeddings e S [29,30]. The training with SI-loss improves the speaker variance ratio as expected, by introducing an auxiliary speaker identification task. ...
Preprint
Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure the robustness towards enrollment variations. We also introduce a novel training scheme that aims at directly optimizing the worst-case performance by focusing on training with difficult enrollment cases where extraction does not perform well. In addition, we investigate the effectiveness of auxiliary speaker identification loss (SI-loss) as another way to improve robustness over enrollments. Experimental validation reveals the effectiveness of both worst-enrollment target training and SI-loss training to improve robustness against enrollment variations, by increasing speaker discriminability.
... SV systems are developed for various notable applications [1][2][3], such as speaker diarization, bio-metric authentication, and security. Deep neural network (DNN) based models [4][5][6][7][8] are predominantly adopted in current SV systems and lead to appreciable performance gain over conventional models, e.g., GMM-UBM, I-vectors [9][10][11]. Typically these DNN models take a certain form of acoustic features as input and produce neural embeddings that represent speaker-specific information in speech, which are then used for speaker discrimination. ...
Preprint
Mel-scale spectrum features are used in various recognition and classification tasks on speech signals. There is no reason to expect that these features are optimal for all different tasks, including speaker verification (SV). This paper describes a learnable front-end feature extraction model. The model comprises a group of filters to transform the Fourier spectrum. Model parameters that define these filters are trained end-to-end and optimized specifically for the task of speaker verification. Compared to the standard Mel-scale filter-bank, the filters' bandwidths and center frequencies are adjustable. Experimental results show that applying the learnable acoustic front-end improves speaker verification performance over conventional Mel-scale spectrum features. Analysis on the learned filter parameters suggests that narrow-band information benefits the SV system performance. The proposed model achieves a good balance between performance and computation cost. In resource-constrained computation settings, the model significantly outperforms CNN-based learnable front-ends. The generalization ability of the proposed model is also demonstrated on different embedding extraction models and datasets.
... To evaluate the TTS systems, we analysed the objective measures of word error rate (WER) and a speaker-encoder-based cosine similarity over the entire Obama test set. These two measures have recently been found to have a high correlation to the perceptual measures obtained in listening tests [29,30]. The WERs are extracted based on the automatic transcripts provided by the SpeechBrain ASR system [22], while the cosine similarity uses SpeechBrain's speaker embedding network. ...
Preprint
The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed with some of them reaching close-to-natural performances in constrained tasks. In this paper, we tackle a subissue of the text-to-video generation problem, by converting the text into lip landmarks. However, we do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each of its components, while also ensuring the fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system by taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models, and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.
Chapter
In this paper, the influence of temporal phase on the speech signal is demonstrated through different experimental models, notably for speaker verification. Feature extraction is a fundamental block in a speaker recognition system, responsible for obtaining speaker characteristics from the speech signal. The commonly used short-term spectral features accentuate the magnitude spectrum while totally removing the phase spectrum. In this paper, the phase spectrum knowledge is extensively extracted and studied along with the magnitude information for speaker verification. The Linear Prediction Cepstral Coefficients (LPCC) are extracted from the temporal phase of the speech signal, and their scores are fused with Mel-Frequency Cepstral Coefficients (MFCC) scores. The training data are modeled using the state-of-the-art speaker-specific Gaussian mixture model (GMM) and GMM-Universal Background Model (GMM-UBM) for both LPCC and MFCC features. The scores are matched using dynamic time warping (DTW). The proposed method is tested on a fixed pass-phrase with a duration of <5 s in a speech signal. The score-level fusion technique helps reduce the equal error rate (EER) and improves the recognition rate.
Chapter
Natural language interfaces are gaining popularity as an alternative interface for non-technical users. Natural language interface to database (NLIDB) systems have attracted considerable interest recently. They accept a user's query in natural language (NL) and convert it to an SQL query, which is then executed to extract the resulting data from the database. This Text-to-SQL task is a long-standing, open problem, and the standard approach to solving it is to implement a sequence-to-sequence model. In this paper, I recast the Text-to-SQL task as a machine translation problem using sequence-to-sequence-style neural network models. To this end, I have introduced a parallel corpus that I have developed using the WikiSQL dataset. Though a lot of work has been done in this area using sequence-to-sequence-style models, most of the state-of-the-art models use semantic parsing or a variation of it. None of these models' accuracy exceeds 90%. In contrast, my model is based on a very simple architecture: it uses the open-source neural machine translation toolkit OpenNMT, which implements a standard SEQ2SEQ model. Although my model does not outperform the said models in predicting on test and development datasets, its training accuracy is, to the best of my knowledge, higher than that of any existing NLIDB system.
Article
This article presents several techniques for combining support vector machines (SVM) with the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA: the Gaussian Mixture Model supervectors and the speaker and common factors. We found that applying the SVM to the JFA factors gave the best results, especially when within-class covariance normalization is applied to compensate for the channel effect. The results of the new combination are comparable to those of other classical JFA scoring techniques.
Article
It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with the following two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the female voices in the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA. We achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
Conference Paper
Gaussian mixture models (GMMs) with universal background models (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.
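The supervector construction mentioned above (MAP adaptation of the UBM means, then stacking the adapted means) can be sketched as follows. This is a minimal illustration assuming a diagonal-covariance UBM and the common relevance-factor form of mean-only MAP adaptation; the function name and the default relevance factor of 16 are assumptions, not details taken from the paper.

```python
import numpy as np

def map_adapted_supervector(frames, weights, means, variances, relevance=16.0):
    """Adapt UBM means to one utterance and stack them into a supervector.

    frames:    (T, D) acoustic feature vectors
    weights:   (C,)   UBM mixture weights
    means:     (C, D) UBM component means
    variances: (C, D) UBM diagonal covariances
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Log-density of every frame under every diagonal Gaussian: (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))

    # Component posteriors per frame, normalized in a numerically stable way.
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth-order and first-order statistics.
    n = post.sum(axis=0)                       # (C,)
    first = post.T @ frames                    # (C, D)
    data_mean = first / np.maximum(n, 1e-10)[:, None]

    # Mean-only MAP update: interpolate between data mean and UBM mean.
    alpha = (n / (n + relevance))[:, None]
    adapted = alpha * data_mean + (1.0 - alpha) * means

    # Stack the adapted means into a single C*D-dimensional vector.
    return adapted.reshape(-1)
```

Components that receive little data keep means close to the UBM, so supervectors from different utterances differ mainly in the components the speech actually occupied.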
Conference Paper
This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as a feature extractor. In this modeling, the JFA is used to define a new low-dimensional space named the total variability factor space, instead of the separate channel and speaker variability spaces of the classical JFA. The main contribution of this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first is based on Support Vector Machines, and the second uses this kernel directly as a decision score. This last scoring method makes the process faster and less computationally complex than other classical methods. We tested several intersession compensation methods on the total factors and found that the combination of Linear Discriminant Analysis and Within-Class Covariance Normalization achieved the best performance. The fast scoring method based only on the cosine kernel gave the best results, especially for male trials, with an EER of 1.12% and MinDCF of 0.0094 on the English trials of the NIST 2008 SRE dataset.
Chapter
In the history of research of the learning problem one can extract four periods that can be characterized by four bright events: (i) Constructing the first learning machines, (ii) constructing the fundamentals of the theory, (iii) constructing neural networks, (iv) constructing the alternatives to neural networks.
Book
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Article
Despite intuitive expectation and experimental evidence that phonemes contain useful speaker-discriminating information, phoneme-based speaker recognition systems reported so far were not found to perform better than phoneme-independent speaker recognition systems based on the Gaussian Mixture Model (GMM). The paper proposes a new phoneme-based speaker verification technique that uses models obtained by adaptation of well-trained speaker GMMs. The proposed system was found to consistently outperform comparably sized phoneme-independent GMM-based speaker verification systems in experiments held with clean and telephone speech databases.
Article
Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10 (2000), 19–41. In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
Conference Paper
This paper extends the within-class covariance normalization (WCCN) technique described in (1, 2) for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional "PCA space" and a high-dimensional "PCA-complement space." After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement. Index Terms: kernel machines, support vector machines, feature normalization, generalized linear kernels, speaker recognition.