Conference Paper

Analysis of Speaker Recognition Systems in Realistic Scenarios of the SITW 2016 Challenge

... In the recent DIHARD diarization challenge, the latest techniques for these modules were investigated by different research groups [14]-[17]. For SV in a multi-speaker condition, the works of [18], [19] in the SITW challenge evaluation showed that diarization can be useful for the multi-speaker test condition. The results under such a scenario are reported assuming that a specific number of speakers is present in the test trials. ...
... A baseline system with x-vector modeling is initially developed without conducting any speaker clustering on multi-speaker test trials. Then we apply traditional AHC-based clustering [18], [19]. Finally, the proposed framework with speaker cluster refinement using penalty distance is implemented on those two conditions for comparison. ...
Conference Paper
Full-text available
Speaker verification in a multi-speaker environment is an emerging research topic. Speaker clustering, which separates multiple speakers, can be effective if a predetermined threshold or the number of speakers present in a multi-speaker utterance is given. In practice, however, neither of these factors is available. This work proposes to handle such a problem by introducing a penalty distance factor in the pipeline of traditional clustering techniques. The proposed framework first uses traditional clustering techniques to form speaker clusters for a given number of speakers. We then compute the penalty distance based on the Bayesian information criterion, which is used for merging alike clusters in a multi-speaker utterance. The studies are conducted on the Speakers in the Wild (SITW) and recent NIST SRE 2018 databases that contain multi-speaker conversational speech in noisy environments. The results show the effectiveness of the proposed penalty distance based refinement in such a scenario.
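As a concrete illustration of the pipeline described above, here is a minimal Python sketch of agglomerative clustering followed by a BIC-based merge step. The delta-BIC form, the `penalty_lambda` weight, and the full-covariance Gaussians per cluster are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def delta_bic(x_i, x_j, penalty_lambda=1.0):
    """Delta-BIC between two clusters of embeddings (rows = vectors).
    Negative values suggest the two clusters are better modelled jointly."""
    if min(len(x_i), len(x_j)) < 2:
        return np.inf                          # too few vectors for a covariance
    x_ij = np.vstack([x_i, x_j])
    n_i, n_j, n = len(x_i), len(x_j), len(x_ij)
    d = x_ij.shape[1]

    def logdet(x):
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)  # jitter for stability
        return np.linalg.slogdet(cov)[1]

    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)         # model-size penalty
    return (n * logdet(x_ij) - n_i * logdet(x_i)
            - n_j * logdet(x_j) - penalty_lambda * p)

def cluster_with_refinement(embeddings, n_init_clusters=4):
    """AHC to a fixed cluster count, then merge cluster pairs whose
    delta-BIC is negative (i.e., 'alike' clusters)."""
    z = linkage(embeddings, method="average", metric="cosine")
    labels = fcluster(z, t=n_init_clusters, criterion="maxclust")
    merged = True
    while merged:
        merged = False
        ids = np.unique(labels)
        for a in ids:
            for b in ids[ids > a]:
                if delta_bic(embeddings[labels == a],
                             embeddings[labels == b]) < 0:
                    labels[labels == b] = a    # merge b into a
                    merged = True
                    break
            if merged:
                break
    return labels
```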
... The performance of speaker verification degrades significantly when the speech is corrupted by interference speakers. Speaker diarization can be useful for speaker verification with nonoverlapping multi-talker speech [1][2][3][4][5][6]. It can effectively exclude unwanted speech segments when the speakers only slightly overlap [7,8]. ...
... Our system is a single SR system, whereas at least one of the best-performing systems in the SITW evaluation uses system fusion [36] or employs much more training data. Thus the performance gap between our system and the other systems in the evaluation should be smaller ...
Conference Paper
Full-text available
This paper investigates text-independent speaker recognition using neural embedding extractors based on the time-delay neural network. Our primary focus is to explore the teacher-student (TS) training framework for knowledge distillation in a text-independent (TI) speaker recognition task. We report the results on both proprietary and public benchmarks, obtaining competitive results with 88-93% smaller models. Particularly, in clean testing conditions, we find that TS training on neural-based TI systems achieved the same or better performance than their i-vector based counterparts. Neural embeddings are less prone to short-segment issues, and offer better performance particularly in the high-recall setting. They can also provide some additional insights about speakers, such as gender or how difficult a given speaker can be for recognition.
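A minimal PyTorch sketch of the TS idea described above: a frozen teacher's embeddings serve as regression targets for a much smaller student. The plain linear stacks stand in for the paper's TDNN extractors, and the MSE objective is an assumed choice, not necessarily the paper's loss.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a large teacher and an ~88-93% smaller student extractor.
teacher = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 256))
student = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 256))
teacher.eval()                        # teacher is frozen during distillation
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

def ts_step(features):
    """One distillation step: the student learns to mimic the teacher's embedding."""
    with torch.no_grad():
        target = teacher(features)
    loss = mse(student(features), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random "frame-averaged" 40-dim features.
for _ in range(3):
    ts_step(torch.randn(32, 40))
```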
... Apart from the data provided by the NIST SRE databases and the SITW database, corrupted data are generated artificially in order to have more control over the conditions. This operation comes at the cost of a loss in performance. The reader may refer to the analysis of (Novotný et al., 2016) for the difference between systems that use data sampled at 16 kHz and at 8 kHz. Our experiments deal with two types of nuisance: additive noise and duration variability. ...
Thesis
Full-text available
Speaker recognition witnessed considerable progress in the last decade, achieving very low error rates in controlled conditions. However, the implementation of this technology in real applications is hampered by the great degradation of performance in the presence of acoustic nuisances. A lot of effort has been invested by the research community in the design of nuisance compensation techniques in the past years. These algorithms operate at different levels: signal, acoustic parameters, models or scores. With the development of the "total variability" paradigm, new possibilities can be explored due to the simple statistical properties of the i-vector space. Our work falls within this framework and presents new compensation techniques which operate directly in the i-vector space. These algorithms use simple relationships between corrupted i-vectors and the corresponding clean versions and ignore the real effect of nuisances in this domain. In order to implement this methodology, pairs of clean and corrupted data are artificially generated and then used to develop nuisance compensation algorithms.
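The pair-based methodology lends itself to a very small sketch: learn a mapping from corrupted i-vectors back to their clean versions using artificially generated pairs. The affine model and ridge regulariser below are illustrative assumptions, not the thesis' actual algorithms.

```python
import numpy as np

def fit_denoising_map(clean, corrupted, alpha=1e-3):
    """Solve W in  clean ~= corrupted_aug @ W  by ridge-regularised least
    squares, where corrupted_aug carries an extra bias column."""
    x = np.hstack([corrupted, np.ones((len(corrupted), 1))])
    return np.linalg.solve(x.T @ x + alpha * np.eye(x.shape[1]), x.T @ clean)

def compensate(ivectors, w):
    """Map (possibly corrupted) i-vectors through the learned affine map."""
    x = np.hstack([ivectors, np.ones((len(ivectors), 1))])
    return x @ w

# Toy usage: 400-dim i-vectors, artificial corruption = clean + noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((1000, 400))
corrupted = clean + 0.3 * rng.standard_normal(clean.shape)
w = fit_denoising_map(clean, corrupted)
restored = compensate(corrupted, w)
```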
... The aim of this study is to evaluate one of the well-performing systems from the NIST challenges [5] in the case of a limited training dataset (total duration of speech is about 36 minutes) and to compare it with the algorithm developed at the University of Novi Sad [6]. In the last decade, systems have been based on i-vectors extracted from mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) or bottleneck (BN) features, in combination with linear discriminant analysis (LDA) or nearest-neighbor discriminant analysis (NDA) and probabilistic linear discriminant analysis (PLDA) [7,8,9,10,3]. Deep neural networks were not taken into consideration since the amount of training data was insufficient for good parameter estimation. ...
Conference Paper
Full-text available
The paper presents results of an evaluation of covariance-matrix and i-vector based speaker identification methods on the Serbian S70W100s120 database. An open-set speaker identification evaluation scheme was adopted. The numbers of target speakers and impostors were 20 and 60, respectively. Additional utterances from 41 speakers were used for training. The amount of data for modeling a target speaker was limited to about 4 s of speech. In this study, the i-vector based approach showed significantly better performance (equal error rate, EER ~5%) than the covariance-matrix based approach (EER ~16%). This small EER for the i-vector based approach was obtained after a substantial reduction of the number of parameters in the universal background model, the i-vector transformation matrix and Gaussian probabilistic linear discriminant analysis, relative to what is typically reported in the literature. Additionally, these experiments showed that cepstral mean and variance normalization can deteriorate the EER in the single-channel case.
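Since the abstract singles out cepstral mean and variance normalization (CMVN) as a step that can hurt in single-channel conditions, here is a minimal per-utterance CMVN sketch for reference.

```python
import numpy as np

def cmvn(cepstra, eps=1e-8):
    """Per-utterance CMVN: normalize each cepstral dimension to zero mean and
    unit variance over the frames of one utterance
    (rows = frames, columns = cepstral coefficients)."""
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0)
    return (cepstra - mean) / (std + eps)
```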
... So the results are for 49 user authentications without cross-validation. We used the Bosaris toolkit [6] with a widely used fusion approach for speaker identification/authentication [16] or Query by Example Search on Speech [19] for comparison of results. We used half of the testing set as a development subset for training the fusion function and applied it to the rest of the testing set, the evaluation subset. ...
Article
Full-text available
In this paper we investigate the capacity of sound & timing information during typing of a password for the user identification and authentication task. The novelty of this paper lies in the comparison of performance between improved timing-based and audio-based keystroke dynamics analysis, and in their fusion for keystroke authentication. We collected data of 50 people typing the same given password 100 times, divided into 4 sessions of 25 typings, and tested how well the system could recognize the correct typist. Using fusion of the calibrated timing scores (9.73% EER) and audio scores (8.99% EER) described in the paper, we achieved 4.65% EER (equal error rate) for the authentication task. The results show the potential of using Audio Keystroke Dynamics information as a way to authenticate or identify users during log-on.
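A minimal sketch of the score-level fusion and EER evaluation described above. The paper used the BOSARIS toolkit for fusion, so the logistic-regression fusion weights and the simple EER search below are stand-in assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion(timing_scores, audio_scores, labels):
    """Learn linear fusion weights on a development subset;
    returns a function that scores new (timing, audio) pairs."""
    x = np.column_stack([timing_scores, audio_scores])
    clf = LogisticRegression().fit(x, labels)
    return lambda t, a: clf.decision_function(np.column_stack([t, a]))

def eer(scores, labels):
    """Equal error rate: the point where false-acceptance and false-rejection
    rates meet. labels: 1 = genuine user, 0 = impostor."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]            # accept top-scoring trials first
    labels = labels[order]
    n_imp = max(int((labels == 0).sum()), 1)
    n_tar = max(int((labels == 1).sum()), 1)
    far = np.cumsum(labels == 0) / n_imp        # impostors accepted so far
    frr = 1.0 - np.cumsum(labels == 1) / n_tar  # genuine users still rejected
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0
```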
Article
Full-text available
Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interference speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora in terms of equal error rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.
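A toy PyTorch sketch of the multi-task objective described above: a speaker-attention (extraction) branch and a speaker-embedding branch trained jointly. The module shapes and the 0.5 loss weight are assumptions, not the tSV recipe.

```python
import torch
import torch.nn as nn

class ToyTSV(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=128, n_speakers=100):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.embed = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.classify = nn.Linear(emb_dim, n_speakers)

    def forward(self, mixture, clean, speaker_id):
        mask = self.attention(mixture)           # soft mask toward the target talker
        extracted = mask * mixture
        emb = self.embed(extracted).mean(dim=1)  # pool over frames
        extract_loss = nn.functional.mse_loss(extracted, clean)
        spk_loss = nn.functional.cross_entropy(self.classify(emb), speaker_id)
        return extract_loss + 0.5 * spk_loss     # joint multi-task objective

# Toy usage: batch of 8 utterances, 200 frames, 40-dim features.
model = ToyTSV()
loss = model(torch.randn(8, 200, 40), torch.randn(8, 200, 40),
             torch.randint(0, 100, (8,)))
```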
Article
In this paper, we present a brief history and a “longitudinal study” of all important milestone modelling techniques used in text-independent speaker recognition since Brno University of Technology (BUT) first participated in the NIST Speaker Recognition Evaluation (SRE) in 2006: GMM-MAP, GMM-MAP with eigen-channel adaptation, Joint Factor Analysis, i-vector and DNN embedding (x-vector). To emphasize the historical context, the techniques are evaluated on all NIST SRE sets since 2004 on a time-machine principle, i.e. a system is always trained using all data available up till the year of evaluation. Moreover, as user-contributed audiovisual content dominates today's Internet, we representatively include the Speakers In The Wild (SITW) and VOiCES challenge datasets in the evaluation of our systems. Not only do we present a comparison of the modelling techniques, but we also show the effect of sampling frequency.
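The time-machine principle reduces to a simple data-selection rule, sketched below with illustrative corpus names and release years (assumptions, not the paper's exact training lists).

```python
# Hypothetical corpus-to-release-year map; the paper's actual lists differ.
TRAIN_CORPORA = {"SRE04": 2004, "SRE05": 2005, "SRE06": 2006,
                 "SRE08": 2008, "SRE10": 2010, "SITW": 2016}

def training_sets_for(eval_year):
    """Time-machine rule: only corpora released up to the evaluation year."""
    return sorted(name for name, year in TRAIN_CORPORA.items() if year <= eval_year)

print(training_sets_for(2008))   # ['SRE04', 'SRE05', 'SRE06', 'SRE08']
```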
Article
The automatic analysis of conversational audio remains difficult, in part, due to the presence of multiple talkers speaking in turns, often with significant intonation variations and overlapping speech. The majority of prior work on psychoacoustic speech analysis and system design has focused on single-talker speech or multi-talker speech with overlapping talkers (for example, the cocktail party effect). There has been much less focus on how listeners detect a change in talker or in probing the acoustic features significant in characterizing a talker's voice in conversational speech. This study examines human talker change detection (TCD) in multi-party speech utterances using a behavioral paradigm in which listeners indicate the moment of perceived talker change. Human reaction times in this task can be well-estimated by a model of the acoustic feature distance among speech segments before and after a change in talker, with estimation improving for models incorporating longer durations of speech prior to a talker change. Further, human performance is superior to several online and offline state-of-the-art machine TCD systems.
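A minimal sketch of the distance model described above: compare averaged acoustic features in windows before and after a candidate change point; larger distances should predict faster human detection. The feature type (e.g., MFCCs) and the context length are assumptions.

```python
import numpy as np

def change_point_distance(features, t, context=100):
    """Euclidean distance between mean feature vectors in windows of up to
    `context` frames before and after frame index t (rows = frames)."""
    before = features[max(0, t - context):t].mean(axis=0)
    after = features[t:t + context].mean(axis=0)
    return float(np.linalg.norm(after - before))

# Toy usage: the model predicts faster detection when this distance is large;
# longer `context` corresponds to more pre-change speech, which improved the fit.
feats = np.random.default_rng(0).standard_normal((500, 13))
print(change_point_distance(feats, t=250, context=100))
```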
Article
Speaker and language recognition and characterization is an exciting area of research that has gained importance in the field of speech science and technology. This special issue features, among other contributions, some of the most remarkable ideas presented and discussed at Odyssey 2016: the Speaker and Language Recognition Workshop, held in Bilbao, Spain, in June 2016. This introduction provides an overview of the selected papers in the context of current challenges.
Article
Full-text available
We introduce a new database for evaluation of speaker recognition systems. This database involves types of variability already seen in NIST speaker recognition evaluations (SREs), like language, channel, speech style and vocal effort, and new types not yet available on any standard database, like severe noise and reverberation. The database is created using data from NIST SREs from 2004 to 2010. We present results of a state-of-the-art system on different subsets of this database. The database will be publicly available, and this work aims at encouraging other sites to adopt and improve it.
Conference Paper
Full-text available
In this paper we present a novel language-independent bottleneck (BN) feature extraction framework. In our experiments we used a multilingual artificial neural network (ANN), where each language is modelled by a separate output layer, while all the hidden layers jointly model the variability of all the source languages. The key idea is that the entire ANN is trained on all the languages simultaneously, thus the BN-features are not biased towards any of the languages. Exactly for this reason, the final BN-features are considered language independent. In experiments with the GlobalPhone database, we show that multilingual BN-features consistently outperform monolingual BN-features. Cross-lingual generalization is also evaluated, where we train on 5 source languages and test on 3 other languages. The results show that the ANN can produce very good BN-features even for unseen languages, in some cases even better than if we trained the ANN on the target language only.
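A minimal PyTorch sketch of the multilingual architecture described above: shared hidden layers ending in a bottleneck, with one output layer per language. Layer sizes, the 80-dimensional bottleneck, and the per-language target counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultilingualBN(nn.Module):
    def __init__(self, feat_dim, bn_dim, targets_per_lang):
        super().__init__()
        # Hidden layers are shared across all source languages.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim))               # the bottleneck layer
        # One output layer per language.
        self.heads = nn.ModuleDict({lang: nn.Linear(bn_dim, n)
                                    for lang, n in targets_per_lang.items()})

    def forward(self, x, lang):
        bn = self.shared(x)                        # language-independent BN-features
        return self.heads[lang](bn), bn

# Toy usage: three hypothetical source languages with phone-state targets.
net = MultilingualBN(feat_dim=40, bn_dim=80,
                     targets_per_lang={"cs": 3000, "de": 2800, "pt": 2500})
logits, bn_feats = net(torch.randn(16, 40), "cs")
```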
Conference Paper
Full-text available
This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.
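A minimal sketch of finding (c): an SVM fuses subsystem scores together with side-information indicators. The feature layout and the sklearn SVM below are stand-ins for the paper's fusion algorithm, not its actual implementation.

```python
import numpy as np
from sklearn.svm import SVC

def fit_fusion(subsystem_scores, side_info, labels):
    """subsystem_scores: (n_trials, n_systems); side_info: (n_trials, k)
    encoded indicators (e.g., gender, channel id); labels: 0/1 per trial."""
    x = np.hstack([subsystem_scores, side_info])
    return SVC(kernel="linear").fit(x, labels)

# Toy usage with three hypothetical SID subsystems and two side-info fields.
rng = np.random.default_rng(1)
scores = rng.standard_normal((200, 3))
side = rng.integers(0, 2, (200, 2)).astype(float)
labels = rng.integers(0, 2, 200)
svm = fit_fusion(scores, side, labels)
fused = svm.decision_function(np.hstack([scores, side]))
```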
Article
Full-text available
We report on work on speaker diarization of telephone conversations which was begun at the Robust Speaker Recognition Workshop held at Johns Hopkins University in 2008. Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation. The systems are a Baseline agglomerative clustering system, a Streaming system which uses speaker factors for speaker change point detection and traditional methods for speaker clustering, and a Variational Bayes system designed to exploit a large number of speaker factors as in state of the art speaker recognition systems. The Variational Bayes system proved to be the most effective, achieving a diarization error rate of 1.0% on the summed-channel data. This represents an 85% reduction in errors compared with the Baseline agglomerative clustering system. An interesting aspect of the Variational Bayes approach is that it implicitly performs speaker clustering in a way which avoids making premature hard decisions. This type of soft speaker clustering can be incorporated into other diarization systems (although causality has to be sacrificed in the case of the Streaming system). With this modification, the Baseline system achieved a diarization error rate of 3.5% (a 50% reduction in errors).
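A toy sketch of the soft speaker clustering idea highlighted above: keep posterior responsibilities over speakers rather than making premature hard decisions, and re-estimate speaker models from weighted statistics. Spherical Gaussians with equal priors are a simplifying assumption, far from the full Variational Bayes system.

```python
import numpy as np

def soft_assign(segment_embs, speaker_means, sigma2=1.0):
    """Posterior responsibility of each speaker for each segment, under
    spherical Gaussians with equal priors (rows = segments)."""
    d2 = ((segment_embs[:, None, :] - speaker_means[None, :, :]) ** 2).sum(-1)
    log_post = -0.5 * d2 / sigma2
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def update_means(segment_embs, post):
    """Re-estimate speaker means from soft counts instead of hard labels."""
    return (post.T @ segment_embs) / post.sum(axis=0)[:, None]
```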
Article
Full-text available
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
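A minimal sketch of the best-performing scoring chain reported above: LDA followed by WCCN, then cosine scoring between i-vectors. The LDA step is left to an off-the-shelf implementation; the jitter term is an added assumption for numerical stability.

```python
import numpy as np

def wccn(ivecs, labels):
    """Within-class covariance normalization: a whitening matrix B such that
    the average within-speaker covariance becomes identity
    (needs at least 2 i-vectors per speaker)."""
    d = ivecs.shape[1]
    classes = np.unique(labels)
    w = sum(np.cov(ivecs[labels == c], rowvar=False) for c in classes) / len(classes)
    return np.linalg.cholesky(np.linalg.inv(w + 1e-6 * np.eye(d)))

def cosine_score(enroll, test):
    """Cosine similarity used directly as the verification score."""
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(enroll @ test)

# Usage: an LDA projection (e.g., sklearn's LinearDiscriminantAnalysis fit on
# the development i-vectors) would precede WCCN; then score projected vectors:
#   score = cosine_score(B.T @ ivec_enroll, B.T @ ivec_test)
```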
Conference Paper
Full-text available
This work investigates the recently proposed Bottle-Neck features for ASR. The bottle-neck ANN structure is imported into the Split Context architecture, gaining a significant WER reduction. Further, a Universal Context architecture was developed, which simplifies the system by using only one universal ANN for all temporal splits. A significant WER reduction can be obtained by applying fMPE on top of our BN features as a technique for discriminative feature extraction, and a further gain is obtained by retraining model parameters using the MPE criterion. The results are reported on meeting data from the RT07 evaluation. Index Terms: Bottle-neck, ANN architecture, features, LVCSR
Conference Paper
Full-text available
Many current face recognition algorithms perform badly when the lighting or pose of the probe and gallery images differ. In this paper we present a novel algorithm designed for these conditions. We describe face data as resulting from a generative model which incorporates both within-individual and between-individual variation. In recognition we calculate the likelihood that the differences between face images are entirely due to within-individual variability. We extend this to the non-linear case where an arbitrary face manifold can be described and noise is position-dependent. We also develop a "tied" version of the algorithm that allows explicit comparison across quite different viewing conditions. We demonstrate that our model produces state of the art results for (i) frontal face recognition (ii) face recognition under varying pose.
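The core likelihood test described above has a compact closed form under a two-covariance Gaussian model, which is a simplified, linear stand-in for the paper's generative model (the non-linear and "tied" extensions are not captured here).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def llr_same_vs_different(x1, x2, b_cov, w_cov):
    """log p(x1,x2 | same individual) - log p(x1,x2 | different individuals)
    for zero-mean features, with between-individual covariance b_cov and
    within-individual covariance w_cov."""
    d = len(x1)
    # Same individual: x1 and x2 share a latent identity variable, so the
    # joint covariance has off-diagonal blocks equal to b_cov.
    same = mvn.logpdf(np.hstack([x1, x2]), mean=np.zeros(2 * d),
                      cov=np.block([[b_cov + w_cov, b_cov],
                                    [b_cov, b_cov + w_cov]]))
    # Different individuals: the two observations are independent.
    diff = (mvn.logpdf(x1, cov=b_cov + w_cov)
            + mvn.logpdf(x2, cov=b_cov + w_cov))
    return same - diff

# Toy check: identical vectors should score higher than opposite ones.
d = 4
b, w = np.eye(d), 0.5 * np.eye(d)
assert (llr_same_vs_different(np.ones(d), np.ones(d), b, w)
        > llr_same_vs_different(np.ones(d), -np.ones(d), b, w))
```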
Conference Paper
Full-text available
Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today's state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aids in the understanding of this novel acoustic feature space, and that it facilitates its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.
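A toy numpy sketch of the FFV idea: correlate the magnitude spectrum of a "left" analysis window with dilated versions of a "right" window, so that the best-matching dilation reflects local change in fundamental frequency without an explicit pitch estimate. The window handling and dilation grid are illustrative, not the reference implementation described above.

```python
import numpy as np

def ffv_profile(left, right, dilations=np.linspace(0.9, 1.1, 41)):
    """Normalized correlation between the left-window spectrum and dilated
    right-window spectra; a flat pitch yields a peak near dilation 1.0."""
    fl = np.abs(np.fft.rfft(left * np.hanning(len(left))))
    fr = np.abs(np.fft.rfft(right * np.hanning(len(right))))
    bins = np.arange(len(fl))
    profile = []
    for rho in dilations:
        # Resample the right spectrum at dilated bin positions.
        fr_d = np.interp(bins * rho, bins, fr, left=0.0, right=0.0)
        den = np.linalg.norm(fl) * np.linalg.norm(fr_d) + 1e-12
        profile.append(float(fl @ fr_d) / den)
    return np.array(profile)
```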
Conference Paper
Features based on a hierarchy of neural networks with compressive layers - Stacked Bottle-Neck (SBN) features - were recently shown to provide excellent performance in LVCSR systems. This paper summarizes several techniques investigated in our work towards Babel 2014 evaluations: (1) using several versions of fundamental frequency (F0) estimates, (2) semi-supervised training on un-transcribed data and mainly (3) adapting the NN structure at different levels. They are tested on three 2014 Babel languages with full GMM- and DNN-based systems. Separately and in combination, they are shown to outperform the baselines and confirm the usefulness of bottle-neck features in current ASR systems.
Sandro Cumani, Pietro Laface, and Oldřich Plchot, "On the use of i-vector posterior distributions in probabilistic linear discriminant analysis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 846-857, 2014.
Y. Lei, L. Ferrer, M. McLaren, and N. Scheffer, "Comparative study on the use of senone-based deep neural networks for speaker recognition," submitted to IEEE Transactions on Audio, Speech, and Language Processing, 2014.
Daniel Garcia-Romero, "Analysis of i-vector length normalization in Gaussian-PLDA speaker recognition systems," 2011.
"Bosaris toolkit," https://sites.google.com/site/bosaristoolkit/.
"The 2010 NIST speaker recognition evaluation plan (SRE10)," http://www.itl.nist.gov/iad/mig/tests/sre/2010/.
Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, "Analysis of DNN approaches to speaker identification," in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2016.
Pavel Matějka, Lukáš Burget, Petr Schwarz, and Jan Černocký, "Brno University of Technology system for NIST 2005 language recognition evaluation," in Proceedings of Odyssey 2006, San Juan, PR, 2006.