-
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
-
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
-
IEEE Transactions on Audio, Speech & Language Processing. 01/2011; 19:788-798.
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan; 01/2009
-
INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
-
INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan; 01/2009
-
IEEE Transactions on Audio, Speech & Language Processing. 01/2008; 16:980-988.
-
INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008; 01/2008
-
IEEE Transactions on Audio, Speech & Language Processing. 01/2007; 15:2095-2103.
-
INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007; 01/2007
-
INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007; 01/2007
-
ABSTRACT: We tested factor analysis models having various numbers of speaker factors on the core condition and the extended data condition of the 2006 NIST speaker recognition evaluation. In order to ensure strict disjointness between training and test sets, the factor analysis models were trained without using any of the data made available for the 2005 evaluation. The factor analysis training set consisted primarily of Switchboard data and so was to some degree mismatched with the 2006 test data (drawn from the Mixer collection). Consequently, our initial results were not as good as those submitted for the 2006 evaluation. However we found that we could compensate for this by a simple modification to our score normalization strategy, namely by using 1000 z-norm utterances in zt-norm. Our purpose in varying the number of speaker factors was to evaluate the eigenvoice MAP and classical MAP components of the inter-speaker variability model in factor analysis. We found that on the core condition (i.e. 2–3 minutes of enrollment data), only the eigenvoice MAP component plays a useful role. On the other hand, on the extended data condition (i.e. 15–20 minutes of enrollment data) both the classical MAP component and the eigenvoice component proved to be useful provided that the number of speaker factors was limited. Our best result on the extended data condition (all trials) was an equal error rate of 2.2% and a detection cost of 0.011.
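The zt-norm score normalization mentioned in this abstract can be sketched as follows. This is a minimal illustration of the usual definition (z-norm followed by t-norm), not the authors' implementation; the function name `zt_norm`, the matrix names, and the array shapes are assumptions made for the example.

```python
import numpy as np

def zt_norm(S, S_z, S_t, S_tz):
    """Apply z-norm followed by t-norm (zt-norm) to a score matrix.

    All names and shapes are illustrative assumptions:
      S    : (M, N) raw scores, target models vs. test utterances
      S_z  : (M, Z) target models vs. impostor z-norm utterances
      S_t  : (T, N) impostor t-norm models vs. test utterances
      S_tz : (T, Z) t-norm models vs. z-norm utterances
    """
    # z-norm: standardise each target model's scores with its own
    # impostor-utterance statistics.
    S_zn = (S - S_z.mean(axis=1, keepdims=True)) / S_z.std(axis=1, keepdims=True)

    # The t-norm models are z-normalised in the same way.
    S_t_zn = (S_t - S_tz.mean(axis=1, keepdims=True)) / S_tz.std(axis=1, keepdims=True)

    # t-norm: standardise each test utterance's score with the statistics
    # of the (z-normalised) t-norm model scores for that utterance.
    return (S_zn - S_t_zn.mean(axis=0, keepdims=True)) / S_t_zn.std(axis=0, keepdims=True)


# Synthetic scores only; the sizes below are arbitrary.
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 5))
S_z = rng.normal(size=(3, 50))
S_t = rng.normal(size=(20, 5))
S_tz = rng.normal(size=(20, 50))
normalized = zt_norm(S, S_z, S_t, S_tz)
```

In this sketch, the number of z-norm utterances corresponds to the second dimension of `S_z` and `S_tz`; the abstract reports using 1000 of them.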
-
ABSTRACT: We present a new approach for constructing the kernels used to build support vector machines for speaker verification. The idea is to construct new kernels by taking a linear combination of many kernels, such as the GLDS and GMM supervector kernels. In this new kernel combination, the combination weights are speaker dependent, rather than universal weights as in score-level fusion, and no extra data is needed to estimate them. An experiment on the NIST 2006 speaker recognition evaluation dataset (all trials) was done using three different kernel functions (GLDS kernel, linear and Gaussian GMM supervector kernels). We compared our kernel combination to the optimal linear score fusion obtained using logistic regression. This optimal score fusion was trained on the same test data. We obtained an equal error rate of ≈ 5.9% using the kernel combination technique, which is better than that of the optimal score fusion system (≈ 6.0%).
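The idea of building a new kernel as a linear combination of existing kernels can be illustrated with a small sketch. The abstract does not say how the speaker-dependent weights are estimated, so the weights below are hypothetical placeholders, and the restriction to non-negative weights is an assumption made here so that the combined matrix is guaranteed to remain a valid kernel. Only linear and Gaussian kernels are shown; the GLDS kernel is not implemented.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of the linear kernel on the rows of X."""
    return X @ X.T

def gaussian_kernel(X, gamma):
    """Gram matrix of the Gaussian (RBF) kernel on the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine_kernels(grams, weights):
    """Weighted sum of Gram matrices.

    Non-negative weights are an assumption of this sketch: a non-negative
    combination of positive semi-definite matrices is itself positive
    semi-definite, so the result can be passed to an SVM as a kernel.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative in this sketch")
    return sum(w * K for w, K in zip(weights, grams))


# Synthetic "supervectors"; the weights are hypothetical per-speaker values.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
grams = [linear_kernel(X), gaussian_kernel(X, gamma=0.5)]
K_combined = combine_kernels(grams, weights=[0.3, 0.7])
```

A precomputed Gram matrix of this kind can be handed to SVM implementations that accept precomputed kernels, with one weight vector per target speaker.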
-
Lukáš Burget,
Niko Brümmer,
Douglas Reynolds,
Patrick Kenny,
Jason Pelecanos,
Robbie Vogt,
Fabio Castaldo,
Najim Dehak,
Reda Dehak,
Ondřej Glembek,
Zahi N Karam,
John Noecker,
Elly (Hye Young) Na,
Ciprian Constantin Costin,
Valiantsina Hubeika,
Sachin Kajarekar,
Nicolas Scheffer,
Jan " Honza
-
ABSTRACT: In recent work [1], a simplified and highly effective approach to speaker recognition based on the cosine similarity between low-dimensional vectors, termed ivectors, defined in a total variability space was introduced. The total variability space representation is motivated by the popular Joint Factor Analysis (JFA) approach, but does not require the complication of estimating separate speaker and channel spaces and has been shown to be less dependent on score normalization procedures, such as z-norm and t-norm. In this paper, we introduce a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker ivectors. By avoiding the complication of z- and t-norm, the new approach further allows for application of a new unsupervised speaker adaptation technique to models defined in the ivector space. Experiments are conducted on the core condition of the NIST 2008 corpora, where, with adaptation, the new approach produces an equal error rate (EER) of 4.8% and min decision cost function (MinDCF) of 2.3% on all female speaker trials.
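A rough sketch of cosine scoring between ivectors, together with one possible way of folding impostor mean and covariance statistics into the score, is given below. The abstract does not state the exact formula, so `impostor_normalized_score` is an assumed form (centering on the impostor mean and scaling by the square root of a diagonal impostor covariance) and should not be read as the authors' definition. All function names are hypothetical.

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Plain cosine similarity between two ivectors."""
    return float(w_target @ w_test /
                 (np.linalg.norm(w_target) * np.linalg.norm(w_test)))

def impostor_normalized_score(w_target, w_test, impostors):
    """Cosine-like score using impostor statistics (assumed form).

    impostors: (K, D) array of impostor ivectors. The mean centers both
    ivectors; the per-dimension standard deviation (square root of a
    diagonal covariance) scales the norms in the denominator.
    """
    mean_imp = impostors.mean(axis=0)
    std_imp = impostors.std(axis=0)
    numerator = (w_target - mean_imp) @ (w_test - mean_imp)
    denominator = (np.linalg.norm(std_imp * w_target) *
                   np.linalg.norm(std_imp * w_test))
    return float(numerator / denominator)


# Synthetic ivectors; the dimension and impostor count are arbitrary.
rng = np.random.default_rng(0)
impostors = rng.normal(size=(200, 10))
w_target = rng.normal(size=10)
w_test = rng.normal(size=10)
plain = cosine_score(w_target, w_test)
normalized = impostor_normalized_score(w_target, w_test, impostors)
```

Because the impostor statistics are computed once from a fixed collection, a score of this kind needs no per-trial z-norm or t-norm cohort scoring, which is the property the abstract highlights.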
-
ABSTRACT: It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with the following two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the female portion of the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA. We achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
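For comparison with the 13% relative figure quoted for the equal error rate, the detection cost values reported above can be converted to a relative reduction in the same way. The helper name `relative_reduction` is an assumption for this example; only the two detection cost values come from the abstract.

```python
def relative_reduction(before, after):
    """Relative reduction of an error metric, as a fraction of its old value."""
    return (before - after) / before


# Minimum detection cost values reported in the abstract.
dcf_reduction = relative_reduction(0.0219, 0.0164)
print(f"{dcf_reduction:.1%}")  # prints 25.1%
```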