Speech Communication 110 (2019) 101–107
Evaluation of Nuance Forensics 9.2 and 11.1 under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01)
Michael Jessen a,*, Gur Meir b, Yosef A. Solewicz b

a Forensic Science Institute, Bundeskriminalamt, Wiesbaden, Germany
b Division of Identification and Forensic Science, Israel Police, Jerusalem, Israel
Keywords: Forensic voice comparison, Automatic speaker recognition, Evaluation, Nuance Forensics
Abstract

Two automatic speaker recognition systems, Nuance Forensics 9.2 and 11.1, were tested within the setting of the Speech Communication virtual special issue "Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)". Nuance Forensics 9.2 is an i-vector PLDA system and Nuance Forensics 11.1 combines i-vector PLDA technology with some Deep Neural Network functionalities. Both systems were tested in three variants. The difference between the first and second variant lies in the size of the "Reference Population" (42 vs. 105 speakers), and the difference between the first two and the third variant lies in the use of the "Background Model", either working with a system default (first two variants) or a dedicated model drawn from the forensic_eval_01 training data (third variant). The Reference Population is used for the purpose of calibration (arriving at calibrated likelihood ratios from voice comparison scores); the Background Model is used for normalising the scores (Adaptive S-norm). Comparing the three variants, it was shown across the two systems that the inclusion of a Background Model that is dedicated to the conditions of the case leads to improved performance over the use of a system default. The difference in the size of the Reference Population, however, did not matter. Comparing the two systems, it was found that the system that includes Deep Neural Network technology leads to improved results over the use of a pure i-vector PLDA system.
1. Introduction
With the goal of evaluating the degree of validity and reliability of various voice comparison systems under forensically realistic conditions, the Australian-English-based validation test called forensic_eval_01 was constructed and made available to the community by Geoffrey Stewart Morrison and Ewald Enzinger. Details of the motivation behind forensic_eval_01, its audio file content and set of tasks are provided by Morrison and Enzinger (2016).
The present paper describes the application of forensic_eval_01 to forensic automatic speaker recognition software sold by the company Nuance. Nuance offers an i-vector PLDA system called Nuance Forensics 9.2 and has more recently released a product called Nuance Forensics 11.1. This more recent product combines i-vector PLDA technology (Probabilistic Linear Discriminant Analysis) with procedures based on Deep Neural Networks (DNN).
In both models of Nuance Forensics, a set of recordings called the "Reference Population" is used as part of the method to calculate calibrated likelihood ratios.¹ While a range of pre-processed sets is available, the user can create a Reference Population that is domain-specific to the forensic task at hand. In forensic_eval_01 such case-relevant data are available in a training set and they have been used in the present experiments. In variations of the evaluation, the impact of the size (number of speakers) of the Reference Population was investigated.
For the purpose of providing score normalisation, Nuance Forensics in both of its versions has a module called the "Background Model". The composition of this Background Model can be left at a default setting or fed by the user with case-relevant data, here again derived from the training set of forensic_eval_01. Whether such a domain-specific normalisation set leads to an improvement of voice comparison performance over a domain-independent internal set will be another research question of the present study.

All these options have been tested for both versions of Nuance Forensics. The comparison of these results will show whether the combination of i-vector PLDA technology with DNN technology leads to improvement over the use of a pure i-vector PLDA system.
* Corresponding author. E-mail address: michael.jessen@bka.bund.de (M. Jessen).
¹ "Reference Population" is a technical term used by Nuance and some other systems. It should be kept in mind that any appropriate data input will be a sample of the population; they will not be the population. We will proceed to use this term with capitals but without quotation marks.
https://doi.org/10.1016/j.specom.2019.04.006
Received 2 August 2018; Received in revised form 3 March 2019; Accepted 9 April 2019
Available online 10 April 2019
2. Description and application of the tested Nuance systems
Two versions of the commercial software Nuance Forensics were tested. The first version was Nuance Forensics 9.2 (henceforth NF 9.2) and the second was Nuance Forensics 11.1 (henceforth NF 11.1).² NF 9.2 has been available since 2014 and NF 11.1 since early 2018. The main difference between these systems lies in the "engine", including the stages from the feature extraction level to the calculation of scores. The subsequent score-to-likelihood-ratio processing is the same in both versions. Drawing from the information provided with the system descriptions (manual/help files), supplemented by information requested from the developers, the differences between the two versions will be summarized in Section 2.1 and the common likelihood ratio calculation method in Section 2.2. This is followed in 2.3 by a description and motivation of the way the NF systems were used in order to perform the tests specified in forensic_eval_01.
2.1. NF 9.2 and NF 11.1 engines
NF 9.2 is an i-vector PLDA system (Dehak et al., 2011). It has two engines that run in parallel, one using Perceptual Linear Predictive parameters (PLP; Hermansky, 1990) as features and the other one Mel Frequency Cepstral Coefficients (MFCC; Davis and Mermelstein, 1980). The PLP parameters consist of 19 cepstral coefficients, 19 delta (velocity) coefficients and 7 double delta (acceleration) coefficients (Furui, 1986), whereas the MFCC parameters consist of 12 cepstral coefficients and 13 delta coefficients. Window length for feature extraction is 20 ms and step size is 10 ms. Feature warping (Pelecanos and Sridharan, 2001) is applied to the output of both sets of features. Voice Activity Detection (VAD) is applied prior to feature extraction. The VAD model is based on Neural Network phonetic decoding. The decoders are hybrid HMM-NN (Hidden Markov Model-Neural Network) models trained to recognize 11 language-independent broad phone classes, such as liquids, nasals, fricatives etc. (Castaldo et al., 2007) or detailed English-US acoustic units.
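For illustration, the sketch below shows a minimal MFCC front-end with the parameters quoted above (20 ms window, 10 ms step, 12 cepstral plus 13 delta coefficients). It uses librosa in place of Nuance's internal, undocumented extractor; the 8 kHz sampling rate and the dropping of c0 are our assumptions, and feature warping and VAD are omitted.

import librosa
import numpy as np

def mfcc_features(wav_path, sr=8000):
    # Assumed 8 kHz telephone-bandwidth audio; NF's actual front-end
    # settings beyond those quoted in the text are not published.
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)              # 20 ms analysis window
    hop_length = int(0.010 * sr)         # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_length)
    delta = librosa.feature.delta(mfcc)  # velocity coefficients
    # 12 cepstral coefficients (c1-c12, c0 dropped) + 13 deltas
    return np.vstack([mfcc[1:], delta])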
Each of the two engines independently produces a score. The scores from the two engines are then fused. The fusion is a linear weighted fusion of the systems based on the discrimination power of each engine. The weights had been determined empirically by the developer, testing with a large sweep of weight values to narrow down the best available combination (this testing has no relation to forensic_eval_01 training data).
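In the simplest reading this amounts to a fixed linear combination of the engine scores. The sketch below illustrates the idea; the weight values are placeholders, since the actual NF weights are proprietary.

def fuse_scores(engine_scores, weights):
    # Linear weighted fusion of per-engine scores; weights reflect the
    # discrimination power of each engine (actual values unpublished).
    assert len(engine_scores) == len(weights)
    return sum(w * s for w, s in zip(weights, engine_scores))

# Illustrative usage with placeholder weights:
# fused = fuse_scores([score_plp, score_mfcc], [0.5, 0.5])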
The acoustic data for each feature extraction method are modelled as i-vectors (400-dimensional vectors), and scores for the comparison of questioned-speaker and known-speaker recordings are calculated using PLDA (Prince and Elder, 2007). The data used for training the T matrix and for training PLDA are part of the system and not subject to user input. They are based on collections of several hundred audio files from diverse sets of data including English and non-English language data. These data are not domain-specific to forensic casework and they have an emphasis on telephonic conditions. The user input focusses on Reference Populations and optionally Background Models, to be addressed in Section 2.2, and these user-provided recordings should be domain-specific to the forensic conditions at hand.
NF 11.1 has three engines that run in parallel, whose scores are subsequently fused. Two engines are classical i-vector systems, one based on MFCCs and the other on PLPs. The third engine is an MFCC-based DNN senone posterior i-vector system in which output from a DNN (Deep Neural Network) is used to derive i-vectors. Instead of using classical GMMs (Gaussian Mixture Models), in which speech content is not directly addressed, automatic speech recognition routines are used in order to detect different senones (tied triphone states) and a DNN is trained
to classify these senones (see Lei et al., 2014, for more information on the DNN method used). 57 MFCC coefficients (19 basic/19 delta/19 double delta) are used with the first engine, 60 PLP (20/20/20) with the second, and 56 MFCCs (18/19/19) with the DNN/i-vector engine. The default classifier used with NF 11.1 is a Pairwise Support Vector Machine (PSVM; Cumani et al., 2013). PLDA is available as an option, but in the current study we tested PSVM. The VAD model used in NF 11.1 is the same as the one in 9.2.

² NF 11.1 is embedded into a larger software environment called Security Suite Forensics, whereas NF 9.2 is more of a stand-alone piece of software. These and other structural differences that have no influence on the automatic voice comparison methods will not be mentioned further.

Fig. 1. Plot showing Cllr mean versus 95% CI (left panel) and Cllr pooled (right panel).
2.2. NF score normalisation and likelihood ratio calculation method
The likelihood ratio of a trial (i.e. a comparison between a questioned-speaker and a known-speaker recording) is expressed in NF in terms of a log10 likelihood ratio. The numerator and the denominator of the log10 likelihood ratio (henceforth LLR) are calculated as follows.

In trials in which the known speakers are represented by a single recording (which is the case in forensic_eval_01), the numerator of the LLR is calculated based on an intra-speaker score distribution that is supplied with the NF system and is fixed. This intra-speaker score distribution is the result of multiple same-speaker comparisons (commonly known as "target trials") based on the NIST evaluations SRE08 and SRE10 (see https://www.nist.gov/itl/iad/mig/speaker-recognition). These target trial data are modelled by a Gaussian distribution with a mean score value of 4 and a standard deviation of 1.3. The numerator of the LLR is then the likelihood of the latter model evaluated at the score value for the comparison of the questioned- and known-speaker recording (also known as the "evidence").³
The denominator of the LLR is calculated based on an inter-speaker score distribution. This inter-speaker distribution is built from comparisons between the questioned speaker and all the speakers in a Reference Population. Since the questioned speaker varies from trial to trial, the exact structure of the inter-speaker distribution will differ slightly for each trial. The denominator of the LLR then corresponds to the intersection of the score of the trial with the probability density function of the inter-speaker score distribution.

³ If more than one known-speaker recording exists, the intra-speaker score distribution is adapted from the default distribution (a Gaussian with a mean of 4 and a standard deviation of 1.3) towards the score results obtained when the multiple known-speaker recordings are compared against each other. This procedure is based on maximum-a-posteriori (MAP) adaptation. The result is also a Gaussian. A (user-adjustable) relevance factor is used to determine what weight for the final intra-speaker score distribution is given to the system-provided distribution versus the distribution derived from comparing the multiple known-speaker recordings. In forensic_eval_01 multiple recordings per known speaker do in fact exist. However, according to the testing protocol, all comparisons were one-to-one. In possible future variations of the protocol it could be tested what impact it would have if multiple recordings of a known speaker can be used to model that speaker or to shift the intra-speaker score distribution towards that speaker.
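To make the mechanics concrete, the following sketch gives one plausible reading of this score-to-LLR mapping. The Gaussian numerator with mean 4 and standard deviation 1.3 is taken from the description above; modelling the inter-speaker scores with a Gaussian fitted to the Reference Population comparisons is our assumption, since NF does not document the exact form of the denominator density.

import numpy as np
from scipy.stats import norm

def score_to_llr(score, ref_pop_scores, intra_mean=4.0, intra_sd=1.3):
    # score: comparison score of the questioned- vs. known-speaker trial
    # ref_pop_scores: scores of the questioned speaker against every
    # Reference Population speaker (the inter-speaker distribution)
    numerator = norm.pdf(score, loc=intra_mean, scale=intra_sd)
    denominator = norm.pdf(score, loc=np.mean(ref_pop_scores),
                           scale=np.std(ref_pop_scores, ddof=1))
    return np.log10(numerator / denominator)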
The Reference Population is a collection of 30 or more speakers with one recording each. According to the manual/help information, the conditions in the Reference Population should correspond to the conditions found in the recording of the known speaker. Explicit mention is made of the known speaker's sex and the language spoken. The Reference Population can be provided by the user. NF also offers a set of pre-loaded Reference Populations. In forensic_eval_01, a Reference Population tailored to the conditions of the case can and will be compiled based on the known-speaker recordings in the training set.

As mentioned above, the purpose of the Reference Population in NF is to enable the calculation of the denominator of the LLR. Another type of data set used in NF is called the "Background Model". The Background Model is used for score normalization. Whereas the selection of a Reference Population by the user is obligatory in NF (no comparisons are made without explicitly selecting one), the user does not have to select a Background Model and can work with the default. The user is, however, given the option of creating a new Background Model corresponding to case conditions. We conducted tests using a Background Model trained from the forensic_eval_01 training data and using the default Background Model (see Section 2.3 for details).
According to information provided by the developers, the principle behind the use of the Background Model is essentially as follows. Score normalization is based on AS-norm (Adaptive S-norm) (Sturim and Reynolds, 2005 on adaptation; Shum et al., 2010 on S-norm). This is a symmetric normalization in which, from the whole set of i-vectors included in the Background Model (it is recommended that a few hundred recordings are provided for the Background Model), one set of top-scoring recordings is selected that is adjusted to the known-speaker recording and another set is selected that is adjusted to the questioned-speaker recording (selection size 100). In other words, from the overall cohort available in the Background Model the "optimal" cohort is selected for each of the two elements of the trial. Based on these optimal cohorts, normalization is applied to both elements of the trial.

In order to allow the system to select the "optimal" recordings, recordings reflecting the conditions of the questioned-speaker recording and recordings reflecting the conditions of the known-speaker recording should both be included in the data used to train the Background Model.

Fig. 2. Tippett plots (no precision) of the results of the tested systems. Left panels show tests using Nuance Forensics 9.2, right panels those using Nuance Forensics 11.1. The first row shows Variant 1, using 105 speakers as Reference Population and the NF default Background Model, the second shows Variant 2, using 42 speakers as Reference Population and the NF default Background Model, and the third shows Variant 3, using 42 speakers as Reference Population and a test-customized Background Model.
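Returning to the normalisation procedure just described, the sketch below follows the standard AS-norm formulation from the literature: for each side of the trial, the 100 highest-scoring Background Model recordings form the adaptive cohort, and the raw score is standardized against both cohorts symmetrically. The details inside NF may differ.

import numpy as np

def as_norm(raw_score, known_vs_cohort, questioned_vs_cohort, top_k=100):
    # known_vs_cohort / questioned_vs_cohort: scores of the known- and
    # questioned-speaker recordings against every Background Model
    # recording; only the top_k most similar cohort members are kept.
    k_top = np.sort(np.asarray(known_vs_cohort))[-top_k:]
    q_top = np.sort(np.asarray(questioned_vs_cohort))[-top_k:]
    z_term = (raw_score - k_top.mean()) / k_top.std(ddof=1)  # Z-norm-like
    t_term = (raw_score - q_top.mean()) / q_top.std(ddof=1)  # T-norm-like
    return 0.5 * (z_term + t_term)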
2.3. Application of NF for forensic_eval_01
Three variations of NF 9.2 and NF 11.1 were tested. These are shown in Table 1. The second column in Table 1 provides information about the recordings in the Reference Population, and the third column provides information about the recordings in the Background Model.

Table 1
Variations of the forensic_eval_01 test.

Variant   Reference Population   Background Model
1         105-speaker set        NF default
2         42-speaker set         NF default
3         42-speaker set         290 recordings
In all three variants the questioned-speaker recordings are from the call centre condition and the known-speaker recordings from the interview condition, which corresponds to the case scenario presented in forensic_eval_01 (Morrison and Enzinger, 2016).

In the first variant the Reference Population consists of one recording from each of the 105 speakers in the training data. All the recordings are from the known-speaker condition. The Background Model remained at the default settings provided by Nuance. The recordings that were used were, for each speaker, the first chronologically available known-speaker-condition recording. Normally this was a recording containing "(1)" in the name of the audio file. Sometimes no (1)-recording was available; in that case the (2)-recording was selected.

Fig. 3. Tippett plots (with precision) of the results of the tested systems.
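Returning to the Variant 1 selection rule above (the "(1)" recording preferred, the "(2)" recording as fallback), a hypothetical sketch follows; the recording objects and their name attribute are illustrative stand-ins for the actual file listing.

def pick_reference_recording(recordings):
    # One known-speaker-condition recording per speaker: prefer the
    # (1)-recording, fall back to the (2)-recording where no (1) exists.
    for tag in ("(1)", "(2)"):
        for rec in recordings:
            if tag in rec.name:
                return rec
    return None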
In the second variant a Reference Population with a lower number of speakers was used. The other settings remained as in Variant 1. The 42 speakers selected were all those from the training set that do not have enough material to carry out a non-contemporaneous speech test. Importantly, the 42 speakers selected according to this criterion are also those with generally the lowest number of recordings: for one speaker there were three, for all the other 41 speakers there were just two recordings per speaker. As will be explained shortly, as many recordings as possible should remain for the Background Model of Variant 3, and the 42 speakers selected here were least useful for that purpose. (For the speakers to be included in the Background Model the training set provided between 3 and 14 recordings per speaker, with one exception of just 2 recordings per speaker.) The 42-speaker set for the Reference Population in Variant 2 is fully contained in the 105-speaker set of Variant 1.
In the third variant a Background Model was compiled from the training data. The other variables are the same as in Variant 2, including the same Reference Population. All the recordings from the training data were used for this purpose, with the following exceptions. First, all recordings from all 42 speakers selected for the Reference Population were excluded. Second, due to the principles of the Background Model described in 2.2 it seemed reasonable to supply it with an equal number of questioned-speaker-condition recordings and known-speaker-condition recordings. Since the known-speaker-condition recordings in the training data slightly outnumber the questioned-speaker-condition recordings, the procedure used here was to select for each speaker an equal number of questioned-speaker-condition recordings and known-speaker-condition recordings. Any remaining files from the training set were excluded. After applying these criteria, 290 files remained from the training set, i.e. 145 from the questioned-speaker condition and the same number from the known-speaker condition. According to the default settings of NF, a minimum of 150 recordings is necessary to build a Background Model. With the 290 recordings selected here, a Background Model can be created, although 290 might still be a low number for the optimal operation of the Background Model. The total number of different speakers contained in this set of 290 recordings is 64.
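The two exclusion rules can be summarized in a short sketch. This is a hypothetical reconstruction of the procedure just described; the grouping of files by speaker and condition is assumed rather than taken from the actual forensic_eval_01 file layout.

from collections import defaultdict

def select_background_model(files, reference_speakers):
    # files: objects with .speaker and .condition ("questioned"/"known")
    by_speaker = defaultdict(lambda: {"questioned": [], "known": []})
    for f in files:
        if f.speaker in reference_speakers:   # rule 1: exclude the 42
            continue                          # Reference Population speakers
        by_speaker[f.speaker][f.condition].append(f)
    selected = []
    for recs in by_speaker.values():
        # rule 2: an equal number of questioned- and known-condition
        # recordings per speaker; surplus recordings are excluded
        n = min(len(recs["questioned"]), len(recs["known"]))
        selected += recs["questioned"][:n] + recs["known"][:n]
    return selected   # yields the 290 files described above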
When comparing the results from Variant 1 with those from Variant 2 it will be possible to study the effect of the size of the Reference Population. Zhang and Tang (2018) have shown in the forensic_eval_01 context that increasing the size of the Reference Population leads to an increase of speaker-discriminatory performance. Although the function of the Reference Population in Batvox 3.1 differs from the one in NF (LR calculation and normalisation in Batvox, just LR calculation in NF), it is interesting to see whether a relatively large increase in Reference Population size from about 40 to about 100 speakers has an effect.

When comparing the results from Variant 2 with those from Variant 3, it will be possible to examine whether a test-customized Background Model changes performance relative to the Nuance-provided default Background Model.

When these three variants are tested with the two versions of NF it can be seen whether the inclusion of DNN technology leads to improvement over a pure i-vector PLDA system.
3. Results

The results for the six tested variants (three variants shown in Table 1, tested with both NF 9.2 and 11.1) are shown in Figs. 1–5 and in Table 2. Fig. 1 provides a graphical representation of some accuracy and precision metrics involving log likelihood ratio costs and credible intervals, specifically Cllr mean plotted against 95% CI as well as Cllr pooled. The exact values for these and additional metrics are provided in Table 2. Fig. 2 shows Tippett plots, Fig. 3 the same Tippett plots with additional precision information, Fig. 4 a Detection Error Trade-off (DET) plot and Fig. 5 Empirical Cross Entropy (ECE) plots.

Fig. 4. Detection Error Trade-off (DET) plot.
Table 2
Exact values of the accuracy and precision metrics.

System    Variant     Cllr pooled   Cllr mean   95% CI   Cllr min   Cllr cal   EER
NF 9.2    Variant 1   0.379         0.359       0.280    0.220      0.159      0.060
          Variant 2   0.371         0.347       0.288    0.219      0.152      0.061
          Variant 3   0.285         0.258       0.336    0.161      0.124      0.047
NF 11.1   Variant 1   0.353         0.342       0.243    0.176      0.177      0.043
          Variant 2   0.340         0.325       0.252    0.161      0.180      0.040
          Variant 3   0.255         0.234       0.309    0.124      0.130      0.031
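For readers relating the LLR outputs to Table 2, the accuracy metric Cllr can be computed from the same-speaker and different-speaker log10 LRs of a test as follows. This is the standard definition of the log-likelihood-ratio cost; the precision metric 95% CI follows the procedure specified in Morrison and Enzinger (2016) and is not sketched here.

import numpy as np

def cllr(llr_same, llr_diff):
    # Log-likelihood-ratio cost from log10 LRs: same-speaker trials are
    # penalized for low LRs, different-speaker trials for high LRs.
    llr_same = np.asarray(llr_same, dtype=float)
    llr_diff = np.asarray(llr_diff, dtype=float)
    penalty_ss = np.log2(1.0 + 10.0 ** (-llr_same))
    penalty_ds = np.log2(1.0 + 10.0 ** (llr_diff))
    return 0.5 * (penalty_ss.mean() + penalty_ds.mean())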
4. Discussion and conclusion

In the interpretation of the results we will look separately at the impact of the different uses of relevant populations (Variants 1 to 3) and the performance differences of the two versions of Nuance Forensics.

Results for the accuracy metric Cllr pooled (and Cllr mean) as well as for the pure discrimination metrics Cllr min and EER (cf. Meuwly et al., 2017) show that, with the exception of EER (Equal Error Rate) in NF 9.2, performance is better in Variant 2 than Variant 1, i.e., when fewer rather than more speakers are included in the Reference Population (42 rather than 105 speakers). This result is contrary to the expectation that performance increases with the size of the Reference Population. The difference is, however, small. Knowing that the Reference Population is saturated beyond about 40 speakers (or even fewer) is good news for forensic voice comparison, because supplying large amounts of case-relevant population data is usually difficult and 40 is still a well-manageable number.
By comparing the results of Variant 3 against those of Variants 1 and 2 it can be determined whether using a Background Model trained from case-relevant data improves performance relative to the system-internal default Background Model. Looking at Cllr pooled, Cllr min and EER in Table 2, as well as the Cllr measures in Fig. 1, it can be seen that for both versions of NF there is clear improvement. This outcome is broadly consistent with the results of van der Vloed (2016). Performing forensic_eval_01 with Batvox 4.1, an i-vector PLDA system, he examined whether compiling an "imposter set" from the training data (questioned-speaker recording condition) leads to better results than working without this set. The imposter set in Batvox (both 3.1 and 4.1) provides Z-norm transformation (van der Vloed, 2016; Zhang and Tang, 2018). Accuracy and discrimination metrics showed that the imposter set was in fact advantageous. In NF, the AS-norm function carried out through the Background Model involves elements of both T-norm and Z-norm; these two types of normalisation cannot be teased apart. (In Batvox, T-norm and Z-norm cannot be teased apart completely either, because T-norm is used along with calibration there.) However, what can at least be inferred across the tests of the i-vector PLDA systems in Batvox and NF is that the Z-norm aspect most probably leads to improvement of accuracy and discrimination when supplied from case-relevant data, which here are the training resources of forensic_eval_01.

Fig. 5. Empirical Cross Entropy (ECE) plots of the results of the tested systems.
Performance differences of the two versions of NF show a clear pattern: for each of the three variants, performance in terms of Cllr pooled, Cllr mean, Cllr min and EER is better with NF 11.1 than with 9.2. This result shows that including information derived from DNNs, specifically combining a DNN senone posterior i-vector system with two i-vector PLDA systems, leads to an improvement over the use of a system based on i-vector PLDA alone.

So far the results have been interpreted for accuracy and discrimination metrics. With respect to the precision metric 95% CI, across all three system variants, the pattern is the opposite of that for accuracy and discrimination: performance in terms of 95% CI is worse (larger numbers) in Variant 3 than in Variant 1 or 2.

Performance in terms of 95% CI is better for NF 11.1 than for NF 9.2. The inclusion of DNN technology therefore appears to result in better performance with respect to all the performance metrics. Further research and casework experience will show whether these are stable trends.
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., Vair, C., 2007. Compensation of nuisance factors for speaker and language recognition. IEEE Trans. Audio Speech Lang. Proc. 15, 1969–1978. http://dx.doi.org/10.1109/TASL.2007.901823

Cumani, S., Brümmer, N., Burget, L., 2013. Pairwise discriminative speaker verification in the i-vector space. IEEE Trans. Audio Speech Lang. Proc. 21, 1217–1227. http://dx.doi.org/10.1109/TASL.2013.2245655

Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Sig. Proc. 28, 357–366. http://dx.doi.org/10.1109/TASSP.1980.1163420

Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Proc. 19, 788–798. http://dx.doi.org/10.1109/TASL.2010.2064307

Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Sig. Proc. 34, 52–59. http://dx.doi.org/10.1109/TASSP.1986.1164788

Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87, 1738–1752. http://doi.org/10.1121/1.399423

Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1714–1718. http://dx.doi.org/10.1109/ICASSP.2014.6853887

Meuwly, D., Ramos, D., Haraksim, R., 2017. A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic Sci. Int. 276, 142–153. http://dx.doi.org/10.1016/j.forsciint.2016.03.048

Morrison, G.S., Enzinger, E., 2016. Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) – introduction. Speech Commun. 85, 119–126. http://dx.doi.org/10.1016/j.specom.2016.07.006

Pelecanos, J., Sridharan, S., 2001. Feature warping for robust speaker verification. In: Proceedings of Odyssey 2001: The Speaker and Language Recognition Workshop, Crete, Greece, pp. 213–218.

Prince, S.J.D., Elder, J.H., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: IEEE 11th International Conference on Computer Vision (ICCV), pp. 1–8. http://dx.doi.org/10.1109/ICCV.2007.4409052

Shum, S., Dehak, N., Dehak, R., Glass, J., 2010. Unsupervised speaker adaptation based on the cosine similarity for text-independent speaker verification. In: Proceedings of Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, pp. 76–82.

Sturim, D.E., Reynolds, D.A., 2005. Speaker adaptive cohort selection for t-norm in text-independent speaker verification. In: 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 741–744. http://dx.doi.org/10.1109/ICASSP.2005.1415220

Van der Vloed, D., 2016. Evaluation of Batvox 4.1 under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01). Speech Commun. 85, 127–130. http://dx.doi.org/10.1016/j.specom.2016.10.001, see also Erratum http://dx.doi.org/10.1016/j.specom.2017.04.005

Zhang, C., Tang, C., 2018. Evaluation of Batvox 3.1 under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01). Speech Commun. 100, 13–17. http://dx.doi.org/10.1016/j.specom.2018.04.008