Article

Evaluation of Phonexia automatic speaker recognition software under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01)

Abstract

As part of the Speech Communication virtual special issue “Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)”, two automatic speaker recognition systems developed by the company Phonexia were tested. The first, named SID (Speaker Identification)-XL3, is an i-vector PLDA system that works with two streams of features: one uses MFCCs in the classical sense, the other uses DNN-Stacked Bottle-Neck features based on correlated spectral-domain features as well as on information from voiced/voiceless detection and fundamental frequency. The second system tested is called SID-BETA4. It uses MFCCs as input features (without deltas and double deltas) and employs a DNN-based speaker embedding architecture. Each of the two systems was tested in two variants. In the first, the system was used without including any domain-specific data, i.e. data from the training set of forensic_eval_01. In the second variant, training-set data were used with a method called 10% FAR calibration. With this method, scores are shifted so that 10% of the scores in the non-target distribution (based on training data) will have LLR > 0 and 90% will have LLR < 0. Results showed that the speaker embedding system SID-BETA4 leads to clear improvement over SID-XL3 in terms of accuracy, discrimination and precision measures. Use of the FAR calibration method turned out to leave precision unaffected but to lead to improvement in discrimination. The accuracy measure, pooled Cllr, improved with the use vs. non-use of FAR calibration in SID-XL3 but not SID-BETA4.
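
The 10% FAR calibration described in the abstract amounts to a single additive shift of the scores, chosen so that the 90th percentile of the non-target (different-speaker) training scores maps to 0. A minimal sketch of that idea in Python; the variable names and the use of a plain empirical quantile are illustrative assumptions, not Phonexia's implementation:

```python
import numpy as np

def far_calibration_shift(non_target_scores, far=0.10):
    """Return the additive shift that leaves `far` of the non-target
    (different-speaker) training scores above 0 after shifting."""
    # The (1 - far) quantile of the non-target scores is moved to 0,
    # so 10% of non-target scores end up with LLR > 0 and 90% with LLR < 0.
    threshold = np.quantile(non_target_scores, 1.0 - far)
    return -threshold

# Example with made-up training scores
rng = np.random.default_rng(0)
non_target = rng.normal(-3.0, 1.5, size=5000)   # different-speaker scores
target = rng.normal(2.0, 1.5, size=500)         # same-speaker scores

shift = far_calibration_shift(non_target, far=0.10)
print(np.mean((non_target + shift) > 0))         # ~0.10 by construction
```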

... An example of a calibration model that would be inappropriate for evidential casework is described in Jessen et al. [29]: The calibration model included shifting the scores so that 10% of the different-source scores had values greater than 0. This may be appropriate in an investigative context in which one requires a 10% false-alarm rate, but, in the context of assessing strength of evidence for presentation in court, unless this accidentally corresponds to the shift that minimizes Cllr (and for the conditions tested in Ref. [29], it did not), this procedure deliberately miscalibrates the output of the system. ...
Article
Full-text available
Forensic-evaluation systems should output likelihood-ratio values that are well calibrated. If they do not, their output will be misleading. Unless a forensic-evaluation system is intrinsically well-calibrated, it should be calibrated using a parsimonious parametric model that is trained using calibration data. The system should then be tested using validation data. Metrics of degree of calibration that are based on the pool-adjacent-violators (PAV) algorithm recalibrate the likelihood-ratio values calculated from the validation data. The PAV algorithm overfits on the validation data because it is both trained and tested on the validation data, and because it is a non-parametric model with weak constraints. For already-calibrated systems, PAV-based ostensive metrics of degree of calibration do not actually measure degree of calibration; they measure sampling variability between the calibration data and the validation data, and overfitting on the validation data. Monte Carlo simulations are used to demonstrate that this is the case. We therefore argue that, in the context of casework, PAV-based metrics are not meaningful metrics of degree of calibration; however, we also argue that, in the context of casework, a metric of degree of calibration is not required.
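
The PAV algorithm discussed in that abstract can be run with scikit-learn's isotonic regression (isotonic regression on 0/1 labels is equivalent to PAV). A minimal sketch of recalibrating validation LLRs this way; the clipping constant and the prior-odds correction based on the proportion of same-source trials are my assumptions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_recalibrate(llrs, labels, eps=1e-6):
    """Map validation log-LRs (natural log assumed) to PAV-recalibrated log-LRs.

    llrs   : log likelihood ratios from the validation trials
    labels : 1 for same-source trials, 0 for different-source trials
    """
    llrs = np.asarray(llrs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    iso = IsotonicRegression(y_min=eps, y_max=1 - eps, out_of_bounds="clip")
    post = iso.fit_transform(llrs, labels)      # monotone fit of P(same-source | llr)
    # Remove the prior odds implied by the proportion of same-source trials
    # to turn posterior odds back into likelihood ratios.
    prior_logodds = np.log(labels.mean() / (1 - labels.mean()))
    return np.log(post / (1 - post)) - prior_logodds
```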
... • Procedures for conducting validation have been developed, along with graphics and metrics for representing the results, e.g., Tippett plots [2] and the log-likelihood-ratio cost (Cllr) [3]. • An increasing number of papers are being published that include empirical validation of forensic-voice-comparison systems under conditions reflecting casework conditions, e.g., [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. ...
... In the context of forensic interpretation, a likelihood ratio provides the answer to a specific two-part question, for example: (a) What is the likelihood of obtaining the observed properties of the voices of interest on the questioned- and known-speaker recordings if they were both produced by the same speaker, a speaker selected at random from the relevant population? versus (b) What is the likelihood of obtaining the observed properties of the voices of interest on the questioned- and known-speaker recordings if they were each produced by a different speaker, each speaker selected at random from the relevant population? ...
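
Spelled out as a formula (the symbols are my shorthand, not the excerpt's): writing E for the observed properties of the questioned- and known-speaker recordings, H_ss for the same-speaker proposition and H_ds for the different-speaker proposition, the likelihood ratio is

```latex
\mathrm{LR} \;=\; \frac{p(E \mid H_{ss})}{p(E \mid H_{ds})}
```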
Article
Since the 1960s, there have been calls for forensic voice comparison to be empirically validated under casework conditions. Since around 2000, there have been an increasing number of researchers and practitioners who conduct forensic-voice-comparison research and casework within the likelihood-ratio framework. In recent years, this community of researchers and practitioners has made substantial progress toward validation under casework conditions becoming a standard part of practice: Procedures for conducting validation have been developed, along with graphics and metrics for representing the results, and an increasing number of papers are being published that include empirical validation of forensic-voice-comparison systems under conditions reflecting casework conditions. An outstanding question, however, is: In the context of a case, given the results of an empirical validation of a forensic-voice-comparison system, how can one decide whether the system is good enough for its output to be used in court? This paper provides a statement of consensus developed in response to this question. Contributors included individuals who had knowledge and experience of validating forensic-voice-comparison systems in research and/or casework contexts, and individuals who had actually presented validation results to courts. They also included individuals who could bring a legal perspective on these matters, and individuals with knowledge and experience of validation in forensic science more broadly. We provide recommendations on what practitioners should do when conducting evaluations and validations, and what they should present to the court. Although our focus is explicitly on forensic voice comparison, we hope that this contribution will be of interest to an audience concerned with validation in forensic science more broadly. Although not written specifically for a legal audience, we hope that this contribution will still be of interest to lawyers.
... The power of MFCCs comes from their capability to model the shape of the vocal tract via the short-time power spectrum. Generally, they are calculated with a psycho-acoustically motivated filter bank, logarithmic compression, and a discrete cosine transform (DCT) (Jessen et al. 2019; Geravanchizadeh et al. 2021). Finally, the 12-15 lowest DCT coefficients are used to represent the mel-frequency cepstral coefficient feature vector. ...
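
The pipeline described in that excerpt (mel filter bank applied to the short-time power spectrum, log compression, DCT, keep the lowest coefficients) can be sketched with librosa and scipy; the frame sizes, sampling rate and the choice of 13 coefficients are illustrative assumptions:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_waveform(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=8000)                 # telephone-band audio assumed
    # Psycho-acoustically motivated (mel) filter bank applied to the
    # short-time power spectrum.
    mel_power = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=26, power=2.0)
    log_mel = np.log(mel_power + 1e-10)                 # logarithmic compression
    # Discrete cosine transform; keep only the lowest coefficients.
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc].T   # (frames, n_mfcc)
```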
Article
Speaker identification is the method of identifying the human voice with the help of artificial intelligence (AI). The technology of speaker identification is broadly utilized in voice recognition, security, surveillance, electronic voice eavesdropping, and identity verification. Existing methods do not provide sufficient accuracy and robustness for the speech signal. To overcome these issues, an efficient speaker identification framework based on a Mask region-based convolutional neural network (Mask R-CNN) classifier, with parameters optimized using Hosted Cuckoo Optimization (HCO), is proposed in this manuscript. The objective of the proposed method is to increase the accuracy and to improve the robustness of the signal. Initially, the input speech signals are taken from a real-time dataset. From the input speech signal, four types of features are extracted: Mel Frequency Differential Power Cepstral Coefficients (MFDPCC), Gammatone Frequency Cepstral Coefficients (GFCC), Power Normalized Cepstral Coefficients (PNCC), and spectral entropy, to improve the robustness of the signal. Then, the speaker ID is classified using the Mask R-CNN classifier, whose parameters are optimized using the HCO algorithm. This method is relevant to real-time applications such as telephone banking and fax mailing. The simulation is executed in MATLAB. The simulation results show that the proposed Mask-R-CNN-HCO method attains accuracy of 24.16%, 32.18%, 28.43%, 36.4%, and 33.26%, sensitivity of 37.68%, 33.80%, 24.16%, 32.18%, and 28.43%, and precision of 35.88%, 24.16%, 32.18%, 28.43%, and 26.77% higher than the existing methods, namely speaker identification using the K-Nearest Neighbors (KNN) algorithm, a multiclass support vector machine (MCSVM), a Gaussian Mixture Model–Convolutional Neural Network (GMM-CNN) classifier, a deep neural network (DNN), and a Gaussian Mixture Model–Deep Neural Network (GMM-DNN) classifier.
... The lower the Cllr value, the better the performance of the system. In terms of Cllr, E³FS³ α performed as well as the best-performing system from the virtual special issue, Phonexia SID-BETA4 [40]. ...
Article
Full-text available
This paper reports on validations of an alpha version of the E³ Forensic Speech Science System (E³FS³) core software tools. This is an open-code human-supervised-automatic forensic-voice-comparison system based on x-vectors extracted using a type of Deep Neural Network (DNN) known as a Residual Network (ResNet). A benchmark validation was conducted using training and test data (forensic_eval_01) that have previously been used to assess the performance of multiple other forensic-voice-comparison systems. Performance equalled that of the best-performing system with previously published results for the forensic_eval_01 test set. The system was then validated using two different populations (male speakers of Australian English and female speakers of Australian English) under conditions reflecting those of a particular case to which it was to be applied. The conditions included three different sets of codecs applied to the questioned-speaker recordings (two mismatched with the set of codecs applied to the known-speaker recordings), and multiple different durations of questioned-speaker recordings. Validations were conducted and reported in accordance with the “Consensus on validation of forensic voice comparison”.
Preprint
Full-text available
This chapter describes a number of signal-processing and statistical-modeling techniques that are commonly used to calculate likelihood ratios in human-supervised automatic approaches to forensic voice comparison. Techniques described include mel-frequency cepstral coefficients (MFCCs) feature extraction, Gaussian mixture model - universal background model (GMM-UBM) systems, i-vector - probabilistic linear discriminant analysis (i-vector PLDA) systems, deep neural network (DNN) based systems (including senone posterior i-vectors, bottleneck features, and embeddings / x-vectors), mismatch compensation, and score-to-likelihood-ratio conversion (aka calibration). Empirical validation of forensic-voice-comparison systems is also covered. The aim of the chapter is to bridge the gap between general introductions to forensic voice comparison and the highly technical automatic-speaker-recognition literature from which the signal-processing and statistical-modeling techniques are mostly drawn. Knowledge of the likelihood-ratio framework for the evaluation of forensic evidence is assumed. It is hoped that the material presented here will be of value to students of forensic voice comparison and to researchers interested in learning about statistical modeling techniques that could potentially also be applied to data from other branches of forensic science.
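
Of the techniques listed in that chapter, the GMM-UBM approach is the easiest to sketch compactly. A rough illustration with scikit-learn; the component count, relevance factor, synthetic data, and the attribute-copying shortcut for building the adapted model are my assumptions (a production system would use far larger models and full MAP adaptation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: pooled background MFCC frames, enrolment frames for one
# speaker, and frames from a test recording (all hypothetical).
rng = np.random.default_rng(0)
background_feats = rng.normal(size=(5000, 13))
enrol_feats = rng.normal(0.3, 1.0, size=(800, 13))
test_feats = rng.normal(0.3, 1.0, size=(400, 13))

# Universal background model (UBM) trained on the pooled background frames.
ubm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
ubm.fit(background_feats)

def map_adapt_means(ubm, feats, relevance=16.0):
    """MAP-adapt only the UBM means towards one speaker's features."""
    post = ubm.predict_proba(feats)          # responsibilities, (frames, components)
    n_k = post.sum(axis=0)                   # zero-order statistics
    f_k = post.T @ feats                     # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm.means_

def gmm_ubm_score(ubm, spk_means, feats):
    """Average per-frame log-likelihood ratio: adapted speaker model vs. UBM."""
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    spk.weights_, spk.means_ = ubm.weights_, spk_means
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_   # covariances unchanged
    return spk.score(feats) - ubm.score(feats)

spk_means = map_adapt_means(ubm, enrol_feats)
print(gmm_ubm_score(ubm, spk_means, test_feats))
```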
Article
Voice recognition and sound classification are hot research areas in the literature and many methods have been presented. Recognizing the ambient environment from voices or acoustics is an important problem; in particular, digital forensics examiners need an automated ambient recognition system that works from acoustic recordings. In this study, a novel automatic ambient recognition method using acoustics is presented. Firstly, an acoustical voice dataset is acquired. These voices are categorized into 8 classes: kindergarten, ferryboat, airport, café, subway, bus, traffic, and walking. Then, a sequential learning method is presented for ambient recognition using acoustical voices. The proposed method consists of dynamic center mirror local binary pattern (DCMLBP) and discrete wavelet transform (DWT) feature extraction, neighborhood component analysis (NCA) based feature selection, and classification phases. Using DWT, a sequential learning method is proposed, and the proposed feature extraction method has nine levels. Experiments clearly show that the proposed DCMLBP-based method achieves high classification accuracy, precision, geometric mean, and F-score for ambient recognition. According to the results, the best accuracy rate was calculated as 99.97% ± 0.07% using a support vector machine and 128 features.
Article
This conclusion to the virtual special issue (VSI) “Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)” provides a brief summary of the papers included in the VSI, observations based on the results, and reflections on the aims and process. It also includes errata and acknowledgments.
Conference Paper
Full-text available
A new method named Null-Hypothesis LLR (H0LLR) is proposed for forensic automatic speaker recognition. The method takes into account the fact that forensically realistic data are difficult to collect and that inter-individual variation is generally better represented than intra-individual variation. According to the proposal, intra-individual variation is modeled as a projection from case-customized inter-individual variation. Calibrated log Likelihood Ratios (LLR) that are calculated on the basis of the H0LLR method were tested on two corpora of forensically-founded telephone interception test sets, German-based GFS 2.0 and Dutch-based NFI-FRITS. Five automatic speaker recognition systems were tested based on the scores or the LLRs provided by these systems which form the input to H0LLR. Speaker-discrimination and calibration performance of H0LLR is comparable to the performance indices of the system-internal LLR calculation methods. This shows that external data and strategies that work with data outside the forensic domain and without case customization are not necessary. It is also shown that H0LLR leads to a reduction in the diversity of LLR output patterns of different automatic systems. This is important for the credibility of the Likelihood Ratio framework in forensics, and its application in forensic automatic speaker recognition in particular.
Article
Full-text available
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
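
A compact sketch of the scoring chain described in that abstract (LDA followed by WCCN in the total variability space, then cosine similarity between i-vectors); the dimensionalities, the ridge term added to the within-class covariance, and the synthetic data are my assumptions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_lda_wccn(ivectors, labels, lda_dim, ridge=1e-6):
    """Fit LDA on labelled background i-vectors, then WCCN in the LDA space."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    z = lda.fit_transform(ivectors, labels)
    dim = z.shape[1]
    w = np.zeros((dim, dim))
    speakers = np.unique(labels)
    for spk in speakers:                      # average within-speaker covariance
        d = z[labels == spk] - z[labels == spk].mean(axis=0)
        w += d.T @ d / len(d)
    w /= len(speakers)
    # WCCN projection B with inv(W) = B @ B.T (lower-triangular Cholesky factor).
    b = np.linalg.cholesky(np.linalg.inv(w + ridge * np.eye(dim)))
    return lda, b

def cosine_score(lda, b, ivec_a, ivec_b):
    """Cosine similarity between two i-vectors after LDA + WCCN projection."""
    a = lda.transform(ivec_a.reshape(1, -1)) @ b
    c = lda.transform(ivec_b.reshape(1, -1)) @ b
    return float((a @ c.T) / (np.linalg.norm(a) * np.linalg.norm(c)))

# Example with synthetic "i-vectors" from 50 background speakers.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(50), 20)
ivecs = rng.normal(size=(1000, 100)) + 0.5 * rng.normal(size=(50, 100))[labels]
lda, b = train_lda_wccn(ivecs, labels, lda_dim=40)
print(cosine_score(lda, b, ivecs[0], ivecs[1]))
```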
Article
There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can be highly variable, but it is common to have relatively poor recording conditions and mismatches in speaking style and recording conditions between the known- and questioned-speaker recordings. In order to validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. The present paper is an introduction to a Virtual Special Issue consisting of papers reporting on the results of testing forensic voice comparison systems under conditions reflecting those of an actual forensic voice comparison case. A set of training and test data representative of the relevant population and reflecting the conditions of this particular case has been released, and operational and research laboratories are invited to use these data to train and test their systems. The present paper includes the rules for the evaluation and a description of the evaluation metrics and graphics to be used. The name of the evaluation is: forensic_eval_01
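
One of the evaluation metrics used in forensic_eval_01 is the log-likelihood-ratio cost (Cllr), which can be computed directly from the validated LLR values. A minimal sketch; natural-log input is my assumption, convert as needed:

```python
import numpy as np

def cllr(llr_same_source, llr_diff_source):
    """Log-likelihood-ratio cost computed from validation-trial log-LRs.

    llr_same_source : natural-log LRs from same-source trials
    llr_diff_source : natural-log LRs from different-source trials
    """
    lr_ss = np.exp(np.asarray(llr_same_source, dtype=float))
    lr_ds = np.exp(np.asarray(llr_diff_source, dtype=float))
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_ss)) +
                  np.mean(np.log2(1.0 + lr_ds)))

# A well-performing system pushes same-source LLRs positive and
# different-source LLRs negative, driving Cllr well below 1.
print(cllr([2.0, 3.5, 1.2], [-2.5, -4.0, -1.0]))
```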
Article
The paper describes the Brno University of Technology (BUT) ASR system for the 2014 BABEL surprise language evaluation (Tamil). While largely based on our previous work, it includes two original contributions: (1) speaker-adapted bottle-neck neural network (BN) features were investigated as input to the DNN recognizer, and semi-supervised training was found effective; (2) adding noise to the training data outperformed a classical de-noising technique and was found beneficial when dealing with noisy test data, and the performance of this approach was verified on a relatively clean training/test data setup from a different language. All results are reported on BABEL 2014 Tamil data.
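
The noise-addition strategy mentioned in that abstract (corrupting the training data rather than de-noising the test data) can be sketched as additive mixing at a target signal-to-noise ratio; the SNR value, random cropping, and synthetic signals are my illustrative choices, not BUT's recipe:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add a randomly-cropped noise segment to a speech signal at snr_db."""
    if rng is None:
        rng = np.random.default_rng()
    # Tile the noise if it is shorter than the speech, then crop a random segment.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    # Scale the noise so that the mixture has the requested SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: corrupt a synthetic tone with white noise at 10 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)
noisy = mix_at_snr(speech, rng.normal(size=4000), snr_db=10, rng=rng)
```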
Article
Logistic-regression calibration and fusion are potential steps in the calculation of forensic likelihood ratios. The present paper provides a tutorial on logistic-regression calibration and fusion at a practical conceptual level with minimal mathematical complexity. A score is log-likelihood-ratio like in that it indicates the degree of similarity of a pair of samples while taking into consideration their typicality with respect to a model of the relevant population. A higher-valued score provides more support for the same-origin hypothesis over the different-origin hypothesis than does a lower-valued score; however, the absolute values of scores are not interpretable as log likelihood ratios. Logistic-regression calibration is a procedure for converting scores to log likelihood ratios, and logistic-regression fusion is a procedure for converting parallel sets of scores from multiple forensic-comparison systems to log likelihood ratios. Logistic-regression calibration and fusion were developed for automatic speaker recognition and are popular in forensic voice comparison. They can also be applied in other branches of forensic science; a fingerprint/finger-mark example is provided.
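
A minimal sketch of logistic-regression calibration as described in that abstract: fit a logistic regression from scores to same/different-origin labels, then remove the prior log odds implied by the proportions of training trials so that the output is a log likelihood ratio. The use of scikit-learn, the weak regularisation setting, and the synthetic score distributions are my assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_score_to_llr(scores_ss, scores_ds):
    """Train a calibration mapping from raw scores to natural-log LRs."""
    scores = np.concatenate([scores_ss, scores_ds]).reshape(-1, 1)
    labels = np.concatenate([np.ones(len(scores_ss)), np.zeros(len(scores_ds))])
    clf = LogisticRegression(C=1e6)          # essentially unregularised
    clf.fit(scores, labels)
    prior_logodds = np.log(len(scores_ss) / len(scores_ds))
    a, b = clf.coef_[0, 0], clf.intercept_[0]
    # Fitted log posterior odds = a*score + b; subtracting the training
    # prior log odds gives the log likelihood ratio.
    return lambda s: a * np.asarray(s, dtype=float) + b - prior_logodds

# Example with synthetic same-/different-origin score distributions.
rng = np.random.default_rng(0)
to_llr = fit_score_to_llr(rng.normal(2, 1, 400), rng.normal(-2, 1, 4000))
print(to_llr([3.0, 0.0, -3.0]))
```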
Article
Auckenthaler, Roland, Carey, Michael, and Lloyd-Thomas, Harvey, Score Normalization for Text-Independent Speaker Verification Systems, Digital Signal Processing 10 (2000), 42–54. This paper discusses several aspects of score normalization for text-independent speaker verification. The theory of score normalization is explained using Bayes' theorem and detection error trade-off plots. Based on the theory, the world, cohort, and zero normalization techniques are explained. A novel normalization technique, test normalization, is introduced. Experiments showed significant improvements for this new technique compared to the standard techniques. Finally, there is a discussion of the use of additional knowledge to further improve the normalization methods. Here, the test normalization method is extended to use knowledge of the handset type.
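
The normalization techniques discussed in that paper all standardise a raw score against an impostor score distribution. A minimal sketch of Z-norm and T-norm; the simplified cohort handling is my own illustration:

```python
import numpy as np

def z_norm(raw_score, model_vs_impostor_scores):
    """Zero normalisation: standardise against scores of the claimed speaker
    model evaluated on a set of impostor utterances (computed offline per model)."""
    mu = np.mean(model_vs_impostor_scores)
    sigma = np.std(model_vs_impostor_scores) + 1e-12
    return (raw_score - mu) / sigma

def t_norm(raw_score, test_vs_cohort_scores):
    """Test normalisation: standardise against scores of the test utterance
    evaluated on a cohort of other speaker models (computed at test time)."""
    mu = np.mean(test_vs_cohort_scores)
    sigma = np.std(test_vs_cohort_scores) + 1e-12
    return (raw_score - mu) / sigma
```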
This paper proposes a new isolated word recognition technique based on a combination of instantaneous and dynamic features of the speech spectrum. This technique is shown to be highly effective in speaker-independent speech recognition. Spoken utterances are represented by time sequences of cepstrum coefficients and energy. Regression coefficients for these time functions are extracted for every frame over an approximately 50 ms period. Time functions of regression coefficients extracted for cepstrum and energy are combined with time functions of the original cepstrum coefficients, and used with a staggered array DP matching algorithm to compare multiple templates and input speech. Speaker-independent isolated word recognition experiments using a vocabulary of 100 Japanese city names indicate that a recognition error rate of 2.4 percent can be obtained with this method. Using only the original cepstrum coefficients the error rate is 6.2 percent.
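
The regression ("delta") coefficients described in that abstract are slopes fitted over a window of cepstral frames. A minimal sketch; the half-window of 2 frames is chosen to roughly match the ~50 ms span mentioned, assuming a 10 ms frame shift:

```python
import numpy as np

def delta_features(cepstra, k=2):
    """Regression coefficients of cepstral (or energy) trajectories.

    cepstra : array of shape (n_frames, n_coeffs)
    k       : half-window; k=2 with a 10 ms frame shift spans roughly 50 ms.
    """
    padded = np.pad(cepstra, ((k, k), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, k + 1))
    deltas = np.zeros_like(cepstra, dtype=float)
    for i in range(1, k + 1):
        deltas += i * (padded[k + i : k + i + len(cepstra)]
                       - padded[k - i : k - i + len(cepstra)])
    return deltas / denom

# Dynamic features are typically appended to the static cepstra:
# combined = np.concatenate([cepstra, delta_features(cepstra)], axis=1)
```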
Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.