Article

Evaluation of Phonexia automatic speaker recognition software under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01)

Abstract

As part of the Speech Communication virtual special issue “Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)”, two automatic speaker recognition systems developed by the company Phonexia were tested. The first, named SID (Speaker Identification)-XL3, is an i-vector PLDA system that works with two streams of features: one uses MFCCs in the classical sense; the other uses DNN Stacked Bottle-Neck features based on correlated spectral-domain features as well as on information from voiced/voiceless detection and fundamental frequency. The second system tested is called SID-BETA4. It uses MFCCs as input features (without deltas and double deltas) and employs a DNN-based speaker-embedding architecture. Each of the two systems was tested in two variants. In the first, the system was used without including any domain-specific data, i.e. data from the training set of forensic_eval_01. In the second variant, training-set data were used with a method called 10% FAR calibration. With this method, scores are shifted such that 10% of the scores in the non-target distribution (based on training data) have LLR > 0 and 90% have LLR < 0. Results showed that the speaker-embedding system SID-BETA4 leads to a clear improvement over SID-XL3 in terms of accuracy, discrimination and precision measures. Use of the FAR calibration method turned out to leave precision unaffected but led to improvement in discrimination. The accuracy measure Cllr (pooled) improved with use of FAR calibration in SID-XL3 but not in SID-BETA4.
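The 10% FAR shift described in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not Phonexia code: the non-target scores below are synthetic, and the shift is simply minus the 90th percentile of the non-target score distribution, which places 10% of those scores above LLR = 0.

```python
import numpy as np

def far_shift(nontarget_scores, far=0.10):
    """Offset that, when added to all scores, places a fraction `far`
    of the non-target (different-speaker) scores above zero.

    Shifting every score by minus the (1 - far) quantile of the
    non-target distribution puts that quantile at LLR = 0, so the top
    `far` fraction of non-target scores end up with LLR > 0."""
    return -np.quantile(nontarget_scores, 1.0 - far)

# Hypothetical training-set non-target scores
rng = np.random.default_rng(0)
nontarget = rng.normal(loc=-5.0, scale=2.0, size=10_000)

shift = far_shift(nontarget, far=0.10)
calibrated = nontarget + shift
print(f"fraction of non-target LLRs > 0: {np.mean(calibrated > 0):.3f}")
```

The same shift would then be applied to the evidential scores; as the later excerpts note, this targets a fixed false-alarm rate rather than minimising Cllr.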

... In the empirical research reported in the present paper, we conduct a series of experiments in which lay listeners are asked to make same-speaker/different-speaker judgements on pairs of recordings that reflect the conditions of an actual forensic case. The pairs of recordings are a subset of those from the forensic_eval_01 dataset [33], which has previously been used to perform benchmark validations of multiple forensic-voice-comparison systems [34][35][36][37][38][39][40]. The language and accent spoken on these recordings is Australian English. ...
... Bllr is calculated using Equation (3). If the Bllr value is greater than 0, then, relative to the forensic-voice-comparison system, the human listener's responses are biased toward giving larger likelihood-ratio response values (biased in favour of the same-speaker hypothesis), and if the Bllr value is less than 0, then, relative to the forensic-voice-comparison system, the human listener's responses are biased toward giving smaller likelihood-ratio response values (biased in favour of the different-speaker hypothesis). A Bllr value of +1 would indicate that, on average, the listener's likelihood-ratio responses are twice as large as those of the forensic-voice-comparison system, a Bllr value of +2 that they are four times as large, a Bllr value of +3 that they are eight times as large, etc. ...
... Dllr and Bllr are not costs measured in bits. Note that Equation (2) and Equation (3) ...
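Since Bllr is a mean difference of base-2 log likelihood ratios, the mapping to the multiplicative factors described in the excerpt above is a one-liner:

```python
# B_llr is a mean difference of base-2 log-likelihood-ratios, so a
# value of b corresponds to listener LR responses that are, on
# average, 2**b times those of the forensic-voice-comparison system.
def bllr_factor(bllr: float) -> float:
    return 2.0 ** bllr

for b in (1, 2, 3, -1):
    print(b, bllr_factor(b))
# +1 -> 2x, +2 -> 4x, +3 -> 8x, -1 -> 0.5x
```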
Article
Full-text available
Expert testimony is only admissible in common law if it will potentially assist the trier of fact to make a decision that they would not be able to make unaided. The present paper addresses the question of whether speaker identification by an individual lay listener (such as a judge) would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. Listeners listen to and make probabilistic judgements on pairs of recordings reflecting the conditions of the questioned- and known-speaker recordings in an actual case. Reflecting different courtroom contexts, listeners with different language backgrounds are tested: Some are familiar with the language and accent spoken, some are familiar with the language but less familiar with the accent, and others are less familiar with the language. Also reflecting different courtroom contexts: In one condition listeners make judgements based only on listening, and in another condition listeners make judgements based on both listening to the recordings and considering the likelihood-ratio values output by the forensic-voice-comparison system.
... An example of a calibration model that would be inappropriate for evidential casework is described in Jessen et al. [29]: The calibration model included shifting the scores so that 10% of the different-source scores had values greater than 0. This may be appropriate in an investigative context in which one requires a 10% false-alarm rate, but, in the context of assessing strength of evidence for presentation in court, unless this accidentally corresponds to the shift that minimizes Cllr (and for the conditions tested in Ref. [29], it did not), this procedure deliberately miscalibrates the output of the system. ...
Article
Full-text available
Forensic-evaluation systems should output likelihood-ratio values that are well calibrated. If they do not, their output will be misleading. Unless a forensic-evaluation system is intrinsically well-calibrated, it should be calibrated using a parsimonious parametric model that is trained using calibration data. The system should then be tested using validation data. Metrics of degree of calibration that are based on the pool-adjacent-violators (PAV) algorithm recalibrate the likelihood-ratio values calculated from the validation data. The PAV algorithm overfits on the validation data because it is both trained and tested on the validation data, and because it is a non-parametric model with weak constraints. For already-calibrated systems, PAV-based ostensive metrics of degree of calibration do not actually measure degree of calibration; they measure sampling variability between the calibration data and the validation data, and overfitting on the validation data. Monte Carlo simulations are used to demonstrate that this is the case. We therefore argue that, in the context of casework, PAV-based metrics are not meaningful metrics of degree of calibration; however, we also argue that, in the context of casework, a metric of degree of calibration is not required.
... • Procedures for conducting validation have been developed, along with graphics and metrics for representing the results, e.g., Tippett plots [2] and the log-likelihood-ratio cost (Cllr) [3]. • An increasing number of papers are being published that include empirical validation of forensic-voice-comparison systems under conditions reflecting casework conditions, e.g., [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. ...
... In the context of forensic interpretation, a likelihood ratio provides the answer to a specific two-part question, for example: (a) What is the likelihood of obtaining the observed properties of the voices of interest on the questioned- and known-speaker recordings if they were both produced by the same speaker, a speaker selected at random from the relevant population? versus (b) What is the likelihood of obtaining the observed properties of the voices of interest on the questioned- and known-speaker recordings if they were each produced by a different speaker, each speaker selected at random from the relevant population? ...
Article
Since the 1960s, there have been calls for forensic voice comparison to be empirically validated under casework conditions. Since around 2000, there have been an increasing number of researchers and practitioners who conduct forensic-voice-comparison research and casework within the likelihood-ratio framework. In recent years, this community of researchers and practitioners has made substantial progress toward validation under casework conditions becoming a standard part of practice: Procedures for conducting validation have been developed, along with graphics and metrics for representing the results, and an increasing number of papers are being published that include empirical validation of forensic-voice-comparison systems under conditions reflecting casework conditions. An outstanding question, however, is: In the context of a case, given the results of an empirical validation of a forensic-voice-comparison system, how can one decide whether the system is good enough for its output to be used in court? This paper provides a statement of consensus developed in response to this question. Contributors included individuals who had knowledge and experience of validating forensic-voice-comparison systems in research and/or casework contexts, and individuals who had actually presented validation results to courts. They also included individuals who could bring a legal perspective on these matters, and individuals with knowledge and experience of validation in forensic science more broadly. We provide recommendations on what practitioners should do when conducting evaluations and validations, and what they should present to the court. Although our focus is explicitly on forensic voice comparison, we hope that this contribution will be of interest to an audience concerned with validation in forensic science more broadly. Although not written specifically for a legal audience, we hope that this contribution will still be of interest to lawyers.
... The system generates x-vector speaker models for the 'evidential' comparison samples based on MFCC input. Scores for each comparison are then calculated using PLDA [9]. ...
Conference Paper
Full-text available
Validation of forensic voice comparison methods requires testing using speech samples that are representative of forensic casework conditions. Increasingly, around the world, forensic voice comparison casework is being undertaken using automatic speaker recognition (ASR) systems. However, multilingualism remains a key issue in applying automatic systems to forensic casework. This research aims to consider the effect of language on ASR performance, testing developers’ claims of ‘language independency’. Specifically, we examine the extent to which language mismatch, either between the known and questioned samples or between the evidential samples and the calibration data, affects overall system performance and the resulting strength of evidence (i.e., likelihood ratios for individual comparisons). Results indicate that mixed-language trials produce more errors than single-language trials, which makes drawing evidential conclusions based on bilingual data challenging.
... For phoneticians, formants have been one of the key acoustic features: the 2011 international survey on forensic voice comparison practitioners (across 15 countries, 36 respondents) reported that 97% conducted some form of formant analysis [10]. While the field is increasingly shifting to techniques based on automatic speaker recognition [11][12][13][14][15][16][17], the capacity of formants to link acoustic information to vocal tract configurations is still attractive, as it enables incorporation of linguistic knowledge into interpretation and analysis. It is impossible to gain a full understanding of how, and to what extent, various linguistic and non-linguistic factors affect speech acoustics. ...
Conference Paper
Full-text available
This study presents a large scale, well-controlled examination of the effects of mobile phone transmission on the first four formants. We used 306 Japanese male speakers recorded simultaneously with a direct microphone and via a mobile phone network. We found that all four formants were significantly impacted by the mobile phone transmission. Further, we found the impact of mobile transmission was largely unpredictable, and the impacts appear to vary speaker to speaker. This may have significant implications in some application areas, such as forensic phonetics, and also for data collection of speech recorded over phones in general.
... The power of MFCCs comes from their capability to model the shape of the vocal tract via the short-time power spectrum. Generally, they are calculated with a psycho-acoustically motivated filter bank, followed by logarithmic compression and a discrete cosine transform (DCT) (Jessen et al. 2019; Geravanchizadeh et al. 2021). Finally, the 12-15 lowest DCT coefficients are used to represent the mel-frequency cepstral coefficient feature vector. ...
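The pipeline that excerpt describes (mel filter bank, log compression, DCT, keep the lowest coefficients) can be sketched for a single frame. This is a textbook-style reconstruction with hypothetical parameter choices (26 filters, 13 coefficients, 16 kHz), not the feature extractor of any system discussed here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft        # power spectrum
    mel_energy = mel_filterbank(n_filters, n_fft, sr) @ power
    log_energy = np.log(mel_energy + 1e-10)                # log compression
    # DCT-II of the log filterbank energies; keep the lowest n_ceps
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return basis @ log_energy

# 25 ms frame of a synthetic 200 Hz tone at 16 kHz (hypothetical input)
sr = 16000
t = np.arange(400) / sr
ceps = mfcc_frame(np.sin(2 * np.pi * 200 * t) * np.hamming(400), sr)
print(ceps.shape)  # (13,)
```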
Article
Full-text available
Speaker identification is the identification of a human voice with the help of artificial-intelligence (AI) methods. The technology of speaker identification is broadly utilized in voice recognition, security, surveillance, electronic voice eavesdropping, and the verification of identity. Existing methods do not provide sufficient accuracy and robustness for the speech signal. To overcome these issues, an efficient speaker-identification framework based on a Mask region-based convolutional neural network (Mask R-CNN) classifier, with parameters optimized using Hosted Cuckoo Optimization (HCO), is proposed in this manuscript. The objective of the proposed method is to increase the accuracy and to improve the robustness of the signal. Initially, the input speech signals are taken from a real-time dataset. Four types of features are extracted from the input speech signal to improve the robustness of the signal: Mel Frequency Differential Power Cepstral Coefficients (MFDPCC), Gammatone Frequency Cepstral Coefficients (GFCC), Power Normalized Cepstral Coefficients (PNCC) and spectral entropy. Then, the speaker ID is classified using the Mask R-CNN classifier, whose parameters are optimized using the HCO algorithm. The method is relevant to real-time applications such as telephone banking and fax mailing. The simulation is executed in MATLAB.
The simulation results show that the proposed Mask-R-CNN-HCO method attains accuracy of 24.16%, 32.18%, 28.43%, 36.4%, 33.26%, sensitivity of 37.68%, 33.80%, 24.16%, 32.18%, 28.43%, and precision of 35.88%, 24.16%, 32.18%, 28.43%, 26.77% higher than the existing methods, namely classification of speaker identification using the K-Nearest Neighbors algorithm (KNN), a multiclass support vector machine (MCSVM), a Gaussian Mixture Model–Convolutional Neural Network (GMMCNN) classifier, a Deep Neural Network (DNN), and a Gaussian Mixture Model–Deep Neural Network (GMMDNN) classifier.
... The lower the Cllr value, the better the performance of the system. In terms of Cllr, the E³FS³ alpha system performed equally as well as the best-performing system from the virtual special issue, Phonexia SID-BETA4 [40]. ...
Article
Full-text available
This paper reports on validations of an alpha version of the E³ Forensic Speech Science System (E³FS³) core software tools. This is an open-code human-supervised-automatic forensic-voice-comparison system based on x-vectors extracted using a type of Deep Neural Network (DNN) known as a Residual Network (ResNet). A benchmark validation was conducted using training and test data (forensic_eval_01) that have previously been used to assess the performance of multiple other forensic-voice-comparison systems. Performance equalled that of the best-performing system with previously published results for the forensic_eval_01 test set. The system was then validated using two different populations (male speakers of Australian English and female speakers of Australian English) under conditions reflecting those of a particular case to which it was to be applied. The conditions included three different sets of codecs applied to the questioned-speaker recordings (two mismatched with the set of codecs applied to the known-speaker recordings), and multiple different durations of questioned-speaker recordings. Validations were conducted and reported in accordance with the “Consensus on validation of forensic voice comparison”.
Article
Full-text available
In forensic comparison sciences, experts are required to compare samples of known and unknown origin to evaluate the strength of the evidence assuming they came from the same and different sources. The application of valid (if the method measures what it is intended to) and reliable (if that method produces consistent results) forensic methods is required across many jurisdictions, such as the England & Wales Criminal Practice Directions 19A and UK Crown Prosecution Service, and highlighted in the 2009 National Academy of Sciences report and by the President’s Council of Advisors on Science and Technology in 2016. The current study uses simulation to examine the effect of the number of speakers and sampling variability on the evaluation of validity and reliability using different generations of automatic speaker recognition (ASR) systems in forensic voice comparison (FVC). The results show that the state-of-the-art system had better overall validity compared with less advanced systems. However, better validity does not necessarily lead to high reliability, and very often the opposite is true. Better system validity and higher discriminability have the potential of leading to a higher degree of uncertainty and inconsistency in the output (i.e. poorer reliability). This is particularly the case when dealing with a small number of speakers, where the observed data do not adequately support density estimation, resulting in extrapolation, as is commonly expected in FVC casework.
Article
Full-text available
The informativeness of the parameters of a spectral model of the voice source is investigated for the task of automatic speaker recognition by voice. For the voice-source parameters, the recognition error was 20.8%; joint use of these parameters with the fundamental-frequency period reduced the error to 13.8%. Finally, joint use of the spectral-model parameters with the fundamental-frequency period and mel-frequency cepstral coefficients provided the highest accuracy (a recognition error of 1.2%).
Article
There is increasing support for reporting evidential strength as a likelihood ratio (LR) and increasing interest in (semi-)automated LR systems. The log-likelihood-ratio cost (Cllr) is a popular metric for such systems, penalizing misleading LRs more heavily the further they are from 1. Cllr = 0 indicates perfection, while Cllr = 1 indicates an uninformative system. However, beyond this, what constitutes a “good” Cllr is unclear. Aiming to provide handles on when a Cllr is “good”, we studied 136 publications on (semi-)automated LR systems. Results show that Cllr use depends heavily on the field, e.g., being absent in DNA analysis. Despite more publications on automated LR systems over time, the proportion reporting Cllr remains stable. Noticeably, Cllr values lack clear patterns and depend on the area, analysis and dataset. As LR systems become more prevalent, comparing them becomes crucial. This is hampered by different studies using different datasets. We advocate using public benchmark datasets to advance the field.
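For reference, the Cllr metric discussed above follows Brümmer and du Preez's formulation; a minimal sketch, assuming natural-log LRs as input:

```python
import numpy as np

def cllr(llrs_same, llrs_diff):
    """Log-likelihood-ratio cost.

    llrs_same: natural-log LRs from same-source comparisons
    llrs_diff: natural-log LRs from different-source comparisons
    0 = perfect; 1 = an uninformative system; misleading LRs far
    from 1 are penalised more heavily."""
    lr_ss = np.exp(np.asarray(llrs_same, dtype=float))
    lr_ds = np.exp(np.asarray(llrs_diff, dtype=float))
    return 0.5 * (np.mean(np.log2(1 + 1 / lr_ss)) +
                  np.mean(np.log2(1 + lr_ds)))

# An uninformative system (all LRs = 1, i.e. log-LR = 0) gives Cllr = 1
print(cllr([0.0, 0.0], [0.0, 0.0]))  # 1.0
```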
Article
Full-text available
The information content of the parameters of a spectral voice source model in an automatic voice identity recognition problem is studied. For the voice parameters, the identity recognition error was 20.8%; using these parameters together with the pitch period reduced the error to 13.8%. Lastly, the combined use of the spectral model parameters with the pitch period and mel-frequency cepstral coefficients provided the highest accuracy (the recognition error was 1.2%).
Chapter
The human-supervised-automatic analytical approach to forensic voice comparison in conjunction with the likelihood-ratio interpretive framework is described. Practitioner tasks are described, including adoption of relevant hypotheses for the case, assessment of the conditions of the questioned-speaker and known-speaker recordings in the case, and selection of data representing the relevant population and reflecting the conditions for the case. Software tools are also described. An example is provided of a forensic-voice-comparison system based on state-of-the-art automatic-speaker-recognition technology. Also described are the calibration and validation of that system using a benchmark dataset reflecting the conditions of a real forensic case.
Article
This paper demonstrates the potential of the sub-band parametric cepstral distance (PCD) formulated by Clermont and Mokhtari (1994), as an alternative to formants in acoustic phonetic research. As a cepstrum-based measure, the PCD is automatically and reliably extracted from the speech signal. By contrast, formants are time-consuming and often difficult to estimate, a well-known bottleneck for studies based on large-scale datasets. The PCD measure gives flexibility in selecting the frequency limits of any sub-band of interest within the available full band. We suggest that, if sub-band selection were guided by the acoustic–phonetic theory of speech production, PCD analysis could facilitate phonetically meaningful cepstral comparisons without relying directly on formants. We evaluate this idea by exploiting the PCD properties in the context of forensic voice comparison as an application example. The cepstral data were obtained from the vowels uttered by 306 male Japanese speakers. Similar patterns of results were observed using formants and sub-band PCDs, the latter yielding better performance. This suggests that sub-band PCDs are able to capture the spectral characteristics that we normally quantify through formants, but with better reliability and efficiency. The PCD results reported here are encouraging for other types of acoustic phonetic studies in which comparisons of spectral characteristics are required.
Preprint
Full-text available
This chapter describes a number of signal-processing and statistical-modeling techniques that are commonly used to calculate likelihood ratios in human-supervised automatic approaches to forensic voice comparison. Techniques described include mel-frequency cepstral coefficients (MFCCs) feature extraction, Gaussian mixture model - universal background model (GMM-UBM) systems, i-vector - probabilistic linear discriminant analysis (i-vector PLDA) systems, deep neural network (DNN) based systems (including senone posterior i-vectors, bottleneck features, and embeddings / x-vectors), mismatch compensation, and score-to-likelihood-ratio conversion (aka calibration). Empirical validation of forensic-voice-comparison systems is also covered. The aim of the chapter is to bridge the gap between general introductions to forensic voice comparison and the highly technical automatic-speaker-recognition literature from which the signal-processing and statistical-modeling techniques are mostly drawn. Knowledge of the likelihood-ratio framework for the evaluation of forensic evidence is assumed. It is hoped that the material presented here will be of value to students of forensic voice comparison and to researchers interested in learning about statistical modeling techniques that could potentially also be applied to data from other branches of forensic science.
Article
Voice recognition and sound classification are hot-topic research areas in the literature, and many methods have been presented. Ambient recognition from voices or acoustics is an important problem; in particular, digital forensics examiners need an automated ambient-recognition system based on acoustic signals. In this study, a novel automatic ambient-recognition method is presented. Firstly, an acoustic voice dataset is acquired. These voices are categorized into 8 classes: kindergarten, ferryboat, airport, café, subway, bus, traffic and walking. Then, a sequential learning method is presented for ambient recognition using acoustic voices. The proposed method consists of dynamic center mirror local binary pattern (DCMLBP) and discrete wavelet transform (DWT) feature extraction, neighborhood component analysis (NCA) based feature selection, and classification phases. Using the DWT, a sequential learning method is proposed, and the proposed feature-extraction method has nine levels. Experiments clearly show that the proposed DCMLBP-based method achieves high classification accuracy, precision, geometric mean and F-score for ambient recognition. According to the results, the best accuracy rate was calculated as 99.97% ± 0.07% using a support vector machine and 128 features.
Article
This conclusion to the virtual special issue (VSI) “Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)” provides a brief summary of the papers included in the VSI, observations based on the results, and reflections on the aims and process. It also includes errata and acknowledgments.
Conference Paper
Full-text available
A new method named Null-Hypothesis LLR (H0LLR) is proposed for forensic automatic speaker recognition. The method takes into account the fact that forensically realistic data are difficult to collect and that inter-individual variation is generally better represented than intra-individual variation. According to the proposal, intra-individual variation is modeled as a projection from case-customized inter-individual variation. Calibrated log Likelihood Ratios (LLR) that are calculated on the basis of the H0LLR method were tested on two corpora of forensically-founded telephone interception test sets, German-based GFS 2.0 and Dutch-based NFI-FRITS. Five automatic speaker recognition systems were tested based on the scores or the LLRs provided by these systems which form the input to H0LLR. Speaker-discrimination and calibration performance of H0LLR is comparable to the performance indices of the system-internal LLR calculation methods. This shows that external data and strategies that work with data outside the forensic domain and without case customization are not necessary. It is also shown that H0LLR leads to a reduction in the diversity of LLR output patterns of different automatic systems. This is important for the credibility of the Likelihood Ratio framework in forensics, and its application in forensic automatic speaker recognition in particular.
Article
Full-text available
The paper describes the Brno University of Technology (BUT) ASR system for the 2014 BABEL Surprise language evaluation (Tamil). While largely based on our previous work, two original contributions were made: (1) speaker-adapted bottle-neck neural network (BN) features were investigated as input to a DNN recognizer, and semi-supervised training was found effective; (2) adding noise to the training data outperformed a classical de-noising technique when dealing with noisy test data, and the performance of this approach was verified on a relatively clean training/test data setup from a different language. All results are reported on BABEL 2014 Tamil data.
Article
Full-text available
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminate analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
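The cosine-distance scoring used as the final decision score in the second system described above can be sketched directly. The i-vectors below are random hypothetical stand-ins, and the LDA/WCCN channel compensation discussed in the abstract is omitted:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine distance scoring in the total variability space: the
    score is the cosine of the angle between two i-vectors."""
    return float(w_test @ w_target /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

# Hypothetical 400-dimensional i-vectors
rng = np.random.default_rng(1)
w_a = rng.normal(size=400)
w_b = w_a + 0.1 * rng.normal(size=400)   # small perturbation: same-speaker-like
w_c = rng.normal(size=400)               # independent: different-speaker-like
print(cosine_score(w_a, w_b) > cosine_score(w_a, w_c))  # True
```

Because the score is used directly as the decision score, no separate SVM or back-end classifier is needed in this variant.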
Article
There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can be highly variable, but it is common to have relatively poor recording conditions and mismatches in speaking style and recording conditions between the known- and questioned-speaker recordings. In order to validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. The present paper is an introduction to a Virtual Special Issue consisting of papers reporting on the results of testing forensic voice comparison systems under conditions reflecting those of an actual forensic voice comparison case. A set of training and test data representative of the relevant population and reflecting the conditions of this particular case has been released, and operational and research laboratories are invited to use these data to train and test their systems. The present paper includes the rules for the evaluation and a description of the evaluation metrics and graphics to be used. The name of the evaluation is: forensic_eval_01
Article
Logistic-regression calibration and fusion are potential steps in the calculation of forensic likelihood ratios. The present paper provides a tutorial on logistic-regression calibration and fusion at a practical conceptual level with minimal mathematical complexity. A score is log-likelihood-ratio like in that it indicates the degree of similarity of a pair of samples while taking into consideration their typicality with respect to a model of the relevant population. A higher-valued score provides more support for the same-origin hypothesis over the different-origin hypothesis than does a lower-valued score; however, the absolute values of scores are not interpretable as log likelihood ratios. Logistic-regression calibration is a procedure for converting scores to log likelihood ratios, and logistic-regression fusion is a procedure for converting parallel sets of scores from multiple forensic-comparison systems to log likelihood ratios. Logistic-regression calibration and fusion were developed for automatic speaker recognition and are popular in forensic voice comparison. They can also be applied in other branches of forensic science, a fingerprint/finger-mark example is provided.
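Logistic-regression calibration as described above maps a score s to a log likelihood ratio a·s + b. A toy sketch follows, fitting a and b by plain gradient descent on synthetic scores; production implementations (e.g. FoCal-style tools) handle priors and weighting more carefully:

```python
import numpy as np

def train_calibration(scores_ss, scores_ds, iters=5000, lr=0.1):
    """Fit a, b so that a*score + b approximates a log LR, by gradient
    descent on the cross-entropy of p = sigmoid(a*s + b) against
    same-source (1) / different-source (0) labels."""
    s = np.concatenate([scores_ss, scores_ds])
    y = np.concatenate([np.ones(len(scores_ss)), np.zeros(len(scores_ds))])
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - y) * s)   # gradient w.r.t. slope
        b -= lr * np.mean(p - y)         # gradient w.r.t. offset
    return a, b

# Hypothetical uncalibrated scores: same-source higher on average
rng = np.random.default_rng(2)
ss = rng.normal(2.0, 1.0, 500)
ds = rng.normal(-2.0, 1.0, 500)
a, b = train_calibration(ss, ds)
llr = lambda score: a * score + b        # calibrated log LR
print(llr(2.0) > 0, llr(-2.0) < 0)
```

Fusion is the same idea with one slope per system: llr = a1·s1 + a2·s2 + … + b.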
Article
Auckenthaler, Roland, Carey, Michael, and Lloyd-Thomas, Harvey, Score Normalization for Text-Independent Speaker Verification Systems, Digital Signal Processing 10 (2000), 42–54. This paper discusses several aspects of score normalization for text-independent speaker verification. The theory of score normalization is explained using Bayes' theorem and detection error trade-off plots. Based on the theory, the world, cohort, and zero normalization techniques are explained. A novel normalization technique, test normalization, is introduced. Experiments showed significant improvements for this new technique compared to the standard techniques. Finally, there is a discussion of the use of additional knowledge to further improve the normalization methods. Here, the test normalization method is extended to use knowledge of the handset type.
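Zero normalization and test normalization share the same arithmetic and differ only in where the cohort scores come from; a minimal sketch with hypothetical cohort scores:

```python
import numpy as np

def z_norm(score, cohort_scores):
    """Zero normalization: standardise a score using the scores the
    *target model* produces against a cohort of impostor utterances."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
    return (score - mu) / sigma

def t_norm(score, cohort_scores):
    """Test normalization: same form, but the cohort scores come from
    scoring the *test utterance* against a set of cohort models."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
    return (score - mu) / sigma

# Hypothetical raw cohort scores for one trial
cohort = [0.2, 0.1, 0.3, 0.15, 0.25]
print(round(t_norm(0.8, cohort), 2))
```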
Article
This paper proposes a new isolated word recognition technique based on a combination of instantaneous and dynamic features of the speech spectrum. This technique is shown to be highly effective in speaker-independent speech recognition. Spoken utterances are represented by time sequences of cepstrum coefficients and energy. Regression coefficients for these time functions are extracted for every frame over an approximately 50 ms period. Time functions of regression coefficients extracted for cepstrum and energy are combined with time functions of the original cepstrum coefficients, and used with a staggered array DP matching algorithm to compare multiple templates and input speech. Speaker-independent isolated word recognition experiments using a vocabulary of 100 Japanese city names indicate that a recognition error rate of 2.4 percent can be obtained with this method. Using only the original cepstrum coefficients the error rate is 6.2 percent.
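The regression coefficients over a window of frames described above correspond to what are now usually called delta features; a sketch of the standard regression formulation, applied to a toy one-dimensional cepstral track:

```python
import numpy as np

def delta(ceps, N=2):
    """First-order regression (delta) coefficients over a +/- N frame
    window: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    T = len(ceps)
    padded = np.pad(ceps, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

# A linearly increasing 1-D "cepstral" track has constant slope 1
ceps = np.arange(10, dtype=float).reshape(-1, 1)
d = delta(ceps, N=2)
print(d[4, 0])  # 1.0 in the interior of the track
```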
Article
Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
Morrison, G.S., Enzinger, E., Ramos, D., González-Rodríguez, J., Lozano-Díez, A. Statistical models in forensic voice comparison. In: Banks, D.L., Kafadar, K., Kaye, D.H., Tackett, M. (Eds.), Handbook of Forensic Statistics. Boca Raton, FL: CRC Press, in press.
Rose, P., 2002. Forensic Speaker Identification. Taylor and Francis, London.
Schwarz, P., 2008. Phoneme recognition based on long temporal context. Ph.D. Dissertation, Brno University of Technology.