While the forensic and judicial communities advocate a unified methodology for speaker identification, the market, with its need for differentiation, and the scientific field, with its dizzying pace of progress, remain indifferent to that goal. This paper proposes a methodology that encompasses, as broadly as possible, all the technologies available in today's forensic laboratories. It covers the reception of the samples and the design of the analysis strategy, continues with the conditioning of the material and the identification itself, and ends with the drafting of the final report and the return of the samples. Validation of the technique used is incorporated into the procedure, so that results are reliable and comparable across different laboratories. The recommendation to use likelihood ratios makes the results independent of the technique employed and more understandable for the lawyers, judges and court personnel who must make important decisions based on them.
Automatic speaker recognition using Deep Learning techniques in computer forensic expert examinations.
In 1996, researchers at AT&T Bell Laboratories (Chin-Hui Lee, Frank K. Soong, Kuldip K. Paliwal, 1996) defined "Automatic Speaker Recognition" (ASR) as "the science and technology aimed at implementing algorithms to classify and identify speakers from their voice recordings".
Today, this activity falls within the remit of the forensic computer expert.
As computational processing power and deep learning techniques evolved, it became possible to develop a prototype based on the branch of "Artificial Intelligence" known as "Deep Learning" that identifies speakers automatically, with metrics that until now had been reached only by methods such as fingerprint analysis or DNA profiling.
This forensic test gives the computer expert the authority to issue an opinion accepting or rejecting the hypothesis: "The questioned voice sample of speaker X is compatible, with an accuracy of ,* %, with the known (undisputed) voice sample of speaker Y".
The judge or jury can then use that expert opinion as evidence when reaching the final verdict.
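The prototype itself is not reproduced here, but the decision it automates can be illustrated with a minimal, hypothetical sketch: two fixed-length speaker embeddings (e.g. x-vectors produced by a deep encoder, which is assumed rather than implemented) are compared by cosine similarity against a calibration threshold.

import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    # Cosine similarity between two fixed-length speaker embeddings.
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_questioned: np.ndarray, emb_known: np.ndarray,
                 threshold: float = 0.6) -> bool:
    # Accept the "same speaker" hypothesis only if the score clears a
    # threshold; 0.6 is illustrative and would be calibrated on held-out data.
    return cosine_score(emb_questioned, emb_known) >= threshold

# Stand-in embeddings; in practice they come from a deep speaker encoder.
rng = np.random.default_rng(0)
print(same_speaker(rng.normal(size=256), rng.normal(size=256)))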
Some of the most popular applications of speaker recognition are: financial, forensic and legal, security, audio and video indexing, surveillance, teleconferencing, and e-learning. This work focuses on forensic applications, and seeks to improve automatic speaker recognition systems by incorporating distinctive long-term features into the short-term information currently used.
We can summarize the overall objective of this thesis as the "incorporation of long-term information into an automatic speaker recognition system based on standard segmental parameters to be used in forensic applications."
The proposed hypothesis suggests that if higher-order segmental characteristics and suprasegmental features useful for discriminating between people can be found, the multiparametric system will perform better in the recognition task, since it will have more information about the speaker. The main sources for determining these features will be neuroscience and linguistics, considering that valuable knowledge can be extracted from analysing the innate human ability to recognize people by their voice.
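As a rough illustration of the kind of fusion this hypothesis implies (our own hypothetical sketch, not the thesis system), a few long-term descriptors can be derived from an F0 contour and the score of a segmental subsystem combined with that of a suprasegmental subsystem by a weighted sum; the weight would be tuned on development data.

import numpy as np

def longterm_features(f0_contour: np.ndarray) -> np.ndarray:
    # Simple suprasegmental descriptors from a frame-wise F0 contour
    # (unvoiced frames coded as 0): mean, standard deviation and range.
    voiced = f0_contour[f0_contour > 0]
    return np.array([voiced.mean(), voiced.std(), np.ptp(voiced)])

def fuse_scores(segmental_score: float, suprasegmental_score: float,
                w: float = 0.7) -> float:
    # Linear score-level fusion of a short-term (e.g. MFCC-based) subsystem
    # and a long-term (prosodic) subsystem; w = 0.7 is only illustrative.
    return w * segmental_score + (1.0 - w) * suprasegmental_score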
Both the forensic and the legal communities advocate a unified methodology for interpreting the strength of scientific evidence. The recommendation that experts use likelihood ratios (Bayes factors) makes the results independent of the type of forensic analysis and of the technique employed, and makes them more understandable for the lawyers, judges and court personnel who are responsible for making the decisions of the case based on them. This paper explains how to determine the Bayes factor for cases with a single piece of evidence and for cases in which several pieces of evidence must be combined. Finally, some guidelines for preparing the evaluative report are presented, highlighting the importance of the numerical and verbal presentation of the strength of the evidence and of the correct selection of the reference population.
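For reference, the standard Bayes-factor relations underlying this recommendation can be stated as follows; the product rule for combining several items of evidence holds only under conditional independence, the simplest case considered here:

\[
\mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)}, \qquad
\frac{P(H_p \mid E)}{P(H_d \mid E)} = \mathrm{LR}\,\frac{P(H_p)}{P(H_d)}, \qquad
\mathrm{LR}_{\text{combined}} = \prod_{i=1}^{n} \mathrm{LR}_i .
\]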
Abstract— While the forensic and judicial communities advocate a unified methodology for speaker identification, the market, with its need for differentiation, and the scientific field, with its dizzying pace of progress, remain indifferent to that goal. This paper proposes a methodology that includes, as broadly as possible, all the technologies available in today's forensic laboratories while, at the same time, standardizing the stages of the process for comparing the speech samples requested by the judge or the competent judicial authority. It covers the reception of the samples and the design of the analysis strategy, continues with the conditioning of the received material and the identification itself, and ends with the drafting of the final report and the return of the delivered samples. Validation of the technique used was incorporated into the procedure so that results are also reliable and comparable across different laboratories. The recommendation to use likelihood ratios makes the results independent of the technique employed and more understandable for the lawyers, judges and court personnel who must make important decisions based on them.
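Purely as an illustration of the standardized workflow the abstract describes (the stage names below are ours, not the authors'), the process can be encoded as an ordered sequence of stages:

from enum import Enum, auto

class ForensicVoiceComparisonStage(Enum):
    # Illustrative ordering of the stages mentioned in the abstract.
    SAMPLE_RECEPTION = auto()
    ANALYSIS_STRATEGY = auto()
    MATERIAL_CONDITIONING = auto()   # adequacy/cleaning of the recordings
    COMPARISON = auto()              # the identification proper
    VALIDATION = auto()              # empirical validation of the technique
    EVALUATIVE_REPORT = auto()       # likelihood-ratio based final report
    SAMPLE_RETURN = auto()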
This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the speech parameterization most commonly used in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Score normalization is then explained, as this is a very important step for dealing with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
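A minimal sketch of the GMM-based scoring stage described above, written with scikit-learn and assuming cepstral feature matrices (rows = frames, columns = e.g. MFCCs); MAP adaptation of the speaker model from the UBM and score normalization (e.g. T-norm) are deliberately omitted:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features: np.ndarray, n_components: int = 8) -> GaussianMixture:
    # Fit a diagonal-covariance GMM on cepstral feature vectors.
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(features)
    return gmm

def llr_score(test_feats: np.ndarray,
              speaker_gmm: GaussianMixture,
              ubm: GaussianMixture) -> float:
    # Average per-frame log-likelihood ratio between the claimed-speaker
    # model and the universal background model (UBM).
    return float(np.mean(speaker_gmm.score_samples(test_feats)
                         - ubm.score_samples(test_feats)))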
A.P.A. Broeders
Netherlands Forensic Science Institute (NFI), Rijswijk, The Netherlands (t.broeders@nfi.minjus.nl)
Forensic Speech and Audio Analysis, Forensic Linguistics, 1998-2001
13th INTERPOL Forensic Science Symposium, Lyon, France, 16-19 October 2001
Abstract
Although the development of state-of-the-art speaker recognition systems has shown considerable progress in the last decade, performance levels of these systems do not as yet seem to warrant large-scale introduction in anything other than relatively low-risk applications. Conditions typical of the forensic context such as differences in recording equipment and transmission channels, the presence of background noise and of variation due to differences in communicative context continue to pose a major challenge. Consequently, the impact of automatic speaker recognition technology on the forensic scene has been relatively modest and forensic speaker identification practice remains heavily dominated by the use of a wide variety of largely subjective procedures. While recent developments in the interpretation of the evidential value of forensic evidence clearly favour methods that make it possible for results to be expressed in terms of a likelihood ratio, unlike automatic procedures, traditional methods in the field of speaker identification do not generally meet this requirement. However, conclusions in the form of a binary yes/no-decision or a qualified statement of the probability of the hypothesis rather than the evidence are increasingly criticised for being logically flawed. Against this background, the need to put alternative validation procedures in place is becoming more widely accepted.
Although speaker identification by earwitnesses differs in some important respects from the much more widely studied field of eyewitness identification, there are sufficient parallels between the two for speaker identification by earwitnesses to benefit greatly from a close study of the guidelines that have been proposed for the administration of line-ups in the visual domain. Some of the central notions are briefly discussed.
Rapid technical developments in the world of telecommunications in which speech and data are increasingly transmitted through the same communication channels may soon blunt the efficacy of traditional telephone interception as an investigative and evidential tool. The gradual shift from analogue to digital recording media and the increasingly widespread availability of digital sound processing equipment as well as its ease of operation make certain types of manipulation of audio recordings comparatively easy to perform. If done competently, such manipulation may leave no traces and may therefore well be impossible to detect.
Authorship attribution is another forensic area that has had a relatively chequered history. The rapid increase in the use of electronic writing media including e-mail, SMS, and the use of ink jet printers at the expense of typewritten and to a lesser extent handwritten texts reduces the opportunities of authorship attribution by means of traditional document examination techniques and may create a greater demand for linguistic expertise in this area.
A survey is provided of ongoing work in the area, based on reactions to a questionnaire sent out earlier this year.
Every system of automatic voice recognition can be divided into three parts: the voice source, the measuring system and the classifier. The object of interest here is the classifier, with emphasis on a procedure capable of recognizing voices in open sets. The methodology of the investigations, the analysis of the problem and the recognition algorithm are presented. Experimental results concerning choice optimization and the discrimination of extraneous voices, as well as the problem of choosing threshold values for a given recognition strategy, are discussed.
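The open-set decision rule discussed above can be sketched as follows (a hypothetical illustration, not the authors' algorithm): the best-scoring enrolled speaker is returned only if its score clears a rejection threshold, otherwise the voice is treated as extraneous.

def open_set_decision(scores: dict[str, float],
                      threshold: float) -> str | None:
    # Open-set classification: return the best-matching enrolled speaker
    # only if its score clears the rejection threshold, otherwise report
    # "voice not in the set" (None). The threshold trades false acceptance
    # of extraneous voices against false rejection of enrolled ones.
    best_speaker = max(scores, key=scores.get)
    return best_speaker if scores[best_speaker] >= threshold else None

print(open_set_decision({"spk_A": 1.8, "spk_B": 0.4}, threshold=1.0))  # spk_A
print(open_set_decision({"spk_A": 0.6, "spk_B": 0.4}, threshold=1.0))  # None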
In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural-spectrographic approach to forensic voice comparison (also known as "voiceprint" or "voicegram" examination) and the currently widely practiced auditory-acoustic-phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.
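One widely used way to carry out the kind of empirical validation this principle calls for is the log-likelihood-ratio cost (Cllr), computed over same-speaker and different-speaker test pairs drawn from the relevant population; a minimal sketch:

import numpy as np

def cllr(lr_same: np.ndarray, lr_diff: np.ndarray) -> float:
    # Log-likelihood-ratio cost over likelihood ratios from same-speaker
    # (lr_same) and different-speaker (lr_diff) test pairs.
    # Lower is better; an uninformative system gives Cllr = 1.
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_same))
                  + np.mean(np.log2(1.0 + lr_diff)))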
We propose a novel approach for noise-robust speaker recognition, in which a model of the distortions caused by additive and convolutive noise is integrated into the i-vector extraction framework. The model is based on a vector Taylor series (VTS) approximation widely successful in noise-robust speech recognition. It allows "cleaned-up" i-vectors to be extracted that can be used in a standard i-vector back end. We evaluate the proposed framework on the PRISM corpus, a NIST-SRE-like corpus in which noisy conditions were created by artificially adding babble noise to clean speech segments. Results show that VTS i-vectors yield significant improvements in all noisy conditions compared to a state-of-the-art baseline speaker recognition system. More importantly, the proposed framework is robust to noise, as the improvements are maintained when the system is trained on clean data.
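For context, the generic VTS mismatch relation that such models build on (stated here in the log-Mel-filterbank domain as a sketch, not the paper's specific i-vector integration) is the following, which the VTS approach then linearizes with a first-order Taylor expansion around current estimates of the clean speech, noise and channel terms:

import numpy as np

def vts_mismatch(x_log: np.ndarray, n_log: np.ndarray,
                 h_log: np.ndarray) -> np.ndarray:
    # Noisy log-spectrum y = x + h + log(1 + exp(n - x - h)), where x is the
    # clean speech log-spectrum, h the convolutive (channel) term and n the
    # additive noise log-spectrum.
    return x_log + h_log + np.log1p(np.exp(n_log - x_log - h_log))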