Jean-François Bonastre

  • PhD
  • Research Director at National Institute for Research in Digital Science and Technology

About

329
Publications
68,167
Reads
7,670
Citations
Introduction
I am a Research Director at Inria Defense & Security and an associate professor at LIA, University of Avignon. I am interested in audio and speech processing for defense and security applications and, more generally, in AI for defense and security. My main activity concerns voice/speaker characterization and recognition. I pay particular attention to explainability in relation to speech science knowledge.
Current institution
National Institute for Research in Digital Science and Technology
Current position
  • Research Director
Additional affiliations
September 1994 - present
University of Avignon
Position
  • Professor (Full)
December 2008 - December 2015
University of Avignon
Position
  • University vice-president

Publications

Chapter
Originating in game theory, Shapley values are widely used for explaining a machine learning model’s prediction by quantifying the contribution of each feature’s value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a mu...
Article
Many sub-Saharan African languages are categorized as tone languages and for the most part, they are classified as low resource languages due to the limited resources and tools available to process these languages. Identifying the tone associated with a syllable is therefore a key challenge for speech recognition in these languages. We propose mode...
Preprint
Full-text available
Originating in game theory, Shapley values are widely used for explaining a machine learning model's prediction by quantifying the contribution of each feature's value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a mu...
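As a point of reference for the scalar case these abstracts start from, the sketch below computes exact Shapley values by enumerating every coalition of features. It is a generic textbook illustration, not the multiclass method proposed in this work; the function name `shapley_values` and the toy additive value function in the usage lines are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for a scalar value function over n features.

    Enumerates all coalitions, so it is only feasible for small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n! for a coalition S without i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition S
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical usage: for an additive model, each feature's Shapley value equals w_j * x_j.
w, x = [2.0, -1.0, 0.5], [1.0, 3.0, 4.0]
print(shapley_values(3, lambda s: sum(w[j] * x[j] for j in s)))  # ≈ [2.0, -3.0, 2.0]
```

The cost grows as 2^(n-1) coalitions per feature, which is why practical explainers approximate these values; the point here is only the definition for a scalar prediction.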
Conference Paper
Full-text available
Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to com...
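The equal error rate (EER) is one of the standard metrics for an isolated spoofing countermeasure. The sketch below estimates it by sweeping a threshold over the observed scores; the convention that higher scores mean bona fide speech, and the function name, are assumptions for illustration. It does not implement the combined metric for countermeasures and speaker detection proposed in the paper.

```python
from typing import Sequence, Tuple


def equal_error_rate(bona_fide: Sequence[float], spoof: Sequence[float]) -> Tuple[float, float]:
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumes higher scores mean 'bona fide'. Returns (eer, threshold).
    """
    best = (float("inf"), 0.0, 0.0)  # (|FAR - FRR|, EER estimate, threshold)
    for t in sorted(set(bona_fide) | set(spoof)):
        frr = sum(s < t for s in bona_fide) / len(bona_fide)  # bona fide trials rejected
        far = sum(s >= t for s in spoof) / len(spoof)  # spoofed trials accepted
        gap = abs(far - frr)
        if gap < best[0]:
            # Keep the threshold where the two error rates are closest
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

With perfectly separated score sets the estimate is 0; with overlapping sets it is the average of FAR and FRR at the threshold where they are closest.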
Article
Full-text available
People’s perception of forensic evidence is greatly influenced by crime TV series. The analysis of the human voice is no exception. However, unlike fingerprints—with which fiction and popular beliefs draw an incorrect parallel—the human voice varies according to many factors, can be altered deliberately, and its potential uniqueness has yet to be p...
Technical Report
Full-text available
In the context of spoofing attacks in speaker recognition systems, we observed that the waveform probability mass function (PMF) of genuine speech differs significantly from the PMF of speech resulting from the attacks. This is true for synthesized or converted speech as well as replayed speech. We also noticed that this observation seems to have a...
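A minimal sketch of the kind of measurement described: estimate the empirical PMF of integer waveform sample values (e.g. 16-bit PCM amplitudes) and compare two PMFs. The summary does not say which distance the report uses; Jensen-Shannon divergence is an assumption here, as are the function names.

```python
from collections import Counter
from math import log
from typing import Dict, Sequence


def waveform_pmf(samples: Sequence[int]) -> Dict[int, float]:
    """Empirical PMF of integer sample values."""
    counts = Counter(samples)
    total = len(samples)
    return {value: c / total for value, c in counts.items()}


def js_divergence(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Jensen-Shannon divergence (in nats) between two PMFs over sample values."""
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a: Dict[int, float], b: Dict[int, float]) -> float:
        # KL(a || b); every key of a has a[k] > 0, hence b[k] = m[k] > 0
        return sum(a[k] * log(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The divergence is 0 for identical PMFs and reaches ln 2 when the two signals share no sample values, so it gives a bounded score for how far an attack's amplitude distribution drifts from genuine speech.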
Conference Paper
Full-text available
Deep neural networks have dominated speaker recognition, with a sharp increase in performance associated with increasingly complex models. This comes at the cost of transparency, which poses serious problems for informed decision making. In response, an intrinsically interpretable scoring approach, BA-LR, was recently presented. This method uses an...
Conference Paper
Full-text available
In order to stave off the effects of hypoxia, speech may become limited at elevated altitudes. This paper evaluates the role of speech on acoustic and physiological features used to detect hypoxia. Acoustic, cerebral blood oxygenation, and cardiac signals were recorded from participants who completed control and normobaric hypoxia experimental cond...
Article
Full-text available
Many questions remain with regard to how context affects perceptual and automatic speaker identification performance. To examine the effects of task design on perceptual speaker identification performance, three tasks were developed, including lineup and binary tasks, as well as a novel clustering task. Speech recordings of native French speakers...
Preprint
Full-text available
This paper presents a study on the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self-supervision. Carried out on the well-known TED-LIUM 3 dataset, our experiments show that such a model can obtain, with no use of a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set, withou...
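The summary does not specify how client updates are aggregated. A common choice in federated learning is federated averaging (FedAvg), in which the server takes a data-size-weighted mean of the client parameters; the sketch below assumes that rule and represents each model as a flat list of floats for simplicity. The function name is hypothetical.

```python
from typing import List, Sequence


def federated_average(client_params: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Weighted mean of client parameter vectors, weights proportional to local data size."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one data size per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    avg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share the same parameter shape")
        for k, v in enumerate(params):
            avg[k] += (n / total) * v
    return avg
```

In a real system each entry would be a full weight tensor of the acoustic model, but the weighting logic is the same.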
Preprint
The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes. Here, we propose to transform the speaker embedding and the pitch in order to hide the sex of the speaker. ECAPA-TDNN-based speaker representation fed into a HiFiGAN vocoder is protected using a...
Chapter
Full-text available
The challenges facing naïve listeners tasked with identifying or discriminating speakers are well documented. In addition to providing listeners with high-quality speech recordings that accurately represent the speakers, the perceptual task itself is equally important. Conventional perceptual speaker identification and discrimination tasks include...
Conference Paper
Full-text available
Several services integrated into our daily lives use automatic speech recognition (Apple Siri, Amazon Alexa...). These services rely on models trained on large amounts of data to ensure their effectiveness. The data used are collected via the applications, from users' interactions...
Preprint
Full-text available
The VoicePrivacy Challenge aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this document, we formulate the voice anonymization task selected for the VoicePriva...
Conference Paper
Full-text available
The presence of background noise and reverberation, especially in far-field speech utterances, diminishes the performance of speaker recognition systems. This challenge is addressed on different levels, from the signal level in the front end to the scoring technique adaptation in the back end. In this paper, two new variants of ResNet-based speake...
Preprint
In this article, we study the robustness of state-of-the-art speaker recognition systems to acoustic variability, such as additive noise and reverberation. Two systems will be compared: the first is based on TDNN, while the second is based on ResNet. We will show that, overall and without the use of...
Preprint
In this paper, a comprehensive exploration of noise robustness and noise compensation of ResNet and TDNN speaker recognition systems is presented. First, the robustness of TDNN and ResNet in the presence of noise, reverberation, and both distortions is explored. Our experimental results show that in all cases the ResNet system is more robust t...
Conference Paper
Full-text available
Likelihood ratio (LR) is a widely adopted paradigm in forensic science to represent the conclusion of a practitioner report. However, with existing estimation methods, the LR does not fully facilitate the decision making by judges and juries. With an explained decomposition of the LR value together with the case information, the judge can more easi...
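The likelihood ratio weighs the evidence E under the two competing hypotheses, LR = p(E | same speaker) / p(E | different speakers). One common way to obtain it from a system score, assumed here purely for illustration, is to model same-speaker and different-speaker scores with Gaussians; the parameters and function names below are hypothetical and do not describe the decomposition proposed in the paper.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution N(mean, std^2) at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def likelihood_ratio(score: float,
                     same_mean: float, same_std: float,
                     diff_mean: float, diff_std: float) -> float:
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return gaussian_pdf(score, same_mean, same_std) / gaussian_pdf(score, diff_mean, diff_std)
```

An LR above 1 supports the same-speaker hypothesis, below 1 the different-speakers hypothesis, and exactly 1 carries no strength of evidence either way.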
Preprint
Full-text available
For new participants - Executive summary: (1) The task is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content, paralinguistic attributes, intelligibility and naturalness. (2) Training, development and evaluation datasets are provided in addition to 3 different basel...
Article
This paper presents the results and analyses stemming from the first VoicePrivacy 2020 Challenge which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design with an analysis of submitted systems and evaluation results. In particular, we describe the voice anonymization task and...
Preprint
Full-text available
The widespread availability of powerful personal devices capable of collecting their users' voices has opened the opportunity to build speaker-adapted automatic speech recognition (ASR) systems or to participate in collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. AMs fine-tuned with specific speaker data, can be built. A question that...
Preprint
Full-text available
This paper investigates methods to effectively retrieve speaker information from the personalized speaker adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models where a global model is learnt on the server based on the updat...
Preprint
In this paper, we discuss an important aspect of speech privacy: protecting spoken content. New capabilities from the field of machine learning provide a unique and timely opportunity to revisit speech content protection. There are many different applications of content privacy, even though this area has been under-explored in speech technology res...
Article
In this paper, we discuss an important aspect of speech privacy: protecting spoken content. New capabilities from the field of machine learning provide a unique and timely opportunity to revisit speech content protection. There are many different applications of content privacy, even though this area has been under-explored in speech technology res...
Preprint
Attribute-driven privacy aims to conceal a single user's attribute, contrary to anonymisation that tries to hide the full identity of the user in some data. When the attribute to protect from malicious inferences is binary, perfect privacy requires the log-likelihood-ratio to be zero resulting in no strength-of-evidence. This work presents an appro...
Chapter
Finding professional voice-actors for cultural productions is performed by a human operator and suffers from several difficulties. Researchers have therefore been interested for several years in mimicking the process of vocal casting to help human operators find new voices. However, voice casting appears to be an underdefined task with many difficu...
Chapter
While the natural voice is spontaneously generated by people, the acted voice is a controlled vocal interpretation, produced by professional actors and aimed at creating a desired effect on the listener. In this work, we pay attention to the aspects of the voice related to the character played. We particularly focus on actors playing the same video...
Preprint
Full-text available
For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omnipresent. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls f...
Preprint
Full-text available
This paper presents the results and analyses stemming from the first VoicePrivacy 2020 Challenge which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design with an analysis of submitted systems and evaluation results. In particular, we describe the voice anonymization task and...
Article
Anonymisation and pseudonymisation are two similar concepts used in privacy preservation for speech data. With no established definitions for these tasks, nor standard approaches to assessment, this paper provides definitions and presents two complementary assessment frameworks. The first is based on voice similarity matrices which provide both an...
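As a rough illustration of a voice similarity matrix, the sketch below fills entry (i, j) with the cosine similarity between speaker embeddings from two sets, for example original and anonymised speech. Using cosine similarity on embeddings is an assumption; the paper's matrices may be built from a different similarity measure, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def similarity_matrix(set_a: Sequence[Sequence[float]],
                      set_b: Sequence[Sequence[float]]) -> List[List[float]]:
    """M[i][j] = similarity between speaker i in set A and speaker j in set B."""
    return [[cosine(a, b) for b in set_b] for a in set_a]
```

When the two sets list the same speakers in the same order, a strongly dominant diagonal indicates that speakers remain recognisable, whereas a flat matrix suggests their identities have been concealed.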
Conference Paper
In speech technologies, the speaker's voice representation is used in many applications such as speech recognition, voice conversion, speech synthesis and, obviously, user authentication. Modern vocal representations of the speaker are based on neural embeddings. In addition to the targeted information, these representations usually contain sensitive...
Conference Paper
Full-text available
Our study examined the performance of evaluators tasked to group natural and anonymised speech recordings into clusters based on their perceived similarities. Speech stimuli were selected from the VCTK corpus; two systems developed for the VoicePrivacy 2020 Challenge were used for anonymisation. The Baseline-1 (B1) system was developed by using x-v...
Conference Paper
Full-text available
The performance of speaker recognition systems reduces dramatically in severe conditions in the presence of additive noise and/or reverberation. In some cases, there is only one kind of domain mismatch like additive noise or reverberation, but in many cases, there are more than one distortion. Finding a solution for domain adaptation in...
Conference Paper
Full-text available
The challenges facing listeners tasked to identify speakers are well documented. In addition to providing listeners with high-quality speech recordings that accurately represent the speakers, the method of presentation itself is equally important. Although many speaker discrimination studies have employed a binary approach, they require numerous te...
Conference Paper
Full-text available
Mounting privacy legislation calls for the preservation of privacy in speech technology, though solutions are gravely lacking. While evaluation campaigns are long-proven tools to drive progress, the need to consider a privacy adversary implies that traditional approaches to evaluation must be adapted to the assessment of privacy and privacy preserv...
Conference Paper
Full-text available
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy...
Conference Paper
Full-text available
In recent speaker recognition challenges on identity impersonation, we have observed that the probability mass function (PMF) of the waveform of genuine speech differs significantly from the PMF of identity theft extracts. In our previous works, we presented the analysis of the influence of the waveform on the logical access (LA) spoofing condition, w...
Conference Paper
The proliferation of speech technologies and rising privacy legislation call for the development of privacy preservation solutions for speech applications. These are essential since speech signals convey a wealth of rich, personal and potentially sensitive information. Anonymisation, the focus of the recent VoicePrivacy initiative, is one strategy...
Preprint
Full-text available
The proliferation of speech technologies and rising privacy legislation call for the development of privacy preservation solutions for speech applications. These are essential since speech signals convey a wealth of rich, personal and potentially sensitive information. Anonymisation, the focus of the recent VoicePrivacy initiative, is one strategy...
Preprint
With the increasing interest in speech technologies, numerous Automatic Speaker Verification (ASV) systems are employed to perform person identification. In the latter context, the systems rely on neural embeddings as a speaker representation. Nonetheless, such representations may contain privacy sensitive information about the speakers (e.g. age...
Preprint
Full-text available
Mounting privacy legislation calls for the preservation of privacy in speech technology, though solutions are gravely lacking. While evaluation campaigns are long-proven tools to drive progress, the need to consider a privacy adversary implies that traditional approaches to evaluation must be adapted to the assessment of privacy and privacy preserv...
Preprint
Full-text available
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy...
Article
Full-text available
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentation...
Preprint
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentati...
Article
Full-text available
Speech recordings are a rich source of personal, sensitive data that can be used to support a plethora of diverse applications, from health profiling to biometric recognition. It is therefore essential that speech recordings are adequately protected so that they cannot be misused. Such protection, in the form of privacy-preserving technologies, is...
Chapter
In the neural network galaxy, the large majority of approaches and research effort is dedicated to well-defined tasks, such as recognizing an image of a cat or discriminating noise from speech recordings. For these kinds of tasks, it is easy to write a labeling reference guide in order to obtain training and evaluation data with a ground truth. But for a large s...
Preprint
Full-text available
The social media revolution has produced a plethora of web services to which users can easily upload and share multimedia documents. Despite the popularity and convenience of such services, the sharing of such inherently personal data, including speech data, raises obvious security and privacy concerns. In particular, a user's speech data may be ac...
Conference Paper
Full-text available
Dubbing contributes to a larger international distribution of multimedia documents. It aims to replace the original voice in a source language by a new one in a target language. For now, the target voice selection procedure, called voice casting, is manually performed by human experts. This selection is not exclusively based on acoustic similarity...
Conference Paper
Dubbing contributes to a larger international distribution of multimedia documents. It aims to replace the original voice in a source language by a new one in a target language. For now, the target voice selection procedure, called voice casting, is manually performed by human experts. This selection is not exclusively based on acoustic similar...
Preprint
Full-text available
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10th anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary obje...
Conference Paper
Full-text available
It is common to see voice recordings being presented as a forensic trace in court. Generally, a forensic expert is asked to analyze both suspect and criminal’s voice samples in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. This process is known as Forensic Voice Comparison...
Conference Paper
Full-text available
The assessment of performance for any number of speech processing tasks calls for the use of a suitably large, representative dataset. Dataset design is crucial so as to ensure that any significant variation unrelated to the task in hand is adequately normalised or marginalised. Most datasets are partitioned into training, development and evaluatio...
Conference Paper
Full-text available
It is common to see voice recordings being presented as a forensic trace in court. Generally, a forensic expert is asked to analyze both suspect and criminal's voice samples in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. This process is known as Forensic Voice Comparison (...
Conference Paper
Full-text available
Forensic Voice Comparison (FVC) is increasingly using the likelihood ratio (LR) in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. In the LR estimation done by ASpR systems, different factors are not taken into account such as the amount of information involved in the compa...
Conference Paper
Full-text available
Various temporal measures, such as the duration of vowels and consonants, have been proposed to characterize the rhythm of speech and thus to classify languages, dialects or idiolects. This study focuses on this last role of the temporal parameters of speech, using the FABIOLE database. Used for voice comparison, it is construc...
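Interval-based rhythm metrics of the kind mentioned, such as the proportion of vocalic duration (%V) and the standard deviations of vocalic and consonantal interval durations (ΔV, ΔC), can be sketched as follows. The input format, the function name and the use of the population standard deviation are assumptions; the measures actually used in the study may differ.

```python
from statistics import pstdev
from typing import Dict, Sequence, Tuple


def rhythm_metrics(intervals: Sequence[Tuple[str, float]]) -> Dict[str, float]:
    """%V, deltaV and deltaC from ('V' or 'C', duration in seconds) intervals.

    Requires at least one vocalic and one consonantal interval.
    """
    vowels = [d for kind, d in intervals if kind == "V"]
    consonants = [d for kind, d in intervals if kind == "C"]
    total = sum(vowels) + sum(consonants)
    return {
        "percent_V": 100.0 * sum(vowels) / total,  # share of time spent in vowels
        "delta_V": pstdev(vowels),  # variability of vocalic intervals
        "delta_C": pstdev(consonants),  # variability of consonantal intervals
    }
```

Computed per speaker rather than per language, the same quantities become candidate idiolectal features for voice comparison.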
Article
Full-text available
The past decade has witnessed a significant improvement in speaker recognition technology (SR) in terms of performance with the introduction of the i-vectors framework. Despite these advances, the performance of SR systems considerably suffers in the presence of acoustic nuisances and variabilities. In this paper, we develop a data driven nuisance comp...
Conference Paper
Full-text available
It is common to see mobile recordings being presented as a forensic trace in a court. In such cases, a forensic expert is asked to analyze both suspect and criminal's voice samples in order to determine the strength-of-evidence. This process is known as Forensic Voice Comparison (FVC). The Likelihood ratio (LR) framework is commonly used by the exp...
Conference Paper
Full-text available
The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 as the result from the collaboration and active exchange of information among researchers from sixteen Institutes and Universities across 4...
Conference Paper
Full-text available
Forensic Voice Comparison (FVC) is increasingly using the likelihood ratio (LR) in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. Nevertheless, the LR is subject to some practical limitations due both to its estimation process itself and to a lack of knowledge about the reliability o...
Article
Full-text available
Since the i-vector paradigm was introduced in the field of speaker recognition, many techniques have been proposed to deal with additive noise within this framework. Due to the complexity of its effect in the i-vector space, a lot of effort has been put into dealing with noise in other domains (speech enhancement, feature compensation, robust i...
Article
Speaker diarization is the problem of segmenting a conversation among unknown speakers into speaker-homogeneous parts. State-of-the-art diarization systems are based on i-vector methodologies. However, these approaches require large quantities of training data, which must be obtained from an environment that is similar to that of the conver...
Article
Full-text available
This paper describes the LIA speaker recognition system developed for the Speaker Recognition Evaluation (SRE) campaign. Eight sub-systems are developed, all based on a state-of-the-art approach: i-vector/PLDA which represents the mainstream technique in text-independent speaker recognition. These sub-systems differ: on the acoustic feature extract...
Conference Paper
Full-text available
Forensic Voice Comparison (FVC) is increasingly using the likelihood ratio (LR) in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. In addition to supporting one hypothesis, the LR provides a theoretically founded estimate of the relative strength of its support. Despite this n...
Poster
Full-text available
Poster, Interspeech 2016: Probabilistic approach using joint long and short session i-vectors modeling to deal with short utterances for speaker recognition