Jesús Villalba

Johns Hopkins University · Department of Electrical and Computer Engineering

Telecommunication Engineer

About

145
Publications
31,509
Reads
2,328
Citations
Citations since 2017: 98 research items, 2,039 citations
[Citations-per-year chart, 2017–2023]
Additional affiliations
January 2015 - present
University of Zaragoza
Position
  • Postdoctoral Researcher
Description
  • Noise-robust speaker recognition with DNNs; speaker diarization; spoofing detection
February 2004 - September 2014
University of Zaragoza
Position
  • Speech Researcher
Description
  • Speaker recognition; speech recognition; NIST SRE; spoofing and tampering detection; speech quality measures; acoustic event detection
Education
September 2006 - December 2014
University of Zaragoza
Field of study
  • Speaker Recognition
September 2004 - September 2006
University of Zaragoza
Field of study
  • Biomedical Engineering
September 1998 - February 2004
University of Zaragoza
Field of study
  • Telecommunication Engineer

Publications (145)
Preprint
Speech super-resolution/Bandwidth Extension (BWE) can improve downstream tasks like Automatic Speaker Verification (ASV). We introduce a simple novel technique called Self-FiLM to inject self-supervision into existing BWE models via Feature-wise Linear Modulation. We hypothesize that such information captures domain/environment information, which c...
Preprint
The recently proposed Joint Energy-based Model (JEM) interprets a discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e., examples from the training data set) as well as on "negative e...
Article
Full-text available
Motor impairments are only one aspect of Parkinson's disease (PD), which also includes cognitive and linguistic impairments. Speech-derived interpretable biomarkers may help clinicians diagnose PD at earlier stages and monitor the disorder's evolution over time. This study focuses on the multilingual evaluation of a composite array of biomarkers tha...
Conference Paper
Full-text available
Vowel space area (VSA) is an applicable metric for studying speech production deficits and intelligibility. Previous works suggest that the VSA accounts for almost 50% of the intelligibility variance, being an essential component of global intelligibility estimates. However, almost no study publishes a tool to estimate VSA automatically with public...
Conference Paper
Speech-based automatic approaches for evaluating neurological disorders (NDs) depend on feature extraction before the classification pipeline. It is preferable for these features to be interpretable to facilitate their development as diagnostic tools. This study focuses on the analysis of interpretable features obtained from the spoken responses of...
Preprint
Full-text available
Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the train and test domains. Mixed bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and...
Preprint
Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. Among the most successful are contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However,...
Preprint
Full-text available
Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses: denoiser pre-processor, adversarially fine-tuning...
Preprint
Full-text available
Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adve...
Preprint
Full-text available
Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone spee...
Article
Typically, unsupervised segmentation of speech into phone- and word-like units is treated as two separate tasks, often done via different methods that do not fully leverage the interdependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each o...
Chapter
The emergence of smart home assistants increased the need for robust Far-Field Speaker Identification models. Speaker Identification enables the assistants to perform personalized tasks. Smart home assistants face very challenging speech conditions, including various room shapes and sizes, various distances of the speaker from the microphone, vario...
Preprint
Full-text available
Typically, unsupervised segmentation of speech into phone- and word-like units is treated as two separate tasks, often done via different methods that do not fully leverage the interdependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each o...
Article
Adversarial examples are designed to fool the speaker recognition (SR) system by adding a carefully crafted human-imperceptible noise to the speech signals. Posing a severe security threat to state-of-the-art SR systems, it becomes vital to deep-dive and study their vulnerabilities. Moreover, it is of greater importance to propose countermeasures t...
Preprint
This technical report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 Track 3: self-supervised speaker verification (closed). Our overall training process is similar to that of the first-place team in last year's VoxSRC2020 challenge. The main difference is a r...
Preprint
Full-text available
Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversa...
Preprint
Adversarial attacks have become a major threat for machine learning applications. There is growing interest in studying these attacks in the audio domain, e.g., speech and speaker recognition, and in finding defenses against them. In this work, we focus on using representation learning to classify/detect attacks w.r.t. the attack algorithm, threat model...
Preprint
Full-text available
In this study, we analyze the use of speech and speaker recognition technologies and natural language processing to detect Alzheimer's disease (AD) and estimate Mini-Mental State Examination (MMSE) scores. We used speech recordings from the Interspeech 2021 ADReSSo challenge dataset. Our work focuses on adapting state-of-the-art speaker recognition and l...
Preprint
Full-text available
Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overco...
Preprint
Full-text available
With the increase in the availability of speech from varied domains, it is imperative to use such out-of-domain data to improve existing speech systems. Domain adaptation is a prominent pre-processing approach for this. We investigate it for adapting microphone speech to the telephone domain. Specifically, we explore CycleGAN-based unpaired translatio...
Preprint
The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model an...
Preprint
Full-text available
Research in automatic speaker recognition (SR) has been undertaken for several decades, achieving great performance. However, researchers have discovered potential loopholes in these technologies, such as spoofing attacks. Quite recently, a new genre of attack, termed adversarial attacks, has been proven fatal in computer vision, and it is vital to stud...
Preprint
Full-text available
Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-fold data augmentation, resulting in increased training time. In this paper, we propose utterance...
Preprint
Full-text available
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both CTC and source-target attention act on the same encoder representations. To understand...
Preprint
Full-text available
Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that...
Preprint
Full-text available
Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six...
Preprint
Full-text available
Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding n...
Preprint
Full-text available
Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance,...
Preprint
Full-text available
We investigated an enhancement and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation app...
Conference Paper
The promise of new neuroprotective treatments to stop or slow the advance of Parkinson's Disease (PD) calls for new biomarkers or detection schemes that can deliver a faster diagnosis. Given that speech is affected by PD, the combination of deep neural networks and speech processing can provide automatic detection schemes. Accordingly, in this stud...
Preprint
Full-text available
The promise of new neuroprotective treatments to stop or slow the advance of Parkinson's Disease (PD) calls for new biomarkers or detection schemes that can deliver a faster diagnosis. Given that speech is affected by PD, the combination of deep neural networks and speech processing can provide automatic detection schemes. Accordingly, in this stu...
Preprint
Full-text available
In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is...
Preprint
Full-text available
Data augmentation is conventionally used to inject robustness into Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trai...
Article
Full-text available
The hallmark of the information age is the ease with which information is stored, accessed, and shared throughout the globe. This is enabled, in large part, by the simplicity of duplicating digital information without error. Unfortunately, an ever-growing consequence is the global threat to security and privacy enabled by our digital reliance. Spec...
Article
Full-text available
We present a novel model adaptation approach to deal with data variability for speaker diarization in a broadcast environment. Expensive human-annotated data can be used to mitigate the domain mismatch by means of supervised model adaptation approaches. By contrast, we propose an unsupervised adaptation method which does not require in-do...
Preprint
Full-text available
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenar...
Preprint
Full-text available
Recently, very deep transformers have started to outperform traditional bi-directional long short-term memory networks by a large margin. However, to put them into production use, inference computation cost and latency are still serious concerns in real scenarios. In this paper, we study a novel non-autoregressive transformer structu...
Preprint
In this paper, we propose to improve emotion recognition by combining acoustic information and conversation transcripts. On the one hand, an LSTM network was used to detect emotion from acoustic features like f0, shimmer, jitter, MFCC, etc. On the other hand, a multi-resolution CNN was used to detect emotion from word sequences. This CNN consists o...
Conference Paper
Parkinson’s disease (PD) is a neurodegenerative disorder that severely affects motor functions. Symptoms include dysarthria and this fact has been the basis for PD detection from speech in several works. Machine learning-based technologies have made significant strides in automatic speech recognition, but their use is fairly limited for clinical di...
Preprint
Full-text available
Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss, which optimizes the enhancement network in the hidden activ...