Najim Dehak's research while affiliated with Johns Hopkins University and other places

Publications (228)

Preprint
Speech super-resolution/Bandwidth Extension (BWE) can improve downstream tasks like Automatic Speaker Verification (ASV). We introduce a simple novel technique called Self-FiLM to inject self-supervision into existing BWE models via Feature-wise Linear Modulation. We hypothesize that such information captures domain/environment information, which c...
Preprint
The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative e...
Article
Full-text available
Motor impairments are only one aspect of Parkinson's disease (PD), which also include cognitive and linguistic impairments. Speech-derived interpretable biomarkers may help clinicians diagnose PD at earlier stages and monitor the disorder's evolution over time. This study focuses on the multilingual evaluation of a composite array of biomarkers tha...
Conference Paper
Speech-based automatic approaches for evaluating neurological disorders (NDs) depend on feature extraction before the classification pipeline. It is preferable for these features to be interpretable to facilitate their development as diagnostic tools. This study focuses on the analysis of interpretable features obtained from the spoken responses of...
Conference Paper
Full-text available
Vowel space area (VSA) is an applicable metric for studying speech production deficits and intelligibility. Previous works suggest that the VSA accounts for almost 50% of the intelligibility variance, being an essential component of global intelligibility estimates. However, almost no study publishes a tool to estimate VSA automatically with public...
Article
Background: Similar to other neurological diseases, subtle early neurological changes that occur in Alzheimer's Disease (AD) are difficult to quantify and track objectively. One relevant bio-signal is eye movement, as it presents a unique window into cognitive processes as a direct measure of real-time inputs to the brain. Eye-tracking allows us t...
Article
Background: Speech and language problems are one of the earliest signs of neurodegenerative diseases including Alzheimer's disease (AD), but they are often difficult to detect especially in the early stages. Utility of assessing speech and language problems using artificial intelligence (AI) has been examined in the past. However, the accuracy of...
Article
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, e...
Preprint
Full-text available
Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the train and test domains. Mixed bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and...
Preprint
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, langu...
Preprint
Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However,...
Preprint
Full-text available
Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning...
Preprint
Full-text available
Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adve...
Preprint
Full-text available
Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone spee...
Article
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to buil...
Preprint
Full-text available
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to buil...
Article
Full-text available
Diagnosing Parkinson’s Disease (PD) necessitates monitoring symptom progression. Unfortunately, diagnostic confirmation often occurs years after disease onset. A more sensitive and objective approach is paramount to the expedient diagnosis and treatment of persons with PD (PwPDs). Recent studies have shown that we can train accurate models to detec...
Preprint
Full-text available
The pervasiveness of intra-utterance Code-switching (CS) in spoken content has enforced ASR systems to handle mixed input. Yet, designing a CS-ASR has many challenges, mainly due to the data scarcity, grammatical structure complexity, and mismatch along with unbalanced language usage distribution. Recent ASR studies showed the predominance of E2E-A...
Article
Typically, unsupervised segmentation of speech into the phone- and wordlike units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each o...
Article
Full-text available
Dialog acts can be interpreted as the atomic units of a conversation, more fine-grained than utterances, characterized by a specific communicative function. The ability to structure a conversational transcript as a sequence of dialog acts—dialog act recognition, including the segmentation—is critical for understanding dialog. We apply two pre-train...
Chapter
The emergence of smart home assistants increased the need for robust Far-Field Speaker Identification models. Speaker Identification enables the assistants to perform personalized tasks. Smart home assistants face very challenging speech conditions, including various room shapes and sizes, various distances of the speaker from the microphone, vario...
Preprint
Full-text available
Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each o...
Article
Adversarial examples are designed to fool the speaker recognition (SR) system by adding a carefully crafted human-imperceptible noise to the speech signals. Posing a severe security threat to state-of-the-art SR systems, it becomes vital to deep-dive and study their vulnerabilities. Moreover, it is of greater importance to propose countermeasures t...
Preprint
This technical report describes Johns Hopkins University speaker recognition system submitted to Voxceleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the proposed one from the first place team in the last year's VoxSRC2020 challenge. The main difference is a r...
Preprint
Full-text available
Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversa...
Preprint
Full-text available
Capitalization and punctuation are important cues for comprehending written texts and conversational transcripts. Yet, many ASR systems do not produce punctuated and case-formatted speech transcripts. We propose to use a multi-task system that can exploit the relations between casing and punctuation to improve their prediction performance. Whereas...
Preprint
Adversarial attacks have become a major threat for machine learning applications. There is a growing interest in studying these attacks in the audio domain, e.g, speech and speaker recognition; and find defenses against them. In this work, we focus on using representation learning to classify/detect attacks w.r.t. the attack algorithm, threat model...
Preprint
Full-text available
Dialog acts can be interpreted as the atomic units of a conversation, more fine-grained than utterances, characterized by a specific communicative function. The ability to structure a conversational transcript as a sequence of dialog acts -- dialog act recognition, including the segmentation -- is critical for understanding dialog. We apply two pre...
Preprint
Full-text available
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contra...
Preprint
Full-text available
In this study, we analyze the use of speech and speaker recognition technologies and natural language processing to detect Alzheimer disease (AD) and estimate mini-mental status evaluation (MMSE) scores. We used speech recordings from Interspeech 2021 ADReSSo challenge dataset. Our work focuses on adapting state-of-the-art speaker recognition and l...
Preprint
Full-text available
Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overco...
Preprint
Full-text available
With the increase in the availability of speech from varied domains, it is imperative to use such out-of-domain data to improve existing speech systems. Domain adaptation is a prominent pre-processing approach for this. We investigate it for adapt microphone speech to the telephone domain. Specifically, we explore CycleGAN-based unpaired translatio...
Article
Parkinson's Disease (PD) affects speech in the form of dysphonia and hypokinetic dysarthria. Multiple studies have evaluated PD's influence on different aspects of speech, showing differences between speakers with and without PD. Most recent studies are focused on the proposal of new automatic and objective tools to help in the diagnosis and severi...
Preprint
The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model an...
Preprint
Full-text available
Research in automatic speaker recognition (SR) has been undertaken for several decades, reaching great performance. However, researchers discovered potential loopholes in these technologies like spoofing attacks. Quite recently, a new genre of attack, termed adversarial attacks, has been proved to be fatal in computer vision and it is vital to stud...
Chapter
Several aspects of a person’s speech can be influenced by Parkinson’s Disease (PD), namely, phonatory, articulatory, prosodic, and cognitive-linguistic aspects. Several studies in literature extract information about these aspects to automatically detect or assess PD employing corpora containing different speech tasks from patients and control subj...
Chapter
Our preliminary data and experiments show the potentiality of the artificial intelligence techniques to identify Parkinson’s disease and to assess its extent using the information extracted from the speech. Our conclusions are based on several prospective studies with different corpora of parkinsonian and control subjects modelled following differe...
Article
Very deep transformers outperform conventional bi-directional long short-term memory networks for automatic speech recognition (ASR) by a significant margin. However, being autoregressive models, their computational complexity is still a prohibitive factor in their deployment into production systems. To amend this problem, we study two different no...
Preprint
Full-text available
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both, CTC and source-target attention, are acting on the same encoder representations. To understand...
Preprint
Full-text available
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a MWD system to learn meaningful word-level representation. We verify our theory by conducting retrieval and word discovery experiments on MS...
Preprint
Full-text available
Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that...