About
711 Publications
166,768 Reads
15,753 Citations
Publications (711)
Because a reference signal is often unavailable in real-world scenarios, reference-free speech quality and intelligibility assessment models are important for many speech processing applications. Although many deep-learning models have been applied to build non-intrusive speech assessment approaches and achieve promising performanc...
Privacy is a hot topic for policymakers across the globe, including the United States. Evolving advances in AI and emerging concerns about the misuse of personal data have pushed policymakers to draft legislation on trustworthy AI and privacy protection for its citizens. This paper presents the state of the privacy legislation at the U.S. Congress...
Advances in automatic speaker verification (ASV) promote research into the formulation of spoofing detection systems for real-world applications. The performance of ASV systems can be degraded severely by multiple types of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins and impersonation, especially in the case...
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial to synthesizing natural speech. Although speaker embeddings have been widely used in personalized speech synthesis as conditioning inputs, they are designed to discard variation in order to optimize speaker recognition accuracy. Thus, they are suboptimal for...
Bilingual children at a young age can benefit from exposure to dual language, impacting their language and literacy development. Speech technology can aid in developing tools to accurately quantify children's exposure to multiple languages, thereby helping parents, teachers, and early-childhood practitioners to better support bilingual children. Th...
The ability to accurately classify accents and assess accentedness in non-native speakers is challenging, due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy...
Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing...
Accurately classifying accents and assessing accentedness in non-native speakers are both challenging tasks due to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pre-trained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent cl...
Reverberation and background noise can degrade speech quality and intelligibility when captured by a distant microphone. In recent years, researchers have developed several deep learning (DL)-based single-channel speech dereverberation systems that aim to minimize distortions introduced into speech captured in naturalistic environments. A majority...
The presence of background noise or competing talkers is one of the main communication challenges for cochlear implant (CI) users in speech understanding in naturalistic spaces. These external factors distort the time-frequency (T-F) content including magnitude spectrum and phase of speech signals. While most existing speech enhancement (SE) soluti...
Observational approaches may limit researchers' ability to comprehensively capture preschool classroom conversations, including the use of wh-words. In the current proof-of-concept study, we present descriptive results using an automated speech recognition (ASR) system coupled with location sensors to quantify teachers' wh-words by preschool teache...
This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the Self-Supervised Learning (SSL) model, brought by an accent identification (AID) fine-tuning task. This problem is addressed based on model probing. Specifically, we conduct a systematic layer-wise analysis of the representations of th...
Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker informa...
Previously, selection of l channels was prioritized according to formant frequency locations in an l-of-n-of-m-based signal processing strategy to provide important voicing information independent of listening environments for cochlear implant (CI) users. In this study, ideal, or ground truth, formants were incorporated into the selection stage to...
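The l-of-n-of-m selection with formant priority described above can be sketched in a few lines. The envelope values, the "formant channels first" ordering, and the function name below are illustrative assumptions, not the paper's actual strategy:

```python
import numpy as np

def select_channels(envelopes, formant_channels, l, n):
    """l-of-n-of-m channel selection with formant priority (sketch).

    envelopes: (m,) per-channel envelope energies for one frame.
    First pick the n highest-energy channels (n-of-m), then from
    those keep l channels, prioritizing any that carry formants.
    """
    n_best = np.argsort(envelopes)[::-1][:n]       # n-of-m maxima selection
    # Formant-bearing channels go first, then remaining maxima by energy
    prioritized = [c for c in n_best if c in formant_channels]
    rest = [c for c in n_best if c not in formant_channels]
    return (prioritized + rest)[:l]
```

In this sketch the "ideal" formant channel indices are supplied externally, mirroring the ground-truth formants the abstract mentions incorporating into the selection stage.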
While speech understanding for cochlear implant (CI) users in quiet is relatively effective, listeners experience difficulty in identifying speakers and sound locations. To better support residual hearing and speech intelligibility, bilateral and bimodal forms of assisted hearing are becoming popular among CI users. Effectiv...
With the rapid development of intelligent vehicles and advanced driver-assistance systems (ADAS), a new trend is that intelligent vehicles with different levels of automation will coexist in large-scale traffic scenarios. In such scenarios, the automation level of intelligent vehicles could range from the Society of Automotive Engineers level 0 (i....
With emerging trends in advanced driver-assistance systems and autopilot vehicle development, future vehicle advancements are expected to monitor, assess, and predict or anticipate driver characteristics and intentions. Achieving proactive warning abilities would allow for more harmonious vehicle-driver engagement and lead to improve...
Speech and language development are early indicators of overall analytical and learning ability in pre-school children. Early childhood researchers are interested in analyzing naturalistic versus controlled lab recordings to assess both quality and quantity of such communication interactions between children and adults/teachers. Unfortunately, pres...
In the area of speech processing, human speaker identification under naturalistic environments is a challenging task, especially for hearing-impaired individuals with cochlear implants (CIs) or hearing aids (HAs). Motivated by the fact that electrodograms reflect direct CI stimulation of input audio, this study proposes a speaker identification (ID...
Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, and variable-length sequences containing diverse information at each time-frequency (TF) location. The standard convolutional layer that operates on neighboring local regions often fail...
Speech activity detection (SAD) serves as a crucial front-end system to several downstream Speech and Language Technology (SLT) tasks such as speaker diarization (SD), speaker identification (SID), and speech recognition (ASR). Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise...
One important obstacle to optimizing fitting and sound coding for auditory implants is the lack of flexible, powerful, and portable platforms that can be used in real-world listening environments by implanted patients. The clinical processors and the typically available research tools either do not have sufficient computational power and flexibility or...
Several speech processing systems have demonstrated considerable performance improvements when deep complex neural networks (DCNN) are coupled with self-attention (SA) networks. However, the majority of DCNN-based studies on speech dereverberation that employ self-attention do not explicitly account for the inter-dependencies between real and imagi...
In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of chann...
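The filterbank learning the abstract describes amounts, at its core, to replacing fixed mel triangles with a trainable matrix applied to the power spectrogram. A minimal NumPy sketch, where the shapes, function name, and log-compression constant are assumptions:

```python
import numpy as np

def apply_filterbank(power_spec, fb):
    """Apply a (learnable) filterbank to a power spectrogram.

    power_spec: (frames, n_fft_bins); fb: (n_channels, n_fft_bins).
    In filterbank learning, fb is a trainable weight matrix rather
    than fixed mel triangles; the abstract reports gains when the
    number of channels is severely decreased (e.g. a handful).
    """
    # Each output channel is a learned weighted sum of FFT bins
    energies = power_spec @ fb.T                   # (frames, n_channels)
    # Log compression, as with conventional log-mel features
    return np.log(energies + 1e-8)
```

In a real KWS front end, `fb` would be updated by backpropagation jointly with the classifier instead of being handcrafted.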
Automatic speaker verification systems are vulnerable to a variety of access threats, prompting research into the formulation of effective spoofing detection systems to act as a gate to filter out such spoofing attacks. This study introduces a simple attention module to infer 3-dim attention weights for the feature map in a convolutional layer, whi...
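One simple way to realize 3-dim (channel x time x frequency) attention weights over a conv feature map is a parameter-free, energy-based gate in the style of SimAM. The sketch below is an illustrative stand-in under that assumption, not the module proposed in the paper:

```python
import numpy as np

def simple_3d_attention(x, eps=1e-4):
    """Parameter-free 3-dim attention sketch (one simple realization).

    x: (C, T, F) conv feature map. Each element gets its own weight,
    so a single map can emphasize channel, time, and frequency jointly.
    """
    # Per-channel statistics over the time-frequency plane
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    # Elements far from their channel mean receive larger scores
    score = (x - mu) ** 2 / (4.0 * (var + eps)) + 0.5
    # Sigmoid squashes scores into (0, 1) attention weights
    w = 1.0 / (1.0 + np.exp(-score))
    return x * w
```

Because the weights are computed from the feature map itself, the module adds no learnable parameters to the convolutional layer it augments.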
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model learned from large-scale data toward task-specific, small-scale data scenarios. However, previous studies focus on single-domain adaptation, which neglects a more practical scenario where training data are collected from multiple...
Naturalistic sounds encode salient acoustic content that provides situational context or subject/system properties essential for acoustic awareness, autonomy, safety, and improved quality of life for individuals with sensorineural hearing loss. Cochlear implants (CIs) are an assistive hearing device that restores auditory function in hearing impair...
The Fearless Steps Challenge 2019 Phase-1 (FSC-P1) is the inaugural Challenge of the Fearless Steps Initiative hosted by the Center for Robust Speech Systems (CRSS) at the University of Texas at Dallas. The goal of this Challenge is to evaluate the performance of state-of-the-art speech and language systems for large task-oriented teams with natura...
Monitoring child development in terms of speech/language skills has a long-term impact on their overall growth. As student diversity continues to expand in US classrooms, there is a growing need to benchmark social-communication engagement, both from a teacher-student perspective, as well as student-student content. Given various challenges with di...
Neural network pruning can be effectively applied to compress automatic speech recognition (ASR) models. However, in multilingual ASR, performing language-agnostic pruning may lead to severe performance degradation on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters....
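The contrast the abstract draws between language-agnostic and language-specific pruning can be sketched with simple magnitude-based masks. The helper names and the per-language importance scores below are hypothetical illustrations, not the paper's method:

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Binary pruning mask keeping the largest-magnitude weights."""
    k = int(weights.size * sparsity)               # number of weights to drop
    thresh = np.sort(np.abs(weights).ravel())[k]   # k-th smallest magnitude
    return (np.abs(weights) >= thresh).astype(weights.dtype)

# Language-specific pruning (sketch): derive one mask per language from
# per-language importance scores, instead of a single agnostic mask that
# may discard parameters some languages still rely on.
def language_masks(importance_by_lang, sparsity):
    return {lang: magnitude_prune_mask(scores, sparsity)
            for lang, scores in importance_by_lang.items()}
```

Per-language masks let each language keep its own important parameters at the cost of storing one mask per language.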
A supportive environment is vital for overall cognitive development in children. Challenges with direct observation and limited access to data-driven approaches often hinder teachers and practitioners in early childhood research from modifying or enhancing classroom structures. Deploying sensor-based tools in naturalistic preschool classrooms will t...
While speech understanding for cochlear implant (CI) and hearing aid (HA) users in quiet is relatively effective, listeners experience difficulty in identifying speakers and sound locations. Previous studies have reported improved localization and better speech perception when the unilateral CI is coupled with a HA in the contralateral ear. Th...
Recently, attention mechanisms have been applied successfully in neural network-based speaker verification systems. Incorporating the Squeeze-and-Excitation block into convolutional neural networks has achieved remarkable performance. However, it uses global average pooling (GAP) to simply average the features along time and frequency dimensions, w...
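The Squeeze-and-Excitation gate with global average pooling (GAP) that this abstract builds on fits in a few lines. The shapes and the bottleneck weights `w1`/`w2` below are illustrative assumptions, not a specific paper's configuration:

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Squeeze-and-Excitation channel attention (sketch).

    feature_map: (C, T, F) conv features; w1: (C//r, C); w2: (C, C//r),
    where r is the bottleneck reduction ratio.
    """
    # Squeeze: global average pooling over time and frequency
    z = feature_map.mean(axis=(1, 2))              # (C,)
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate
    h = np.maximum(w1 @ z, 0.0)                    # (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))            # (C,) weights in (0, 1)
    # Recalibrate: scale each channel map by its learned weight
    return feature_map * s[:, None, None]
```

The abstract's criticism is visible in the first line: GAP collapses the whole time-frequency plane of each channel to a single average, discarding where in time or frequency the energy lies.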
Convolutional neural networks (CNN) have greatly improved speech recognition performance by exploiting localized time-frequency patterns. However, the conventional CNN operation assumes these patterns appear within symmetric, rigid kernels. This motivates the question: what about asymmetric kernels? In this study, we illustrate that adaptive views can...
Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study...
Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-t...
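A confidence-based pseudo-labeling (PL) round of the kind the abstract leverages can be sketched as a filter over a seed model's posteriors. The function name, data layout, and threshold below are hypothetical illustrations, not the paper's procedure:

```python
def pseudo_label(unlabeled_scores, threshold=0.9):
    """Confidence-based pseudo-labeling (sketch).

    unlabeled_scores: list of (utterance_id, posterior dict over
    phonemes) produced by a seed model on unlabeled L2 speech.
    Utterances whose best posterior clears the threshold are kept
    with the predicted phoneme as a pseudo-label; the rest are
    discarded for this round.
    """
    selected = []
    for utt_id, posteriors in unlabeled_scores:
        best = max(posteriors, key=posteriors.get)
        if posteriors[best] >= threshold:
            selected.append((utt_id, best))
    return selected
```

The selected pairs would then be mixed with the human-annotated data to fine-tune the recognizer, and the loop can be repeated with the improved model.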
Adult-child interaction is an important component for language development in young children. However, such development varies based on the quality and quantity of these conversations. Teachers responsible for the language acquisition of their students have a vested interest in improving such conversation in their classrooms. Advancements in speech...
Assessing child growth in terms of speech and language development is a critical indicator of long term learning ability and life-long development progress. The earlier an at-risk child is identified, the earlier support can be provided to reduce the social impact of the speech or language issue. The preschool classroom provides an opportunity for...
Natural compensation of speech production in challenging listening environments is referred to as the Lombard effect (LE). The resulting acoustic differences between neutral and Lombard speech have been shown to provide intelligibility benefits for normal hearing (NH) and cochlear implant (CI) listeners alike. Motivated by this outcome, three LE pe...
Audio analysis for forensic speaker verification offers unique challenges in system performance due in part to data collected in naturalistic field acoustic environments where location/scenario uncertainty is common in the forensic data collection process. Forensic speech data as potential evidence can be obtained in random naturalistic environment...
With advancements in deep learning approaches, the performance of speech enhancement systems in the presence of background noise has shown significant improvements. However, improving system robustness against reverberation is still a work in progress, as reverberation tends to cause loss of formant structure due to smearing effects in tim...
Training Automatic Speech Recognition (ASR) systems with sequentially incoming data from alternate domains is an essential milestone on the path to human-level intelligibility in speech recognition. One evident sequential learning approach combines and stores the entire incoming data, then retrains the model to...
With the rapid development of intelligent vehicles and Advanced Driver-Assistance Systems (ADAS), a new trend is that mixed levels of human driver engagements will be involved in the transportation system. Therefore, necessary visual guidance for drivers is vitally important under this situation to prevent potential risks. To advance the developmen...
The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background...
Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of concurrent speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attentio...
Experimental hardware-research interfaces play a crucial role during the developmental stages of any medical signal-monitoring system, as they allow researchers to test and optimize output results before perfecting the design for the actual FDA-approved medical device and large-scale production. These testing platforms intake the raw signal through...
Many communities which are experiencing increased gun violence are turning to acoustic gunshot detection systems (GSDS) with the hope that their deployment would provide increased 24/7 monitoring and the potential for more rapid response by law enforcement to the scene. In addition to real-time monitoring, data collected by gunshot detection system...
Young children's friendships fuel essential developmental outcomes (e.g., social-emotional competence) and are thought to provide even greater benefits to children with or at-risk for disabilities. Teacher and parent report and sociometric measures are commonly used to measure friendships, and ecobehavioral assessment has been used to capture its f...
In this study, we propose to investigate triplet loss as an alternative feature representation for ASR. We consider a general non-semantic speech representation, trained with a self-supervised criterion based on triplet loss, called TRILL, for acoustic modeling to represent the acoustic characteristics of each audio. This str...
While speech understanding for cochlear implant (CI) users in quiet is relatively effective, listeners experience difficulty in identification of speaker and sound location. Previous studies have reported improved localization and better speech perception when the CI is coupled with a second CI or hearing aid in the contralateral ear. This is refer...
An objective metric that predicts speech intelligibility under different types of noise and distortion would be desirable in voice communication. To date, the majority of studies concerning speech intelligibility metrics have focused on predicting the effects of individual noise or distortion mechanisms. This study proposes an objective metric, the...
Intelligent vehicles and Advanced Driver Assistance Systems (ADAS) have been developed rapidly over the past few years. Many applications such as vehicle localization, environment perception, and path planning have shown promising potential. While there is great interest in migrating from completely human-controlled vehicles toward fully autono...
Part 1: Broad Description: This Early-concept Grants for Exploratory Research (EAGER) funding project focuses on exploring and developing a novel operational collection of speech, language and perception-based measures to objectively assess speech intelligibility for second language (L2) speech production, as well as providing effective learner-spe...
Speech, speaker, and language systems have traditionally relied on carefully collected speech material for training acoustic models. There is an enormous amount of freely accessible audio content. A major challenge, however, is that such data is not professionally recorded, and therefore may contain a wide diversity of background noise, nonlinear d...
The generative power of Generative Adversarial Networks (GANs) has shown great promise for learning representations from unlabelled data while guided by a small amount of labelled data. We aim to utilise this generative power of GANs to learn audio representations. Most existing studies, however, are focused on images. Some studies use GANs for s...
It has been previously shown that advantages in auditory processing exist when the following situational context traits or subject/system properties are present: (i) availability of a wider radial range up to 360 degrees, (ii) intolerance towards acoustic or visual obstruction, (iii) distant event horizon, (iv) availability for quick neural process...
Performance of Automatic Speech Recognition (ASR) systems is known to suffer considerable degradation when exposed to Far-Field speech data capture. Consequently, far-field ASR has received considerable attention in recent years. Motivated by our recent work using Curriculum Learning (CL) based strategies to improve Speaker Identification (SID) und...