Visar Berisha’s research while affiliated with Arizona State University and other places


Publications (230)


The Impact of Decorrelation on Transformer Interpretation Methods: Applications to Clinical Speech AI
  • Conference Paper

April 2025

Lingfeng Xu · … · Julie Liss · Visar Berisha



Comparison of a standard alignment workflow (top row) to our trainable alignment workflow (bottom row). In the standard workflow, the acoustic model is first adapted to the target speech corpus, speaker adaptation is applied, and alignments are generated. In contrast, the trainable workflow allows gold-standard, task-specific alignment rules to be learned by the acoustic model: a manual annotator provides task-specific alignments for a subset of the corpus, and, while speaker adaptation and acoustic model adaptation can still be performed, the model is also trained on the manual alignments. The final forced alignments are thus generated by a model that has learned the gold-standard alignment rules. GMM = Gaussian mixture model; VTLN = vocal tract length normalization; fMLLR = feature space maximum likelihood linear regression.
Block diagram of the speaker-adaptive Wav2Vec2 forced alignment system. The system requires an audio file and its corresponding transcript. Grapheme-to-phoneme conversion generates the expected phoneme sequence for alignment. From each input audio file, a speaker embedding is extracted, frame-wise Wav2Vec2 features are calculated, and the speaker embedding is appended to each per-frame feature vector. A per-frame softmax provides a probability distribution over all phonemes for each frame. To produce the final alignment, dynamic time warping uses the per-frame phoneme probabilities to assign each phoneme in the expected phoneme sequence to a frame.
(A) Heatmap of threshold accuracy by phoneme type and alignment method for the child speech corpus used in this study. Threshold accuracy is calculated in accordance with Kreuk et al., 2020, and Zhu et al., 2022, as the percentage of onset boundary errors smaller than a 20 ms threshold. (B) Heatmap of threshold accuracy (percentage of onset errors less than a 20 ms threshold) by phoneme type and alignment method for the TIMIT data set.
(A) Empirical cumulative distribution function of onset and offset error for all phoneme intervals in the child speech corpus. The x-axis is the error threshold in milliseconds; the y-axis denotes the fraction of phoneme intervals with onset or offset error less than the threshold. (B) Empirical cumulative distributions of onset and offset errors by phoneme subtype. MFA = Montreal Forced Aligner; SAT = speaker adaptive training; W2V2 = Wav2Vec2.
Midpoint accuracy as a function of age for the child speech corpus, plotted for MFA_Adapt, W2V2TrainedSAT, and W2V2Trained. A linear fit for each method shows the average accuracy at a given age; each point represents one speaker. MFA = Montreal Forced Aligner; SAT = speaker adaptive training; W2V2 = Wav2Vec2.


A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech
  • Article

March 2025 · 20 Reads

Purpose: Phonetic forced alignment has a multitude of applications in automated analysis of speech, particularly in studying nonstandard speech such as children's speech. Manual alignment is tedious but serves as the gold standard for clinical-grade alignment, and current tools do not support direct training on manual alignments. Thus, a trainable, speaker-adaptive phonetic forced alignment system, Wav2TextGrid, was developed for children's speech. The source code for the method is publicly available, along with a graphical user interface, at https://github.com/pkadambi/Wav2TextGrid. Method: We propose a trainable, speaker-adaptive, neural forced aligner developed using a corpus of 42 neurotypical children from 3 to 6 years of age. Evaluation on both child speech and the TIMIT corpus demonstrates aligner performance across age and dialectal variations. Results: The trainable alignment tool markedly improved accuracy over baseline on several alignment quality metrics, for all phoneme categories. Accuracy for plosives and affricates in children's speech improved by more than 40% over baseline. Performance matched existing methods using approximately 13 min of labeled data, while approximately 45–60 min of labeled alignments yielded significant improvement. Conclusion: The Wav2TextGrid tool allows alternate alignment workflows in which the forced alignments, via training, are directly tailored to match clinical-grade, manually provided alignments. Supplemental Material: https://doi.org/10.23641/asha.28593971
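The final alignment step described in the Method, assigning each phoneme in the expected sequence to a contiguous span of frames from per-frame phoneme probabilities, can be illustrated with a short dynamic-programming sketch. This is a minimal illustration, not the Wav2TextGrid implementation; the function, its monotonic-alignment cost, and the assumption that inputs are log-probabilities are all illustrative.

```python
import numpy as np

def force_align(log_probs, phoneme_ids):
    """Monotonically align an expected phoneme sequence to audio frames.

    log_probs:   (T, V) per-frame log-probabilities over the phoneme set.
    phoneme_ids: expected phoneme index sequence of length K (K <= T).
    Returns a (start_frame, end_frame) span for each phoneme.
    """
    T, K = log_probs.shape[0], len(phoneme_ids)
    dp = np.full((T, K), -np.inf)            # dp[t, k]: best score aligning
    dp[0, 0] = log_probs[0, phoneme_ids[0]]  # frames 0..t to phonemes 0..k
    back = np.zeros((T, K), dtype=int)       # 0 = stay on phoneme, 1 = advance
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]
            move = dp[t - 1, k - 1] if k > 0 else -np.inf
            back[t, k] = int(move > stay)
            dp[t, k] = max(stay, move) + log_probs[t, phoneme_ids[k]]
    # Backtrace from the final frame/phoneme to recover per-phoneme spans.
    spans, k, end = [], K - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, k]:                       # phoneme k starts at frame t
            spans.append((t, end))
            end, k = t - 1, k - 1
    spans.append((0, end))
    return spans[::-1]
```

In the actual system, the per-frame distributions come from a softmax over Wav2Vec2 features augmented with a speaker embedding, per the block diagram above.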


360 Using machine learning to analyze voice and detect aspiration

March 2025 · 6 Reads

Journal of Clinical and Translational Science

Objectives/Goals: Aspiration causes or aggravates lung diseases. While bedside swallow evaluations are neither sensitive nor specific, gold-standard tests for aspiration are invasive, uncomfortable, expose patients to radiation, and are resource intensive. We propose the development and validation of an AI model that analyzes voice to noninvasively predict aspiration. Methods/Study Population: Retrospectively recorded [i] phonations from 163 unique ENT patients were analyzed for acoustic features including jitter, shimmer, harmonic-to-noise ratio (HNR), etc. Patients were classified into three groups based on videofluoroscopic swallow study (VFSS) findings: aspirators (Penetration-Aspiration Scale, PAS 6–8), probable aspirators (PAS 3–5), and non-aspirators (PAS 1–2). Multivariate analysis evaluated patient demographics, history of head and neck surgery, radiation, neurological illness, obstructive sleep apnea, esophageal disease, body mass index, and vocal cord dysfunction. Supervised machine learning using five-fold cross-validated neural additive modelling (NAM) was performed on the phonations of aspirators versus non-aspirators. The model was then validated on an independent, external database. Results/Anticipated Results: Aspirators had quantifiably worse voice quality, with higher jitter and shimmer but a lower harmonic-to-noise ratio. NAM modelling classified aspirators and non-aspirators as distinct groups (aspirator NAM risk score 0.528 ± 0.2478 (mean ± SD) vs. non-aspirator (control) risk score 0.252 ± 0.241 (mean ± SD); p < …). Discussion/Significance of Impact: We report the use of voice as a novel, noninvasive biomarker to detect aspiration risk using machine learning techniques. This tool has the potential to be used for safe and early detection of aspiration in a variety of clinical settings, including intensive care units, wards, outpatient clinics, and remote monitoring.
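As background on the acoustic features named in this abstract, the sketch below gives textbook definitions of local jitter and shimmer plus an autocorrelation-based HNR estimate. It is a minimal illustration assuming cycle periods, peak amplitudes, and the pitch lag have already been extracted by a pitch tracker; it is not the study's feature pipeline.

```python
import numpy as np

def perturbation_measures(periods, peak_amps, frame, pitch_lag):
    """Classic voice perturbation measures from a sustained vowel.

    periods:   successive glottal cycle durations (seconds), 1-D array.
    peak_amps: per-cycle peak amplitudes, 1-D array.
    frame:     short stationary waveform segment, 1-D array.
    pitch_lag: pitch period of the frame, in samples.
    """
    # Local jitter: mean absolute change between consecutive cycle
    # durations, normalized by the mean period.
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    # Local shimmer: the same idea applied to cycle peak amplitudes.
    shimmer = np.mean(np.abs(np.diff(peak_amps))) / np.mean(peak_amps)

    # HNR from the normalized autocorrelation at the pitch lag
    # (Boersma-style): r estimates the harmonic fraction of signal power.
    x = frame - np.mean(frame)
    r = np.dot(x[:-pitch_lag], x[pitch_lag:]) / np.dot(x, x)
    r = np.clip(r, 1e-6, 1.0 - 1e-6)
    hnr_db = 10.0 * np.log10(r / (1.0 - r))

    return {"jitter": jitter, "shimmer": shimmer, "hnr_db": hnr_db}
```

Higher jitter and shimmer with lower HNR, the pattern the abstract reports for aspirators, correspond to noisier, less periodic phonation.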


Figure 1: The radar emits mmWave pulses and receives the reflection from neck displacement to recover the vibration signal. The microphone captures the speech pressure wave that has been shaped by the vocal tract.
Table: Descriptive Statistics
A Speech Production Model for Radar: Connecting Speech Acoustics with Radar-Measured Vibrations

March 2025 · 38 Reads

Millimeter-wave (mmWave) radar has emerged as a promising modality for speech sensing, offering advantages over traditional microphones. Prior works have demonstrated that radar captures motion signals related to vocal vibrations, but there is a gap in understanding the analytical connection between radar-measured vibrations and acoustic speech signals. We establish a mathematical framework linking radar-captured neck vibrations to speech acoustics and derive an analytical relationship between neck surface displacements and speech. We use data from 66 human participants and statistical spectral distance analysis to empirically assess the model. Our results show that the radar-measured signal aligns more closely with the model-filtered vibration signal derived from speech than with raw speech itself. These findings provide a foundation for improved radar-based speech processing for applications in speech enhancement, coding, surveillance, and authentication.
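A minimal sketch of the spectral-distance comparison described in the abstract is given below, using the common log-spectral distance. The low-pass filter standing in for the speech-to-neck-vibration model is a hypothetical placeholder; the paper derives an analytical transfer function that is not reproduced here.

```python
import numpy as np
from scipy.signal import butter, sosfilt, welch

def log_spectral_distance(x, y, fs):
    """RMS distance (dB) between the log power spectra of two signals."""
    _, px = welch(x, fs=fs, nperseg=1024)
    _, py = welch(y, fs=fs, nperseg=1024)
    diff = 10 * np.log10(px + 1e-12) - 10 * np.log10(py + 1e-12)
    return float(np.sqrt(np.mean(diff ** 2)))

def toy_neck_model(speech, fs, cutoff_hz=600.0):
    """Hypothetical stand-in for the derived speech-to-vibration model:
    a low-pass filter, since neck tissue attenuates high frequencies."""
    sos = butter(4, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfilt(sos, speech)

# The paper's finding corresponds to the pattern (for aligned arrays):
#   log_spectral_distance(radar, toy_neck_model(speech, fs), fs)
#     < log_spectral_distance(radar, speech, fs)
```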


Fig. 1. The Cookie Theft picture and CIUs (marked in red) [5], [17].
Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment

February 2025 · 18 Reads

Existing methods for analyzing linguistic content from picture descriptions for assessment of cognitive-linguistic impairment often overlook the participant's visual narrative path, which typically requires eye tracking to assess. Spatio-semantic graphs are a useful tool for analyzing this narrative path from transcripts alone; however, they are limited by the need for manual tagging of content information units (CIUs). In this paper, we propose an automated approach for estimating spatio-semantic graphs (via automated extraction of CIUs) from the Cookie Theft picture commonly used in cognitive-linguistic analyses. The method enables automatic characterization of the visual semantic path during picture description. Experiments demonstrate that the automatic spatio-semantic graphs effectively differentiate between cognitively impaired and unimpaired speakers. Statistical analyses reveal that the features derived by the automated method produce results comparable to the manual method, with even greater differences between clinical groups of interest. These results highlight the potential of the automated approach for extracting spatio-semantic features in developing clinical speech models for cognitive impairment assessment.
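To make the spatio-semantic graph concrete, the sketch below chains CIUs, in the order a speaker mentions them, into a path over the picture and summarizes it with simple features. The region coordinates, CIU subset, and feature definitions are illustrative assumptions, not the paper's.

```python
import numpy as np

# Hypothetical normalized (x, y) coordinates for a few Cookie Theft CIUs;
# a real analysis would use an agreed region map for the full inventory.
CIU_COORDS = {
    "boy": (0.25, 0.35), "stool": (0.25, 0.55), "cookie jar": (0.30, 0.20),
    "mother": (0.70, 0.40), "sink": (0.75, 0.55), "water": (0.78, 0.65),
}

def spatio_semantic_path(cius_in_order):
    """Chain CIUs (in spoken order) into a narrative path over the picture
    and return simple path statistics approximating the visual route."""
    pts = np.array([CIU_COORDS[c] for c in cius_in_order if c in CIU_COORDS])
    if len(pts) < 2:
        return {"n_cius": len(pts), "path_length": 0.0,
                "mean_hop": 0.0, "n_scene_jumps": 0}
    hops = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # per-transition jump
    return {
        "n_cius": int(len(pts)),
        "path_length": float(hops.sum()),          # total spatial travel
        "mean_hop": float(hops.mean()),            # average jump size
        "n_scene_jumps": int((hops > 0.3).sum()),  # large cross-scene hops
    }

# A description that ping-pongs across the scene yields a longer path:
# spatio_semantic_path(["boy", "sink", "cookie jar", "mother"])
```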



Potential Applications of Artificial Intelligence for Cross-language Intelligibility Assessment of Dysarthric Speech

January 2025 · 6 Reads

Purpose: This commentary introduces how artificial intelligence (AI) can be leveraged to advance cross-language intelligibility assessment of dysarthric speech. Method: We propose a dual-component framework consisting of a universal module that generates language-independent speech representations and a language-specific intelligibility model that incorporates linguistic nuances. Additionally, we identify key barriers to cross-language intelligibility assessment, including data scarcity, annotation complexity, and limited linguistic insights, and present AI-driven solutions to overcome these challenges. Conclusion: Advances in AI offer transformative opportunities to enhance cross-language intelligibility assessment for dysarthric speech by balancing scalability across languages with adaptability to individual languages.
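The dual-component framework can be sketched as a shared encoder feeding per-language heads, as below. The encoder type, feature dimension, and pooling are illustrative assumptions standing in for whichever pretrained universal speech model and language-specific design an implementation would choose.

```python
import torch
import torch.nn as nn

class CrossLanguageIntelligibility(nn.Module):
    """Sketch of a universal module plus language-specific heads."""

    def __init__(self, languages, n_mels=80, feat_dim=256):
        super().__init__()
        # Universal module: stand-in for a pretrained speech encoder that
        # yields language-independent representations.
        self.encoder = nn.GRU(input_size=n_mels, hidden_size=feat_dim,
                              batch_first=True)
        # Language-specific module: a lightweight intelligibility head per
        # language, intended to capture linguistic nuances.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(feat_dim, 1) for lang in languages})

    def forward(self, mel_frames, lang):
        feats, _ = self.encoder(mel_frames)          # (B, T, feat_dim)
        pooled = feats.mean(dim=1)                   # utterance-level summary
        return self.heads[lang](pooled).squeeze(-1)  # intelligibility score

# model = CrossLanguageIntelligibility(["en", "ko", "es"])
# score = model(torch.randn(2, 200, 80), lang="en")
```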


Figure 1: The sequential label-efficient framework
Figure 2: An example of the proposed framework and the baseline.
Advanced Tutorial: Label-Efficient Two-Sample Tests

January 2025 · 9 Reads

Hypothesis testing is a statistical inference approach used to determine whether data support a specific hypothesis. An important type is the two-sample test, which evaluates whether two sets of data points are drawn from identical distributions. This test is widely used, for example by clinical researchers comparing treatment effectiveness. This tutorial explores two-sample testing in a context where an analyst has many features from two samples, but determining the sample membership (or labels) of these features is costly. In machine learning, a similar scenario is studied in active learning. This tutorial extends active learning concepts to two-sample testing within this label-costly setting while maintaining statistical validity and high testing power. Additionally, the tutorial discusses practical applications of these label-efficient two-sample tests.
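For context, the fully labeled two-sample test that the tutorial builds on can be sketched as a permutation test, as below; the label-efficient variants the tutorial covers instead decide which sample-membership labels to query. The mean-difference statistic is an illustrative choice.

```python
import numpy as np

def permutation_two_sample_test(x, y, n_perm=10_000, seed=0):
    """p-value for H0: x and y come from the same distribution.

    Statistic: absolute difference of sample means. Under H0 the labels
    are exchangeable, so the pooled data is re-split at random.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = abs(np.mean(x) - np.mean(y))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(np.mean(perm[:len(x)]) - np.mean(perm[len(x):]))
        count += stat >= observed
    # The +1 correction keeps the test valid at a finite permutation count.
    return (count + 1) / (n_perm + 1)

# In the label-costly setting, each label (membership in x vs. y) must be
# purchased; active-learning-style querying targets the most informative ones.
```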


Citations (57)


... Valuing and rewarding perceived novelty and potential impact over basic rigor and responsible reporting can lead researchers to inflate claims in hopes of acceptance in the most prestigious venues. It can also skew the literature in other ways, leading to so-called "publication bias" (Saidi, Dasarathy, and Berisha 2024). Here, in addition to spin and the aforementioned selective reporting of (usually positive) results, the role of the peer review system is also in question, given known biases on the part of reviewers that can lead to preferential treatment for researchers from specific regions, institutions, or demographics, or for certain types of research (Lee et al. 2013). ...

Reference:

Reproducibility in machine‐learning‐based research: Overview, barriers, and drivers
Unraveling overoptimism and publication bias in ML-driven science
  • Citing Article
  • February 2025

Patterns

... Children's ASR remains a considerable difficulty, as evidenced by the decline in performance of contemporary state-of-the-art systems compared to adult speech. The decline can be partly ascribed to the significant acoustic heterogeneity in the speech of kids resulting from developmental alterations in the language production equipment (Berisha & Liss, 2024). These physical alterations result in variations in formant and basic frequency positions. ...

Responsible development of clinical speech AI: Bridging the gap between clinical research and technology

npj Digital Medicine

... Therefore, speech assessment may provide clues for differential diagnosis among neurological diseases with differing pathophysiology but similar clinical manifestations. Based on a cross-sectional design, digital speech biomarkers have been researched mainly in Huntington's disease, multiple sclerosis, cerebellar ataxia, amyotrophic lateral sclerosis, multiple system atrophy, and progressive supranuclear palsy (Neumann et al., 2024; Noffs, 2020; Simmatis, 2023; Stegmann, 2024; Stegmann et al., 2020) (Table 3; see also Supplementary Material for associated references). Interestingly, in Huntington's disease and cerebellar ataxia, subliminal speech impairment has been detected in prodromal periods (Vogel et al., 2022; Kouba et al., 2023). ...

Automated speech analytics in ALS: higher sensitivity of digital articulatory precision over the ALSFRS-R
  • Citing Article
  • June 2024

... This may be the result of the richness of the data used in the studies. Also, it is important to remember that due to the complex spatial-temporal data structure, EEG research can often suffer from non-reproducibility due to inadequate data analysis methods and overfitting [92]. HRV focuses on changes in the beat-to-beat interval, which are ANS activity-dependent [93,94], while ECG signal morphology should be understood as the shape of the voltage curve over time. ...

Achieving Reproducibility in EEG-Based Machine Learning
  • Citing Conference Paper
  • June 2024

... It necessitates automating CIU annotation to efficiently extract the features from spatiosemantic graphs. Given the Cookie Theft picture's usage in the largest Alzheimer's disease and dementia cohort studies to date, the availability of these automatically extracted spatiosemantic features will serve as a great aid to clinicians in automatically assessing cognitive impairment [18]- [21]. ...

Operationalizing Clinical Speech Analytics: Moving From Features to Measures for Real-World Clinical Impact

... The approach presented in this work does not assume knowledge of constraints yet allows recovery of interpretable constraints. This work also deals with error detection in ML models, or "metacognition" [17,21,31]. Recent efforts include neurosymbolic methods [7,9,16] that rely on a metacognitive approach for detecting errors in the output of a trained machine learning model that assigns a label to some sample. ...

Machine Learning Qualification Process and Impact to System Assurance
  • Citing Conference Paper
  • January 2024

... In [110], acoustic-based speech measures, such as the duration ratio of different syllable sequences, variability in syllable duration, and F2 slope, were compared with validated perceptual ratings of coordination, consistency, and speed in ALS speech. The Pearson and Spearman correlation coefficients confirmed the validity of these speech measures for profiling articulatory deficits in motor speech disorders. Similarly, in [111], Yawer et al. evaluated the external validity of a publicly available speech AI tool designed to assess stress. The tool's stress measures were compared with the well-established ...

Reliability and validity of a widely-available AI tool for assessment of stress based on speech

... First is the increasing availability of Big Data that enables deployment of data-hungry ML techniques. Big Data sources include large, curated databases such as the AphasiaBank or PhonBank (MacWhinney & Fromm, 2016; Rose & MacWhinney, 2014) as well as more ad hoc data sources that have emerged as speech-language researchers increasingly employ remote monitoring approaches (Cordella et al., 2022; Kadambi et al., 2023; Liu et al., 2023) or pool years of research data on a specific population or theme. These approaches include wearable sensors (Cao et al., 2023; Coyle & Sejdić, 2020; Van Stan et al., 2015), smartphone digital recordings (Connaghan et al., 2019; Kadambi et al., 2023; Van Stan et al., 2017), and ecological momentary assessment (Hester et al., 2023; Marks et al., 2021), to name a few. ...

Wav2DDK: Analytical and Clinical Validation of an Automated Diadochokinetic Rate Estimation Algorithm on Remotely Collected Speech

... When considering the ALS subtype, bulbar or spinal, the analysis showed that individuals with bulbar ALS retain a percentage of this function for a shorter period than those with spinal ALS. One study indicated that only 30% of individuals with ALS begin with bulbar onset (20), which corroborates the results of our study, considering the sample size, which shows a higher percentage of individuals with spinal ALS. However, bulbar-type subjects experience speech loss over a short period (4), as bulbar ALS involves the upper motor neurons (UMNs), lower motor neurons (LMNs), or both, located in the cortex and brainstem, causing difficulties in verbal fluency and voice, making speech slower, breathy, and hypernasal (1,5,10). ...

A speech-based prognostic model for dysarthria progression in ALS
  • Citing Article
  • June 2023

... Thus, they are adopted as powerful tools for multiple applications in public health and education. Therefore, the use of LLMs has been investigated for applications in the field of mental health as supportive tools [8] in the medical/clinical field: for diagnosis of mental distress [9], for cognitive impairments such as Alzheimer's disease [10], to predict the mini-mental state examination score related to cognitive impairments [11], for ASD detection [12], and more. ...

Decorrelating Language Model Embeddings for Speech-Based Prediction of Cognitive Impairment
  • Citing Conference Paper
  • June 2023