
Sudarsana Reddy Kadiri
- Doctor of Philosophy
- Research Scientist at University of Southern California
About
- 101 Publications
- 26,112 Reads
- 1,381 Citations
Introduction
For details about me and my research work, visit my web page:
https://sites.google.com/view/sudarsanareddykadiri
Current institution: University of Southern California
Publications (101)
Vocal intensity is quantified by sound pressure level (SPL). The SPL can be measured by either using a sound level meter or by comparing the energy of the recorded speech signal with the energy of the recorded calibration tone of a known SPL. Neither of these approaches can be used if speech is recorded in real-life conditions using a device that i...
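The calibration-tone approach mentioned above reduces to a simple energy ratio. Below is a minimal sketch of that classical estimate (not the paper's proposed method, which targets recordings made without a calibration tone), assuming both signals were captured with the same device and gain settings:

```python
import numpy as np

def estimate_spl(speech, cal_tone, cal_spl_db):
    """Classical calibration-based SPL estimate:
    SPL_speech = SPL_cal + 10*log10(E_speech / E_cal)."""
    e_speech = np.mean(np.square(np.asarray(speech, dtype=float)))
    e_cal = np.mean(np.square(np.asarray(cal_tone, dtype=float)))
    return cal_spl_db + 10.0 * np.log10(e_speech / e_cal)
```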
Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly affecting people with neurological disorders. Text or speech decoding from brain activities is a relevant domain that could augment the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened spee...
Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant e...
Variability in speech pronunciation is widely observed across different linguistic backgrounds, which impacts modern automatic speech recognition performance. Here, we evaluate the performance of a self-supervised speech model in phoneme recognition using direct articulatory evidence. Findings indicate significant differences in phoneme recognition...
Large Language Models (LLMs) have shown significant potential in understanding human communication and interaction. However, their performance in the domain of child-inclusive interactions, including in clinical settings, remains less explored. In this work, we evaluate generic LLMs' ability to analyze child-adult dyadic interactions in a clinicall...
Objectives: Increased prevalence of social creak, particularly among female speakers, has been reported in several studies. The study of social creak has previously been conducted by combining perceptual evaluation of speech with conventional acoustical parameters such as the harmonic-to-noise ratio and cepstral peak prominence. In the current study,...
Objective: This study aims to evaluate the effectiveness of the Death-Brief Implicit Association Task (D-BIAT) in distinguishing between control, depressed, and suicidal states among undergraduate students. Given the significant mental health challenges posed by depression and suicidal ideation, particularly among young adults in educational enviro...
Objective: This study proposes a Death Association-Reaction (DAR)-score to distinguish associations among control, depressed, and suicidal participants using reaction times in a sentence classification task, similar to the Death-Implicit Association Task (D-IAT). Participants classify (agree/disagree) a series of stimuli, and their reaction times a...
The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that h...
Clinical videos in the context of Autism Spectrum Disorder are often long-form interactions between children and caregivers/clinical professionals, encompassing complex verbal and non-verbal behaviors. Objective analyses of these videos could provide clinicians and researchers with nuanced insights into the behavior of children with Autism Spectrum...
Stuttering is a common speech impediment that is caused by irregular disruptions in speech production, affecting over 70 million people across the world. Standard automatic speech processing tools do not take speech ailments into account and are thereby not able to generate meaningful results when presented with stuttered speech as input. The autom...
Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acous...
Many acoustic features and machine learning models have been studied to build automatic detection systems to distinguish dysarthric speech from healthy speech. These systems can help to improve the reliability of diagnosis. However, speech recorded for diagnosis in real-life clinical conditions can differ from the training data of the detection sys...
In this paper, we present our effort to develop an automatic speaker verification (ASV) system for low-resource children’s data. For child speakers, only a very limited amount of speech data is available in the majority of languages for training the ASV system. Developing an ASV system under low-resource conditions is a very challenging problem....
In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g....
Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving the diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF (MFCC-SFF) for the severity classification...
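Both feature sets follow the familiar cepstral recipe (log spectral energies followed by a DCT), with the spectra coming from SFF rather than the STFT. A rough, generic sketch of that final step is given below; the exact SFF front end and filterbank are defined in the paper, so this is only an approximation of the idea:

```python
import numpy as np
from scipy.fft import dct

def cepstral_coeffs(spectral_frames, n_coeffs=13):
    """DCT of log spectral (or filterbank) energies, one vector per frame.
    Feeding SFF-derived spectra here approximates the SFFCC idea; applying a
    mel filterbank first would correspond to the MFCC-SFF variant."""
    log_e = np.log(np.maximum(spectral_frames, 1e-12))
    return dct(log_e, type=2, axis=-1, norm='ortho')[..., :n_coeffs]
```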
In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed pha...
Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-...
Previous studies on the automatic classification of voice disorders have mostly investigated the binary classification task, which aims to distinguish pathological voice from healthy voice. Using multi-class classifiers, however, more fine-grained identification of voice disorders can be achieved, which is more helpful for clinical practitioners. U...
In singing, the perceptual term “voice quality” is used to describe expressed emotions and singing styles. In voice physiology research, specific voice qualities are discussed by the term phonation modes and are related directly to the voicing produced by the vocal folds. The control and awareness of phonation modes is vital for professional singer...
In low-resource children's automatic speech recognition (ASR), performance is degraded due to the limited acoustic and speaker variability available in small datasets. In this paper, we propose a spectral-warping-based data augmentation method to capture more acoustic and speaker variability. This is carried out by warping the linear prediction (LP) s...
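As a rough illustration of spectral-warping augmentation in general (the paper warps the LP spectrum specifically; this hypothetical helper warps any magnitude spectrum along the frequency axis):

```python
import numpy as np

def warp_spectrum(mag_spec, alpha):
    """Resample a magnitude spectrum along a linearly scaled frequency axis
    (VTLP-style) to mimic a different vocal-tract length; alpha near 1.0
    (e.g. 0.9-1.1) yields plausible augmented variants."""
    bins = np.arange(len(mag_spec))
    return np.interp(np.clip(bins * alpha, 0, len(mag_spec) - 1),
                     bins, mag_spec)
```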
The events of recent years have highlighted the importance of telemedicine solutions which could potentially allow remote treatment and diagnosis. Relatedly, Computational Paralinguistics, a unique subfield of Speech Processing, aims to extract information about the speaker and form an important part of telemedicine applications. In this work, we f...
The major impulse-like excitation in the speech signal is due to abrupt closure of the vocal folds, which takes place at the glottal closure instant (GCI) or epoch in each cycle. GCIs are used in many areas of speech science and technology, such as in prosody modification, voice source analysis, formant extraction and speech synthesis. It is diffic...
Understanding of the perception of emotions or affective states in humans is important to develop emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio–visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio–visual emotion databa...
Automatic voice pathology detection is a research topic, which has gained increasing interest recently. Although methods based on deep learning are becoming popular, the classical pipeline systems based on a two-stage architecture consisting of a feature extraction stage and a classifier stage are still widely used. In these classical detection sys...
The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolution neural networks (CNNs), temporal convolution neural networks (TCN), time-delay neural network (TDNN), and emphasized channel attention, propagation and a...
Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (Q...
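A dynamic-programming tracker of this kind typically balances a local cost (how plausible a candidate is for a given formant) against a transition cost (how much the formant jumps between frames). The toy sketch below assumes per-frame candidate frequencies are already available from one of the LP-based estimators; the cost terms are illustrative, not the study's:

```python
import numpy as np

def dp_track(candidates, nominal, trans_weight=1.0):
    """Pick one formant value per frame from per-frame candidate lists.
    Cost = deviation from a nominal frequency + weighted jump between
    consecutive frames, minimized by dynamic programming."""
    T = len(candidates)
    local = [np.abs(np.asarray(c, dtype=float) - nominal) for c in candidates]
    cost, back = [local[0]], []
    for t in range(1, T):
        prev = cost[-1][:, None]                                  # (n_prev, 1)
        jump = np.abs(np.asarray(candidates[t])[None, :] -
                      np.asarray(candidates[t - 1])[:, None])     # (n_prev, n_cur)
        total = prev + trans_weight * jump
        back.append(total.argmin(axis=0))
        cost.append(total.min(axis=0) + local[t])
    # backtrack the cheapest path through the candidates
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]

# toy usage: noisy F1 candidates around 500 Hz
cands = [[480, 900], [510, 950], [1500, 505], [495, 880]]
print(dp_track(cands, nominal=500))   # -> [480, 510, 505, 495]
```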
Speakers exhibit dialectal traits in speech at sub-segmental, segmental, and supra-segmental levels. Any feature representation for dialect classification should appropriately represent these dialectal traits. Traditional segmental features such as mel-frequency cepstral coefficients (MFCCs) fail to represent sub-segmental and supra-segmental diale...
Differences in acoustic characteristics between children’s and adults’ speech degrade performance of automatic speech recognition systems when systems trained using adults’ speech are used to recognize children’s speech. This performance degradation is due to the acoustic mismatch between training and testing. One of the main sources of the acousti...
Speech production can be regarded as a process where a time-varying vocal tract system (filter) is excited by a time-varying excitation. In addition to its linguistic message, the speech signal also carries information about, for example, the gender and age of the speaker. Moreover, the speech signal includes acoustical cues about several speaker t...
Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (Q...
Current ASR systems show poor performance in recognition of children’s speech in noisy environments because recognizers are typically trained with clean adults’ speech and therefore there are two mismatches between training and testing phases (i.e., clean speech in training vs. noisy speech in testing and adult speech in training vs. child speech i...
In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children’s speech by automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, the ASR performance is degraded due to the acoustic and linguistic mismatch in the attributes betwee...
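Of the two modifications named above, the spectral-tilt reduction can be illustrated with a first-order pre-emphasis filter, a standard way to flatten the tilt; the formant-sharpening step and the paper's exact modification scheme are not reproduced here:

```python
import numpy as np

def reduce_spectral_tilt(x, beta=0.97):
    """First-order pre-emphasis y[n] = x[n] - beta*x[n-1]: boosts high
    frequencies and thereby reduces the overall spectral tilt."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - beta * x[:-1])
```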
Glottal source characteristics vary between phonation types due to the tension of laryngeal muscles with the respiratory effort. Previous studies in the classification of phonation type have mainly used speech signals recorded by microphone. Recently, two studies were published in the classification of phonation type using neck surface acceleromete...
In generation of emotional speech, there are deviations in the speech production features when compared to neutral (non-emotional) speech. The objective of this study is to capture the deviations in features related to the excitation component of speech and to develop a system for automatic recognition of emotions based on these deviations. The emo...
In this study, we propose Mel-weighted single frequency filtering (SFF) spectrograms for dialect identification. The spectrum derived using SFF has high spectral resolution for harmonics and resonances while simultaneously maintaining good time-resolution of some speech excitation features such as impulse-like events. The SFF spectrum can represent...
End-to-end neural network models (E2E) have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model is unstable or using the same architecture under-utilizes task-specif...
In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g....
A method is proposed for time delay estimation (TDE) from mixed source (speaker) signals collected at two spatially separated microphones. The key idea in this proposal is that the crosscorrelation between corresponding segments of the mixed source signals is computed using the outputs of single frequency filtering (SFF) obtained at several frequen...
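The core operation here is the classical cross-correlation time-delay estimate; the paper's contribution is to compute such correlations on SFF outputs at several frequencies rather than on the raw waveforms. A baseline sketch for two equal-length microphone signals:

```python
import numpy as np

def time_delay(x1, x2, fs, max_delay_s=1e-3):
    """Return the lag (seconds) maximizing sum_n x1[n] * x2[n + lag];
    a positive value means x2 is delayed relative to x1."""
    max_lag = int(max_delay_s * fs)
    best_lag, best_val = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            val = np.dot(x1[:len(x1) - lag], x2[lag:])
        else:
            val = np.dot(x1[-lag:], x2[:lag])
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / fs
```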
A new approach for determining the glottal activity from speech signals is presented in this paper. The approach is based on the use of single frequency filtering (SFF), proposed recently for voice activity detection. The variance (across frequency) of the spectral envelopes at each sampling instant is derived using the SFF of speech signal. The va...
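Below is a minimal sketch of the SFF envelope computation and the across-frequency variance it feeds, assuming the standard SFF formulation reported in the literature (frequency shift to fs/2 followed by a single-pole filter with r close to 1); parameter values are illustrative and the paper's thresholds and post-processing are omitted:

```python
import numpy as np
from scipy.signal import lfilter

def sff_envelopes(x, fs, freqs, r=0.995):
    """SFF amplitude envelopes: each analysis frequency f_k is shifted to
    fs/2 by a complex exponential, filtered with H(z) = 1 / (1 + r z^-1),
    and the magnitude of the complex output is the envelope at f_k."""
    n = np.arange(len(x))
    env = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        w = np.pi - 2 * np.pi * f / fs           # shift f to fs/2
        shifted = np.asarray(x, dtype=float) * np.exp(1j * w * n)
        env[k] = np.abs(lfilter([1.0], [1.0, r], shifted))
    return env

def glottal_activity(env):
    """Variance of the envelopes across frequency at each sample,
    normalized; high values suggest voiced (glottal activity) regions."""
    v = np.var(env, axis=0)
    return v / (v.max() + 1e-12)
```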
In this article, we study emotion detection from speech in a speaker-specific scenario. By parameterizing the excitation component of voiced speech, the study explores deviations between emotional speech (e.g., speech produced in anger, happiness, sadness, etc.) and neutral speech (i.e., non-emotional) to develop an automatic emotion detection syst...
Both in speech and singing, humans are capable of generating sounds of different phonation types (e.g., breathy, modal and pressed). Previous studies in the analysis and classification of phonation types have mainly used voice source features derived using glottal inverse filtering (GIF). Even though glottal source features are useful in discrimina...
Aperiodicity in the voice source is caused by changes in the vocal fold vibrations, other than the normal quasi-periodicity and the turbulence at the glottis. The aperiodicity appears to be one of the main properties that is responsible for conveying the emotion in artistic voices. In this paper, the feasibility of representing the excitation sourc...
Existing studies on the classification of phonation types in singing use voice source features and Mel-frequency cepstral coefficients (MFCCs), showing poor performance due to the high pitch in singing. In this study, high-resolution spectra obtained using the zero-time windowing (ZTW) method are utilized to capture the effect of voice excitation. ZTW does n...
This paper proposes an approach using spectral flatness measure to detect the glottal closure instant (GCI) and the glottal open region (GOR) within each glottal cycle in voiced speech. The spectral flatness measure is derived from the instantaneous spectra obtained in the analysis of speech using single frequency filtering (SFF) and zero time wind...
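The spectral flatness measure itself is the ratio of the geometric to the arithmetic mean of a power spectrum; impulse-like excitation around a GCI gives a flatter instantaneous spectrum and hence a value closer to 1. A small sketch of the measure (the SFF/ZTW instantaneous spectra it is applied to in the paper are not reproduced here):

```python
import numpy as np

def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum:
    close to 1 for flat (impulse-like) spectra, close to 0 for peaked ones."""
    p = np.maximum(np.asarray(power_spectrum, dtype=float), eps)
    return np.exp(np.mean(np.log(p))) / np.mean(p)
```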
The brain is the most complex biological system that exists. Timbre, in its very nature, is a multidimensional concept with several levels of abstraction thus rendering the investigation of its processing in the brain extremely challenging. Timbre processing can be discussed in relation to levels of abstraction. Low- to mid-level representations ca...
Telephone speech is one of the degradations involved in building speech systems in practical environments. The potential use of speech systems depends on speech analysis algorithms that can handle the different acoustic variations and degradations often found in human speech communication. Detection of epochs/glottal closure instants (GCI...
This paper presents a method for modifying speech to enhance its intelligibility in noise. The features contributing to intelligibility are analyzed using the recently proposed single frequency filtering (SFF) analysis of speech signals. In the SFF method, the spectral and temporal resolutions can be controlled using a single parameter of the filte...
Studies on the phase component of signals are important due to the complementary information it provides besides the amplitude information. Though most studies have focused on the phase of the short-time Fourier transform (STFT), there are other forms of phase, like the phase of an analytic signal and the phase of signals obtained through filtering operation...
Automatic speaker verification systems are vulnerable to spoofing attacks. Recently, various countermeasures have been developed for detecting high-technology attacks such as speech synthesis and voice conversion. However, there is a wide gap in dealing with replay attacks. In this paper, we propose a new feature for replay attack detection...
Epochs are instants of significant excitation of the vocal tract system during production of voiced speech. Existing methods for epoch extraction provide good results on neutral speech. But effectiveness of these methods has not been examined carefully for analysis of emotional speech, where the emotion characteristics are embedded mainly in the so...
Speech carries information not only about the lexical content, but also about the age, gender, signature and emotional state of the speaker. Speech in different emotional states is accompanied by distinct changes in the production mechanism. In this chapter, we present a review of analysis methods used for emotional speech. In particular, we focu...
During production of emotional speech there are deviations in the components of speech production mechanism when compared to normal speech. The objective of this study is to capture the deviations in features related to the excitation source component of speech, and to develop a system for automatic recognition of emotions based on these deviations...
The objective of this work is to develop a rule-based emotion conversion method for a better emotional perception. In this work, performance of emotion conversion using the linear modification model is improved by using vowel-based non-uniform prosody modification. In the present approach, attempts were made to integrate features like position and...
In this paper, the non-uniform duration modification is exploited along with other prosody features for neutral speech to anger speech conversion. The non-uniform duration modification method modifies the durations of vowel and pause segments by different modification factors. Vowel segments are modified by factors based on their identities, and...
In this paper, we address the issue of speaker-specific emotion detection (neutral vs emotion) from speech signals with models for neutral speech as reference. As emotional speech is produced by the human speech production mechanism, the emotion information is expected to lie in the features of both excitation source and the vocal tract system. Lin...
Progress in research areas like emotion recognition, identification, synthesis, etc., relies heavily on the development and structure of the database. This paper addresses some of the key issues in the development of emotion databases. A new audiovisual emotion (AVE) database is developed. The database consists of audio, video and audi...
Cognitive loaded speech is produced when a speaker experiences the load imposed by a certain task on the cognitive system and it can be regarded as deviation from neutral speech. The objective of the present study is to explore the deviations in the excitation source features of cognitive loaded speech compared to neutral speech. The excitation sou...
Studies on the emotion recognition task indicate that there is confusion in discrimination among higher activation states like 'anger' and 'happy'. In this study, features related to the excitation source of speech are examined for discriminating 'anger' and 'happy' emotions. The objective is to explore the features which are independent of lexical co...