About
39 Publications
5,916 Reads
588 Citations
Introduction
Nagaraj Adiga received his B.E. degree in Electronics and Communication Engineering from University Visvesvaraya College of Engineering, Bengaluru, India, in 2008. He was a Software Engineer at Alcatel-Lucent India Private Limited, Bengaluru, India, from 2008 to 2011, mainly focusing on next-generation high-leverage optical transport networks. He then obtained his Ph.D. degree from the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, in 2017. He is currently a postdoctoral researcher in the Department of Computer Science, University of Crete, Greece. His research interests include speech processing, speech synthesis, voice conversion, speech recognition, voice pathology, and machine learning.
Publications (39)
Detecting and recovering out-of-vocabulary (OOV) words is always challenging for Automatic Speech Recognition (ASR) systems. Many existing methods focus on modeling OOV words by modifying acoustic and language models and integrating context words cleverly into models. To train such complex models, we need a large amount of data with context words,...
Our efforts towards developing an automatic speaker verification (ASV) system for child speakers are presented in this paper. For the majority of the languages, children's speech data for training the ASV system is either unavailable (zero-resource) or very limited (low-resource). Under low- and zero-resource conditions, developing an ASV system be...
In this paper, we suggest a new parallel, non-causal and shallow waveform-domain architecture for speech enhancement based on FFTNet, a neural network for generating high-quality audio waveforms. In contrast to other waveform-based approaches like WaveNet, FFTNet uses an initial wide dilation pattern. Such an architecture better represents the long...
In this paper, the effect of prosody-modification-based data augmentation is explored in the context of automatic speech recognition (ASR). The primary motive is to develop ASR systems that are less affected by speaker-dependent acoustic variations. Two factors contributing towards inter-speaker variability that are focused on in this paper are pit...
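The kind of pitch and speaking-rate perturbation used for such augmentation can be sketched with a minimal resampling-based modification. This is an illustrative stand-in, not the paper's actual prosody-modification algorithm; the function name and the use of plain linear interpolation are assumptions:

```python
import numpy as np

def speed_perturb(signal, factor):
    # Resample by linear interpolation. Played back at the original sampling
    # rate, factor > 1 shortens the signal and raises its pitch; factor < 1
    # does the opposite. (Illustrative sketch, not the paper's method.)
    n_out = int(len(signal) / factor)
    t = np.arange(n_out) * factor          # fractional read positions
    return np.interp(t, np.arange(len(signal)), signal)
```

A dedicated prosody-modification method would change pitch and duration independently; uniform resampling couples the two, which is why it is only a rough stand-in here.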
In the context of automatic speech recognition (ASR), the power spectrum is generally warped to the Mel-scale during front-end speech parameterization. This is motivated by the fact that human perception of sound is nonlinear. The Mel-filterbank provides better resolution for low-frequency contents, while a greater degree of averaging happens in th...
The primary motive of this study is to develop an automatic speech recognition (ASR) system using limited amount of speech data such that it is least affected by speaker-dependent acoustic variations. The two factors contributing towards inter-speaker variability that are focused upon in this work are pitch and speaking-rate variations. In order to...
The presence of velopharyngeal dysfunction in individuals with cleft palate (CP) nasalizes the voiced stops. Due to this, voiced stops (/b/, /d/, /g/) tend to be perceived as nasal consonants (/m/, /n/, /ng/). In this work, a novel algorithm is proposed for the detection of nasalized voiced stops in CP speech using epoch-synchronous features. Spee...
The objective of this paper is to demonstrate the significance of combining different features present in the glottal activity region for statistical parametric speech synthesis (SPSS). Different features present in the glottal activity regions are broadly categorized as F0, system, and source features, which represent the quality of speech. F0 fea...
In this paper, we have explored the role of combining prosodic variables with the existing acoustic features in the context of children's speech recognition using acoustic models trained on adults' speech. The explored acoustic features are Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction cepstral coefficients (PLPCC) whi...
In the context of automatic speech recognition (ASR) systems, the front-end acoustic features should not be affected by signal periodicity (pitch period). Motivated by this fact, we have studied the role of pitch-synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM STRAIGHT results in a smoother spectrum d...
The objective of this paper is to present a detailed review of modelling various acoustic features employed in statistical parametric speech synthesis (SPSS). As reported in the literature, many acoustic features have been modelled in SPSS to enhance the synthesis quality. This work studies those approaches that add to the quality of SPSS by includ...
Transcribing children's speech using acoustic models trained on adults' speech is very challenging. In such conditions, a highly degraded recognition performance is reported due to large mismatch in the acoustic/linguistic attributes of the training and test data. The differences in pitch (or fundamental frequency) between the two groups of speaker...
A method to improve voicing decisions using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency F0, which may result in errors when its prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this...
In this study we captured oral and nasal signals using a close-talk, head-worn condenser microphone and a contact microphone on the nose. Native speakers of Assamese with no history of any speech disorder were recorded reading three English passages containing phonetically balanced nasal and oral consonants (Rainbow passage), a second passage con...
The objective of this work is to establish the importance of speaker information present in the glottal regions of the speech signal. In addition, its robustness for degraded data and its significance for limited data are examined for the task of speaker verification. An adaptive threshold method is proposed to use on the zero-frequency filtered signal to get the...
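The zero-frequency filtering on which the proposed adaptive threshold operates can be sketched as follows. This is a simplified reading of the standard ZFF recipe (an integrator cascade followed by repeated local-mean trend removal); the paper's adaptive-threshold step and exact window choices are not reproduced here:

```python
import numpy as np

def zero_frequency_filter(s, sr=16000, pitch_hz=100):
    # Simplified zero-frequency filtering sketch (assumed parameters).
    x = np.diff(s, prepend=0.0)        # difference to suppress DC offsets
    y = x.astype(float)
    for _ in range(4):                 # two cascaded 0-Hz resonators = 4 integrations
        y = np.cumsum(y)
    win = int(sr / pitch_hz) | 1       # trend window ~ one pitch period, odd length
    box = np.ones(win) / win
    for _ in range(3):                 # repeated local-mean subtraction removes the trend
        y = y - np.convolve(y, box, mode="same")
    return y

def epochs(zff_out):
    # Epochs correspond to positive-going zero crossings of the ZFF output
    return np.nonzero((zff_out[:-1] < 0) & (zff_out[1:] >= 0))[0]
```

On a synthetic impulse train the output oscillates near the fundamental, with roughly one positive zero crossing per excitation period away from the signal edges.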
Glottal activity is the major activity during speech production and has earlier been detected using the strength of excitation (SoE). This work uses the normalized autocorrelation peak strength (NAPS) and higher-order statistics (HOS) as additional features for detecting glottal activity. The three features, namely SoE, NAPS, and HOS, are, respectively, indi...
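A plausible minimal form of the normalized autocorrelation peak strength is the peak of the energy-normalized autocorrelation within a pitch-lag search range; the paper's exact definition may differ, and the lag bounds used here are assumptions:

```python
import numpy as np

def naps(frame, sr=16000, fmin=60, fmax=400):
    # Normalized autocorrelation r[l] / r[0], searched over plausible pitch
    # lags. Values near 1 indicate strong periodicity (glottal activity);
    # values near 0 indicate noise-like or silent frames. (Sketch only.)
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if r[0] <= 0:
        return 0.0
    r = r / r[0]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(r) - 1)
    return float(np.max(r[lo:hi]))
```

A voiced (periodic) frame scores much higher than a noise frame, which is what makes the measure useful alongside SoE for glottal activity detection.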
This paper presents a hybrid text-to-speech synthesis (TTS) approach combining the advantages of both hidden Markov model speech synthesis (HTS) and unit selection speech synthesis (USS). In hybrid TTS, speech sound units are classified into vowel-like regions (VLRs) and non-vowel-like regions (NVLRs) for selecting the units. The VLRs here r...
An epoch refers to an instant of significant excitation in speech [1]. Prosody modification is the process of manipulating the pitch and duration of speech by fixed or dynamic modification factors. In epoch-based prosody modification, the prosodic features of the speech signal are modified by anchoring around the epoch locations in speech. The objective...
The objective of this work is to reduce the data rate of the LP residual for source modeling in text-to-speech synthesis (TTS) using the knowledge of epochs present in the speech signal. Epochs here refer to glottal closures, glottal openings, onsets of bursts, and some high-amplitude instants in fricatives. Epochs are identified using both zero frequency...
In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore attempted for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synt...
The objective of this work is to demonstrate the significance of instants of significant excitation for source modeling. Instants of significant excitation correspond to the glottal closure, glottal opening, onset of burst, frication and a small number of excitation instants around them. The speech signal is processed independently by zero frequenc...