
Nirmesh Shah
Sony Research India · Multimedia Analysis Department
Doctor of Philosophy
Research Scientist, Sony Research India
About
42 Publications
9,565 Reads
269 Citations (since 2017)
Introduction
Speech and Signal Processing Researcher | Ph.D. in Voice Conversion
Additional affiliations
Education
July 2011 - April 2013
July 2006 - May 2010
Government Engineering College, Surat
Field of study: Electronics and Communication
Publications (42)
In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is attempted. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLP...
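As a rough illustration of the STM idea above: the measure is large where the short-time spectral features change quickly, so its peaks serve as candidate segment boundaries. A minimal sketch, assuming per-frame cepstral vectors and a small regression window (both illustrative choices, not taken from the paper):

```python
def stm(features, half_win=2):
    """features: list of frames, each a list of D cepstral coefficients.
    Returns one STM value per frame (mean squared regression slope)."""
    n_frames = len(features)
    dim = len(features[0])
    offsets = list(range(-half_win, half_win + 1))
    denom = sum(k * k for k in offsets)
    out = []
    for n in range(n_frames):
        total = 0.0
        for d in range(dim):
            # least-squares slope of coefficient d over the local window
            num = 0.0
            for k in offsets:
                idx = min(max(n + k, 0), n_frames - 1)  # clamp at edges
                num += k * features[idx][d]
            slope = num / denom
            total += slope * slope
        out.append(total / dim)
    return out
```

On a toy feature track that jumps from one constant value to another, the curve is zero in the steady regions and peaks at the transition, which is the behavior boundary detection relies on.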
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from the field of speech recognition to speech synthesis. In this work, we have applied the HMM-based Speech Synthesis System (HTS) method to the Gujarati language. Adaptation and evaluation of HTS for the Gujarati language have been done here. Evaluation of the HTS syste...
The text-to-speech (TTS) synthesizer has proved to be an aiding tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer ha...
The text-to-speech (TTS) synthesizer has been an effective tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers built with the Festival framework require a large speech corpus. This corpus needs to be labeled, either at the phoneme level or at the syllable level. TTS systems are mostly available...
In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages, so a unified framework for building TTSes for Indian languages is attempted. As Indian languages are syllable-timed, a syllable-based framework is developed. As the quality of speech synt...
Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking...
The performance of a Voice Assistant (VA) deteriorates notably when tested on whispered speech. Hence, separate systems are being developed for whisper. To that effect, detecting whether the incoming signal is whisper or normal speech (especially with low latency) in noisy environments is desirable from the model-switching point...
Several Neural Network (NN)-based representation techniques have already been proposed for the Query-by-Example Spoken Term Detection (QbE-STD) task. The recent advancement of the Generative Adversarial Network (GAN) in several speech technology applications motivated us to explore the GAN in QbE-STD. In this work, we propose to exploit GAN with the regu...
In the absence of vocal fold vibrations, the movement of the articulators that produces the respiratory sound can be captured from the soft tissue of the head using the non-audible murmur (NAM) microphone. NAM is one of the silent speech interface techniques and can be used by patients suffering from vocal fold-related disorders. Though NA...
Understanding how a particular speaker produces speech, and mimicking one's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity of a given speech utterance from a source speaker to a particular target speaker wit...
Voice Conversion (VC) converts the speaking style of a source speaker to that of a target speaker while preserving the linguistic content of a given speech utterance. Recently, the Cycle-Consistent Adversarial Network (CycleGAN) and its variants have become popular for non-parallel VC tasks. However, CycleGAN uses two different generators a...
Recently, Convolutional Neural Network (CNN)-based Generative Adversarial Networks (GANs) have been used for the Whisper-to-Normal Speech (i.e., WHSP2SPCH) conversion task. These CNN-based GANs are significantly difficult to train in terms of computational complexity. The goal of the generator in a GAN is to map the features of the whispered speech to those of th...
Though whisper is a typical mode of natural speech communication, it differs from normal speech from the speech production and perception perspectives. Recently, authors have proposed a Generative Adversarial Network (GAN)-based architecture (namely, DiscoGAN) to discover such cross-domain relationships for whisper-to-normal speech (WHSP2SPCH) co...
Nearest Neighbor (NN)-based alignment techniques are popular in non-parallel Voice Conversion (VC). The performance of NN-based alignment improves with information about the phone boundary. However, estimating the exact phone boundary is a challenging task. If the text corresponding to the utterance is available, the Hidden Markov Model (HMM) can b...
Recently, Deep Neural Network (DNN)-based Voice Conversion (VC) techniques have become popular in the VC literature. These techniques suffer from the issue of overfitting due to the small amount of available training data from a target speaker. To alleviate this, pre-training is used for better initialization of the DNN parameters, which leads to fa...
Obtaining aligned spectral pairs from non-parallel data for a stand-alone Voice Conversion (VC) technique is a challenging research problem. The unsupervised alignment algorithm, namely, an Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA), iteratively tries to align the spectral features by minimizing th...
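As a toy illustration of the iterative scheme, here is a hedged 1-D sketch: a nearest-neighbor pairing step alternates with re-fitting a simple linear conversion. The linear map is an illustrative stand-in for the mapping function learned in a real VC system, and the scalar "frames" stand in for spectral feature vectors:

```python
def inca(source, target, n_iter=10):
    """1-D INCA-style loop: returns the fitted linear conversion (a, b)."""
    a, b = 1.0, 0.0                      # current conversion y = a*x + b
    for _ in range(n_iter):
        # 1) NN search: pair each converted source frame with its
        #    nearest target frame
        pairs = []
        for x in source:
            y_hat = a * x + b
            y = min(target, key=lambda t: abs(t - y_hat))
            pairs.append((x, y))
        # 2) Conversion step: re-fit the linear map on the found pairs
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        var = sum((x - mx) ** 2 for x, _ in pairs)
        cov = sum((x - mx) * (y - my) for x, y in pairs)
        a = cov / var if var else 1.0
        b = my - a * mx
    return a, b
```

On toy data where the target is just a shifted copy of the source, the loop recovers the shift; real INCA works on high-dimensional spectral frames and a far richer conversion model.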
Alignment is a key step before learning a mapping function between a source and a target speaker’s spectral features in various state-of-the-art parallel data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters o...
Voice Conversion (VC) requires an alignment of the spectral features before learning the mapping function, due to the speaking rate variations across the source and target speakers. To address this issue, the idea of training two parallel networks with the use of speaker-independent representation was proposed. In this paper, we explore the unsuper...
This study presents a significant extension to our recently proposed MMSE-GAN architecture (accepted in INTERSPEECH 2018 and ICASSP 2018) in the framework of Discover GAN (DiscoGAN) for the cross-domain whisper-to-speech conversion task.
In non-parallel Voice Conversion (VC) with the Iterative combination of Nearest Neighbor search step and Conversion step Alignment (INCA) algorithm, the occurrence of one-to-many and many-to-one pairs in the training data deteriorates the performance of the stand-alone VC system. The work on handling these pairs during training is less e...
Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. Obtaining corresponding speech frames from both the source and target speakers before learning the mapping function is a key step in the standalone non-parallel VC task. Obtaining such corresponding pairs is more challenging due to the fact that bo...
The murmur produced by the speaker and captured by the Non-Audible Murmur (NAM) microphone, one of the Silent Speech Interface (SSI) techniques, suffers from speech quality degradation. This is due to the lack of the radiation effect at the lips and the lowpass nature of the soft tissue, which attenuates the high frequency-related information. In this work, a novel...
We propose a novel F0 estimation algorithm that initially estimates the glottal closure instants (GCIs) or pitch and then computes the corresponding fundamental frequency (F0). The proposed method eliminates the assumption that F0 is constant over a segment of short duration (i.e., 20-30 ms). We use our previously proposed novel filtering-based app...
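A hedged sketch of the final step described above: once glottal closure instants (GCIs) are located, F0 follows directly from the spacing of consecutive GCIs, with no fixed-length analysis window. GCI detection itself (the filtering-based approach mentioned in the abstract) is not shown here:

```python
def f0_from_gcis(gci_times):
    """gci_times: increasing GCI timestamps in seconds.
    Returns one F0 estimate (Hz) per glottal cycle."""
    return [1.0 / (t2 - t1) for t1, t2 in zip(gci_times, gci_times[1:])]
```

Because each estimate comes from a single glottal cycle, F0 is free to change from cycle to cycle, which is exactly the point of dropping the constant-F0-per-segment assumption.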
Development of text-independent Voice Conversion (VC) has gained increasing research interest over the last decade. Alignment of the source and target speakers' spectral features before learning the mapping function is the challenging step in the development of text-independent VC, as both speakers have uttered different utterances from the same o...
Voice Conversion (VC) is a technique that converts the perceived speaker identity from a source speaker to a target speaker. Given a source and target speakers' parallel training speech database in text-dependent VC, the first task is to align the source and target speakers' spectral features at the frame level before learning the mapping function. The accu...
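The frame-level alignment step described above is commonly done with dynamic time warping (DTW). A minimal sketch on 1-D feature sequences; real systems align multi-dimensional spectral vectors, but the recursion is the same:

```python
def dtw_path(src, tgt):
    """Return the minimum-cost alignment path [(i, j), ...] between two
    1-D feature sequences using dynamic time warping."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    # cost[i][j]: cheapest way to align src[:i] with tgt[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(src[i - 1] - tgt[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match
                                 cost[i - 1][j],      # src frame repeated
                                 cost[i][j - 1])      # tgt frame repeated
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.append((0, 0))
    return path[::-1]
```

For example, aligning [1, 2, 3] with [1, 2, 2, 3] stretches the source's middle frame across the target's two repeated frames, absorbing the speaking-rate difference.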
Voice conversion (VC) modifies the speech utterance spoken by a source speaker to make it sound as if a target speaker is speaking. Gaussian Mixture Model (GMM)-based VC is a state-of-the-art method. It finds the mapping function by modeling the joint density of the source and target speakers using a GMM to convert spectral features framewise. A...
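The joint-density mapping mentioned above can be illustrated in its simplest form: a single Gaussian over 1-D (source, target) frame pairs, where the converted feature is the conditional expectation E[y | x]. This is a hedged toy reduction of the GMM method, which mixes many such components over high-dimensional spectra:

```python
def fit_joint_gaussian(xs, ys):
    """Fit a single Gaussian to joint (source, target) scalar frames."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n          # source variance
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(xs, ys)) / n            # cross-covariance
    return mx, my, sxx, sxy

def convert(x, params):
    """Framewise conversion: E[y | x] under the fitted joint Gaussian."""
    mx, my, sxx, sxy = params
    return my + (sxy / sxx) * (x - mx)
</imports>```

With a full GMM, the conversion becomes a posterior-weighted sum of such per-component regressions, which is what allows the mapping to differ across regions of the acoustic space.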
We propose a novel application based on acoustic-to-articulatory inversion towards quality assessment of voice-converted speech. The ability of humans to speak effortlessly requires coordinated movements of various articulators, muscles, etc. This effortless movement contributes towards the naturalness, intelligibility, and speaker identity which is pa...
Phonetic segmentation plays a key role in developing various speech applications. In this work, we propose to use various features for the automatic phonetic segmentation task via forced Viterbi alignment and compare their effectiveness. We propose to use novel multiscale fractal dimension-based features concatenated with Mel-Frequency Cepstral Coeffi...
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from the field of speech recognition to speech synthesis. In this paper, we have applied the HMM-based Speech Synthesis (HTS) method to Gujarati (one of the official languages of India). Adaptation and evaluation of HTS for the Gujarati language have been done here...
We propose to use multiscale fractal dimension (MFD) as components of feature vectors for automatic speech recognition (ASR), especially in low-resource languages. Speech, which is known to be a nonlinear process, can be efficiently represented by extracting some nonlinear properties, such as fractal dimension, from the speech segment. During speech...
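The fractal-dimension idea above can be illustrated with a single-scale estimate. A hedged sketch using the Katz estimator as a stand-in for the multiscale features actually used (which compute such estimates across several scales):

```python
import math

def katz_fd(x):
    """Katz fractal dimension of a 1-D signal segment (unit-spaced samples)."""
    n = len(x) - 1                                           # number of steps
    # total path length of the waveform traced sample to sample
    L = sum(math.hypot(1.0, x[i + 1] - x[i]) for i in range(n))
    # planar extent: distance from the first sample to the farthest one
    d = max(math.hypot(i, x[i] - x[0]) for i in range(1, n + 1))
    return math.log10(n) / (math.log10(n) + math.log10(d / L))
```

A straight-line segment gives a dimension of exactly 1, while an oscillating segment gives a value above 1, which is the kind of waveform-roughness cue these features add on top of spectral coefficients.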
Hidden Markov Models (HMM) have been applied successfully to Automatic Speech Recognition (ASR) problems and are currently applied in speech synthesis applications. In this thesis, HMM-based Speech Synthesis System (HTS) for TTS is understood in detail and applied to Gujarati language. In particular, for HTS implementation, issues related to charac...
Questions (5)
This question is in the context of the bilinear frequency warping method. I would like to know why it is called bilinear?
Thanks in advance.
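For what it's worth: the name refers to the first-order all-pass transform used to warp the frequency axis, z -> (z^-1 - a)/(1 - a z^-1), a ratio of two expressions each linear in z, i.e. a bilinear (Mobius) map. A small numeric sketch of the resulting frequency-warping curve; a = 0.42 is the value commonly quoted for approximating the mel scale at 16 kHz:

```python
import math

def warp(omega, alpha=0.42):
    """Warped frequency (rad) given by the phase of the first-order
    all-pass filter with warping coefficient alpha."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))
```

The map fixes 0 and pi and, for positive alpha, stretches low frequencies upward, mimicking the mel scale's finer low-frequency resolution.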
In a preference test, subjects have to choose one of two items. When presenting the result as a preference score, how do we calculate the 95% confidence interval, given that the underlying values are binary (1 or 0 based on the preference)?
What is the difference between speech/non-speech detection and voice activity detection?
Can anyone please explain the difference between Mel-Frequency Cepstral Coefficients (MFCC) and Mel Cepstral Coefficients (MCC)?
In a GMM, how can we understand that a linear combination of diagonal-covariance Gaussians is capable of modelling the correlations between feature vectors? How can one visualize this?
Projects (5)
Using deep learning approaches to increase the intelligibility of speech signals