Nirmesh Shah
Sony Research India · Multimedia Analysis Department

Doctor of Philosophy
Research Scientist, Sony Research India

About

42 Publications
9,565 Reads
269 Citations

Citations since 2017: 28 research items, 229 citations
Additional affiliations
January 2016 - October 2018
Dhirubhai Ambani Institute of Information and Communication Technology
Position
  • PhD Student
January 2016 - October 2018
Dhirubhai Ambani Institute of Information and Communication Technology
Position
  • Tutor and Teaching Assistant
May 2012 - present
Dhirubhai Ambani Institute of Information and Communication Technology
Position
  • Research Assistant
Description
  • Here, we are developing TTS systems for Gujarati (one of the official languages of India) using various state-of-the-art methods.
Education
July 2006 - May 2010
Government Engineering College, Surat
Field of study
  • Electronics and Communication

Publications (42)
Conference Paper
In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is explored. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC) and perceptual linear prediction cepstral coefficients (PLP...
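For readers unfamiliar with the spectral transition measure, the following is a minimal numpy sketch of one common STM formulation (an illustrative stand-in with a hypothetical function name, not necessarily the exact variant used in this paper): for each frame, a least-squares line is fitted to every cepstral coefficient over a short context window and the squared slopes are averaged; peaks in the resulting curve suggest spectral transitions, i.e., candidate segment boundaries.

import numpy as np

def spectral_transition_measure(cepstra, context=2):
    """cepstra: (num_frames, num_coeffs) array; returns a (num_frames,) STM curve."""
    num_frames, num_coeffs = cepstra.shape
    offsets = np.arange(-context, context + 1)          # centered regression time axis
    denom = np.sum(offsets ** 2)
    stm = np.zeros(num_frames)
    for n in range(context, num_frames - context):
        window = cepstra[n - context:n + context + 1]   # (2*context+1, num_coeffs)
        slopes = offsets @ window / denom                # least-squares slope per coefficient
        stm[n] = np.mean(slopes ** 2)                    # average squared slope = transition measure
    return stm

# Toy usage with random "cepstra"; in practice these would be MFCC/PLP/CFCC frames.
if __name__ == "__main__":
    np.random.seed(0)
    fake_cepstra = np.random.randn(200, 13)
    curve = spectral_transition_measure(fake_cepstra)
    print(curve.shape, curve.max())
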
Conference Paper
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from the field of speech recognition to speech synthesis. In this work, we have applied the HMM-based Speech Synthesis System (HTS) method to the Gujarati language. Adaptation and evaluation of HTS for Gujarati are presented here. Evaluation of the HTS syste...
Conference Paper
A text-to-speech (TTS) synthesizer has proven to be an aid for many visually challenged people, enabling reading through auditory feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer ha...
Conference Paper
A text-to-speech (TTS) synthesizer has been an effective tool for many visually challenged people, enabling reading through auditory feedback. TTS synthesizers built with the Festival framework require a large speech corpus. This corpus needs to be labeled, which can be done at the phoneme level or at the syllable level. TTS systems are mostly available...
Conference Paper
Full-text available
In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As the quality of speech synt...
Preprint
Full-text available
Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking...
Conference Paper
The performance of a Voice Assistant (VA) deteriorates notably when tested on whispered speech. Hence, separate systems are being developed for whispered input. To that end, detecting whether the incoming signal is a whisper or normal speech (especially with low latency) in noisy environments is desirable from the model-switching point...
Conference Paper
Full-text available
Several Neural Network (NN)-based representation techniques have already been proposed for the Query-by-Example Spoken Term Detection (QbE-STD) task. The recent advances of Generative Adversarial Networks (GANs) in several speech technology applications motivated us to explore GANs for QbE-STD. In this work, we propose to exploit a GAN with the regu...
Preprint
Full-text available
Several Neural Network (NN)-based representation techniques have already been proposed for the Query-by-Example Spoken Term Detection (QbE-STD) task. The recent advances of Generative Adversarial Networks (GANs) in several speech technology applications motivated us to explore GANs for QbE-STD. In this work, we propose to exploit a GAN with the regu...
Chapter
In the absence of vocal fold vibrations, the movement of the articulators that produces the respiratory sound can be captured through the soft tissue of the head using a non-audible murmur (NAM) microphone. NAM is one of the silent speech interface techniques and can be used by patients suffering from vocal fold-related disorders. Though NA...
Thesis
Full-text available
Understanding how a particular speaker produces speech, and mimicking one's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity in a given speech utterance from a source speaker to a particular target speaker wit...
Conference Paper
Full-text available
Voice Conversion (VC) converts the speaking style of a source speaker to the speaking style of a target speaker while preserving the linguistic content of a given speech utterance. Recently, the Cycle-Consistent Adversarial Network (CycleGAN) and its variants have become popular for non-parallel VC tasks. However, CycleGAN uses two different generators a...
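To make the cycle-consistency idea behind such CycleGAN-based VC concrete, here is a tiny PyTorch sketch (illustrative only, with toy frame-wise MLP generators and made-up names; the actual models in the paper differ): two generators map spectral frames between speakers X and Y, and a round-trip reconstruction loss encourages the linguistic content to be preserved.

import torch
import torch.nn as nn

def make_generator(dim=40, hidden=128):
    # Tiny frame-wise MLP mapping one speaker's spectral frame to the other's.
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

g_x2y, g_y2x = make_generator(), make_generator()
l1 = nn.L1Loss()

x = torch.randn(32, 40)   # batch of speaker-X spectral frames (toy data)
y = torch.randn(32, 40)   # batch of speaker-Y spectral frames (toy data)

# Round trips X -> Y -> X and Y -> X -> Y must reconstruct the inputs.
cycle_loss = l1(g_y2x(g_x2y(x)), x) + l1(g_x2y(g_y2x(y)), y)
# In a full system this term is added to the adversarial losses of the two
# discriminators before back-propagation.
print(float(cycle_loss))
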
Conference Paper
Full-text available
Recently, Convolutional Neural Network (CNN)-based Generative Adversarial Networks (GANs) have been used for the Whisper-to-Normal Speech (i.e., WHSP2SPCH) conversion task. These CNN-based GANs are significantly difficult to train in terms of computational complexity. The goal of the generator in a GAN is to map the features of whispered speech to those of th...
Conference Paper
Full-text available
Though whispering is a common mode of natural speech communication, it differs from normal speech from both the speech production and perception perspectives. Recently, the authors proposed a Generative Adversarial Network (GAN)-based architecture (namely, DiscoGAN) to discover such cross-domain relationships for whisper-to-normal speech (WHSP2SPCH) co...
Conference Paper
Full-text available
Nearest Neighbor (NN)-based alignment techniques are popular in non-parallel Voice Conversion (VC). The performance of NN-based alignment improves with information about phone boundaries. However, estimating the exact phone boundary is a challenging task. If the text corresponding to the utterance is available, the Hidden Markov Model (HMM) can b...
Conference Paper
Full-text available
Recently, Deep Neural Network (DNN)-based Voice Conversion (VC) techniques have become popular in the VC literature. These techniques suffer from overfitting due to the limited amount of training data available from a target speaker. To alleviate this, pre-training is used for better initialization of the DNN parameters, which leads to fa...
Conference Paper
Full-text available
Obtaining aligned spectral pairs from non-parallel data for a stand-alone Voice Conversion (VC) technique is a challenging research problem. An unsupervised alignment algorithm, namely, the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA), iteratively tries to align the spectral features by minimizing th...
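As a rough illustration of the INCA-style loop described above (simplified for this sketch: a single global least-squares linear transform plays the role of the conversion step, whereas real systems use GMM or DNN mappings and typically iterate in both directions; the function name is made up), the following numpy code alternates a nearest-neighbor pairing step with a conversion step.

import numpy as np

def inca_align(source, target, iterations=5):
    """source, target: (frames, dim) spectral features from non-parallel utterances."""
    converted = source.copy()
    for _ in range(iterations):
        # Nearest-neighbor search: pair every converted source frame with its
        # closest target frame (Euclidean distance).
        dists = np.linalg.norm(converted[:, None, :] - target[None, :, :], axis=-1)
        nn_idx = dists.argmin(axis=1)
        pairs_tgt = target[nn_idx]
        # Conversion step: least-squares linear map from source frames to their paired targets.
        W, *_ = np.linalg.lstsq(source, pairs_tgt, rcond=None)
        converted = source @ W
    return nn_idx  # frame-level alignment after the final iteration

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    idx = inca_align(rng.normal(size=(100, 24)), rng.normal(size=(120, 24)))
    print(idx[:10])
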
Article
Alignment is a key step before learning a mapping function between a source and a target speaker’s spectral features in various state-of-the-art parallel data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters o...
Conference Paper
Full-text available
Voice Conversion (VC) requires an alignment of the spectral features before learning the mapping function, due to the speaking rate variations across the source and target speakers. To address this issue, the idea of training two parallel networks with the use of speaker-independent representation was proposed. In this paper, we explore the unsuper...
Conference Paper
Full-text available
This study presents a significant extension to our recently proposed MMSE-GAN architecture (accepted in INTERSPEECH 2018 and ICASSP 2018) in the framework of Discover GAN (DiscoGAN) for the cross-domain whisper-to-speech conversion task.
Conference Paper
Full-text available
In non-parallel Voice Conversion (VC) with the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) algorithm, the occurrence of one-to-many and many-to-one pairs in the training data deteriorates the performance of the stand-alone VC system. Work on handling these pairs during training is less e...
Conference Paper
Full-text available
Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. Obtaining corresponding speech frames from both the source and target speakers before learning the mapping function is a key step in the stand-alone non-parallel VC task. Obtaining such corresponding pairs is more challenging due to the fact that bo...
Conference Paper
Full-text available
The murmur produced by the speaker and captured by the Non-Audible Murmur (NAM) microphone, one of the Silent Speech Interface (SSI) techniques, suffers from speech quality degradation. This is due to the lack of the radiation effect at the lips and the lowpass nature of the soft tissue, which attenuates high-frequency information. In this work, a novel...
Conference Paper
Full-text available
We propose a novel F0 estimation algorithm that initially estimates the glottal closure instants (GCIs) or pitch and then computes the corresponding fundamental frequency (F0). The proposed method eliminates the assumption that F0 is constant over a segment of short duration (i.e., 20-30 ms). We use our previously proposed novel filtering-based app...
Conference Paper
Full-text available
Development of text-independent Voice Conversion (VC) has gained increasing research interest over the last decade. Alignment of the source and target speakers' spectral features before learning the mapping function is the challenging step in developing text-independent VC, as the two speakers have uttered different utterances from the same o...
Conference Paper
Voice Conversion (VC) is a technique that converts the perceived speaker identity from a source speaker to a target speaker. Given a source and target speakers' parallel training speech database in text-dependent VC, the first task is to align the source and target speakers' spectral features at the frame level before learning the mapping function. The accu...
Conference Paper
Full-text available
A voice conversion (VC) technique modifies a speech utterance spoken by a source speaker to make it sound as if a target speaker were speaking. Gaussian Mixture Model (GMM)-based VC is a state-of-the-art method. It finds the mapping function by modeling the joint density of the source and target speakers with a GMM, converting spectral features frame-wise. A...
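For context, the joint-density GMM mapping mentioned here can be sketched compactly with scikit-learn (a generic, textbook-style implementation under standard assumptions, with made-up helper names, not the exact setup evaluated in the paper): a GMM is fitted on stacked, pre-aligned source/target frames, and a new source frame is converted with the mixture-weighted conditional mean E[y | x].

import numpy as np
from sklearn.mixture import GaussianMixture

def train_jd_gmm(src, tgt, components=8):
    """src, tgt: aligned (frames, dim) spectral features; returns a fitted joint GMM."""
    joint = np.hstack([src, tgt])                       # stack into (frames, 2*dim)
    return GaussianMixture(components, covariance_type="full",
                           random_state=0).fit(joint)

def convert_frame(gmm, x, dim):
    """Map one source frame x (dim,) to an estimated target frame (dim,)."""
    resp = np.zeros(gmm.n_components)
    for m in range(gmm.n_components):
        # Posterior responsibility of mixture m given the source part of the joint vector.
        mu_x = gmm.means_[m][:dim]
        Sxx = gmm.covariances_[m][:dim, :dim]
        diff = x - mu_x
        resp[m] = gmm.weights_[m] * np.exp(
            -0.5 * diff @ np.linalg.solve(Sxx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
    resp /= resp.sum()
    y_hat = np.zeros(dim)
    for m in range(gmm.n_components):
        # Conditional mean of the target part given the source frame, per mixture.
        mu_x, mu_y = gmm.means_[m][:dim], gmm.means_[m][dim:]
        S = gmm.covariances_[m]
        Sxx, Syx = S[:dim, :dim], S[dim:, :dim]
        y_hat += resp[m] * (mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
    return y_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.normal(size=(500, 4)), rng.normal(size=(500, 4))  # toy aligned data
    gmm = train_jd_gmm(src, tgt)
    print(convert_frame(gmm, src[0], dim=4))
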
Article
We propose a novel application based on acoustic-to-articulatory inversion for quality assessment of voice-converted speech. The ability of humans to speak effortlessly requires coordinated movements of various articulators, muscles, etc. This effortless movement contributes towards naturalness, intelligibility, and speaker identity, which is pa...
Conference Paper
Phonetic segmentation plays a key role in developing various speech applications. In this work, we propose to use various features for the automatic phonetic segmentation task with forced Viterbi alignment and compare their effectiveness. We propose to use novel multiscale fractal dimension-based features concatenated with Mel-Frequency Cepstral Coeffi...
Conference Paper
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from the field of speech recognition to speech synthesis. In this paper, we have applied the HMM-based Speech Synthesis (HTS) method to Gujarati (one of the official languages of India). Adaptation and evaluation of HTS for the Gujarati language have been done here...
Conference Paper
Full-text available
We propose to use multiscale fractal dimension (MFD) as components of feature vectors for automatic speech recognition (ASR) especially in low resource languages. Speech, which is known to be a nonlinear process, can be efficiently represented by extracting some nonlinear properties, such as fractal dimension, from the speech segment. During speech...
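As a concrete example of estimating a fractal dimension from a speech frame, here is a short numpy sketch of Higuchi's method. Note the hedge: this is a commonly used single-scale FD estimator offered only for illustration, not necessarily the multiscale formulation used in the paper, and the function name is illustrative.

import numpy as np

def higuchi_fd(x, k_max=8):
    """Estimate the fractal dimension of a 1-D signal frame via Higuchi's method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    lengths = []
    for k in range(1, k_max + 1):
        lk = []
        for m in range(k):
            idx = np.arange(m, n, k)                     # sub-sampled curve starting at offset m
            if len(idx) < 2:
                continue
            dist = np.abs(np.diff(x[idx])).sum()         # curve length at scale k
            norm = (n - 1) / ((len(idx) - 1) * k)        # length normalisation factor
            lk.append(dist * norm / k)
        lengths.append(np.mean(lk))
    # The fractal dimension is the slope of log(L(k)) against log(1/k).
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, k_max + 1)), np.log(lengths), 1)
    return slope

if __name__ == "__main__":
    t = np.linspace(0, 1, 400)
    frame = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
    print(higuchi_fd(frame))   # roughly between 1 (smooth) and 2 (very irregular)
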
Thesis
Full-text available
Hidden Markov Models (HMM) have been applied successfully to Automatic Speech Recognition (ASR) problems and are currently applied in speech synthesis applications. In this thesis, HMM-based Speech Synthesis System (HTS) for TTS is understood in detail and applied to Gujarati language. In particular, for HTS implementation, issues related to charac...

Questions (5)
Question
This question is in the context of the bilinear frequency warping method. I would like to know why it is called "bilinear".
Thanks in advance.
Question
In a preference test, subjects have to choose one item out of two. When we present the results, how do we calculate the 95% confidence interval, given that the values are preference scores, i.e., either 1 or 0 depending on the preference?
Question
What is the difference between speech/non-speech detection and voice activity detection?
Question
Can anyone please explain the difference between Mel Frequency Cepstral Coefficients (MFCC) and Mel Cepstral Coefficients (MCC)?
Question
In a GMM, how can we understand that a linear combination of diagonal-covariance Gaussians is capable of modeling the correlations between feature vectors? How can one visualize this?

Projects (5)
Project
Using deep learning approaches to increase the intelligibility of speech signals
Archived project
Development of TTS