Hemant Patil

Dhirubhai Ambani Institute of Information and Communication Technology | DA-IICT

About

244
Publications
47,388
Reads
1,975
Citations

Publications (244)
Article
Full-text available
Emotional contagion is said to occur when an origin (i.e., any sensory stimuli) emanating emotions causes the observer to feel the same emotions. In this paper, we explore the identification and quantification of emotional contagion produced by music in human beings. We survey 50 subjects who answer: what type of music they hear when they are happy...
Conference Paper
Full-text available
Automatic Speech Recognition (ASR) usually works better in close-talking microphone environments than in far-field conditions. A major challenge in far-field ASR systems is handling the background noise, multipath reflections, and reverberation that degrade the quality of the speech signal. To that effect, we propose Teager...
Article
Extensive use of Intelligent Personal Assistants (IPA) and biometrics in our day-to-day life asks for privacy preservation while dealing with personal data. To that effect, efforts have been made to preserve the personally identifiable characteristics from human voice using different speaker anonymization techniques. In this paper, we propose Cycle...
Chapter
Voice Liveness Detection (VLD) has emerged as a successful technique to detect spoofing attacks in Automatic Speaker Verification (ASV) systems. The presence of pop noise in the speech signal of a live speaker provides the basic cue to distinguish between genuine and spoofed speech. Pop noise is produced due to the spontaneous breathing while uttering a c...
Chapter
In this paper, the authors propose the spectral root cepstral coefficients (SRCC) feature set to develop an effective countermeasure system for replay attacks on voice assistants (VAs). Experiments are performed on the ReMASC dataset, which is specifically designed for the replay attack detection task. The logarithm operation in MFCC extraction is replaced by pow...
Article
In the scope of voice biometrics, the term replay attack (RA) refers to the dishonest attempt made by an impostor to spoof someone else’s identity by replaying the subject’s previously recorded speech close to the Automatic Speaker Verification (ASV) system under attack. State-of-the-art strategies for RA detection, such as the Enhanced Teager Ene...
Article
In this article, we propose the Cochlear Filter Cepstral Coefficient-Instantaneous Frequency using Energy Separation Algorithm (CFCCIF-ESA) feature set to detect speech synthesis (SS) and voice conversion (VC)-based spoofing attacks. The SS- and VC-based spoof generation techniques predominantly use the magnitude spectrum information,...
Conference Paper
The performance of a Voice Assistant (VA) deteriorates notably when tested on whispered speech. Hence, separate systems are being developed for whisper. To that effect, detecting whether the incoming signal is whisper or normal speech (especially with low latency) in noisy environments is desirable from the model switching point...
Article
Objective quality assessment aims towards evaluating the perceptual quality of a signal using a machine-based algorithm. Due to different challenges involved in the subjective evaluation of speech quality, it is necessary to develop objective measures. The goal of any non-intrusive quality assessment metric for noise-suppressed speech is to assess...
Article
Recently, we have witnessed Deep Learning methodologies gaining significant attention for severity-based classification of dysarthric speech. Detecting dysarthria and quantifying its severity are of paramount importance in various real-life applications, such as the assessment of patients’ progression in treatments, which includes an adequate planning...
Conference Paper
Full-text available
Several Neural Network (NN)-based representation techniques have already been proposed for the Query-by-Example Spoken Term Detection (QbE-STD) task. The recent advancement in Generative Adversarial Network (GAN) for several speech technology applications motivated us to explore the GAN in QbE-STD. In this work, we propose to exploit GAN with the regu...
Preprint
Full-text available
Several Neural Network (NN)-based representation techniques have already been proposed for the Query-by-Example Spoken Term Detection (QbE-STD) task. The recent advancement in Generative Adversarial Network (GAN) for several speech technology applications motivated us to explore the GAN in QbE-STD. In this work, we propose to exploit GAN with the regu...
Article
Replay attacks pose a great threat to Automatic Speaker Verification (ASV) systems. This paper introduces Amplitude Modulation and Frequency Modulation-based features for the replay Spoof Speech Detection (SSD) task. In this context, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using Energy Separation Algorithm (...
Preprint
Full-text available
Recently, Generative Adversarial Network (GAN)-based methods have shown remarkable performance for Voice Conversion and WHiSPer-to-normal SPeeCH (WHSP2SPCH) conversion. One of the key challenges in WHSP2SPCH conversion is the prediction of the fundamental frequency (F0). Recently, authors have proposed the state-of-the-art method Cycle-Consistent Gene...
Article
Infants are difficult to understand as they cannot communicate their requirements. This motivates us to decode their language in meaningful interpretations so that adults can understand the requirements of their children. In this chapter, the cry analysis techniques used so far are discussed and some experiments in this direction are reported. Spec...
Chapter
The infant cry classification is a socially relevant problem where the task is to classify the normal versus pathological cry signals. Since the cry signals are very different from the speech signals, there is a need for better feature representation for infant cry signals. Recently, representation learning has become very popular in various signal processi...
Chapter
Sound plays a crucial role in the development and evolution of nature, where animals protect their species from the other animals via alarming sounds and learn to identify their species. In human beings, the linguistic development takes place after the infant is 2 years old, before which soothing music forms a part of their early arrival in this wo...
Chapter
Infants are difficult to understand as they cannot communicate their requirements. This motivates us to decode their language in meaningful interpretations so that adults can understand the requirements of their children. In this chapter, the cry analysis techniques used so far are discussed and some experiments in this direction are reported. Spec...
Article
Full-text available
In this paper, we propose the combination of Amplitude Modulation and Frequency Modulation (AM-FM) features for replay Spoof Speech Detection (SSD) task. The AM components are known to be affected by noise (in this case, due to replay mechanism). In particular, we exploit this damage in AM component to corresponding Instantaneous Frequency (IF) for...
Article
The vulnerability of Automatic Speaker Verification (ASV) systems to spoofing or presentation attacks is still an open security issue. In this context, replay spoofing attacks pose a great threat to an ASV system since they can be easily performed (using a playback device, and without needing any technical skill). In this paper, we analyze replay s...
Conference Paper
In this paper, we present a multi-domain speech conversion technique by proposing a Multi-domain Speech Conversion Network (MSpeC-Net) architecture for solving the less-explored area of Non-Audible Murmur-to-SPeeCH (NAM2-SPCH) conversion. The murmur produced by the speaker and captured by the NAM microphone undergoes speech quality degradation. Hen...
Chapter
In the absence of vocal fold vibrations, movement of articulators which produced the respiratory sound can be captured by the soft tissue of the head using the nonaudible murmur (NAM) microphone. NAM is one of the silent speech interface techniques, which can be used by the patients who are suffering from the vocal fold-related disorders. Though NA...
Article
Full-text available
In recent years, automatic speaker verification (ASV) has been used extensively for voice biometrics. This has led to increased interest in securing these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and imperso...
Conference Paper
Full-text available
Voice Conversion (VC) converts the speaking style of a source speaker to the speaking style of a target speaker by preserving the linguistic content of a given speech utterance. Recently, Cycle Consistent Adversarial Network (CycleGAN), and its variants have become popular for non-parallel VC tasks. However, CycleGAN uses two different generators a...
Chapter
Full-text available
Acoustic Scene Classification (ASC) is the task of assigning a semantic label to a given audio sample recorded in different acoustic environments. Sounds carry significant information about everyday environmental scenes, such as bus, tram, airport, concert hall, etc. Thus, extracting the sound signals of these acoustic scenes can be useful to dete...
Article
A speech spectrum is known to be changed by the variations in the length of the vocal tract of a speaker. This is because of the fact that speech formants are inversely related to the vocal tract length (VTL). The process of compensating spectral variation due to the length of the vocal tract is known as Vocal Tract Length Normalization (VTLN). VTL...
Conference Paper
Full-text available
Recently, Convolutional Neural Network (CNN)-based Generative Adversarial Networks (GANs) have been used for the Whisper-to-Normal Speech (i.e., WHSP2SPCH) conversion task. These CNN-based GANs are significantly difficult to train in terms of computational complexity. The goal of the generator in a GAN is to map the features of the whispered speech to that of th...
Conference Paper
Full-text available
Though whisper is a typical way of natural speech communication, it is different from normal speech with respect to speech production and perception. Recently, authors have proposed a Generative Adversarial Network (GAN)-based architecture (namely, DiscoGAN) to discover such cross-domain relationships for whisper-to-normal speech (WHSP2SPCH) co...
Conference Paper
Full-text available
Nearest Neighbor (NN)-based alignment techniques are popular in non-parallel Voice Conversion (VC). The performance of NN-based alignment improves with the information about phone boundary. However, estimating the exact phone boundary is a challenging task. If text corresponding to the utterance is available, the Hidden Markov Model (HMM) can b...
Conference Paper
Full-text available
Recently, Deep Neural Network (DNN)-based Voice Conversion (VC) techniques have become popular in the VC literature. These techniques suffer from the issue of overfitting due to the limited amount of available training data from a target speaker. To alleviate this, pre-training is used for better initialization of the DNN parameters, which leads to fa...
Conference Paper
Full-text available
Obtaining aligned spectral pairs in case of non-parallel data for stand-alone Voice Conversion (VC) technique is a challenging research problem. Unsupervised alignment algorithm, namely, an Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) iteratively tries to align the spectral features by minimizing th...
Article
Alignment is a key step before learning a mapping function between a source and a target speaker’s spectral features in various state-of-the-art parallel data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters o...
Conference Paper
Full-text available
In this paper, we propose the use of Amplitude Modulation and Frequency Modulation (AM-FM) features for the replay detection task. In an AM-FM signal, the AM component is known to be severely affected by noise (in this case, due to the replay mechanism), which is exploited in the proposed feature extraction. In particular, we explore this damage in AM component to cor...
Conference Paper
In this study, we explore the use of Convolutional Neural Networks (CNN) for replay spoof detection in Automatic Speaker Verification (ASV) system. The Amplitude and Frequency Modulation (AM-FM) feature sets obtained from the Hilbert transform (HT) and Energy Separation Algorithm (ESA) are used as the front end. We have observed the effect of maxpo...
Conference Paper
Full-text available
Voice Conversion (VC) requires an alignment of the spectral features before learning the mapping function, due to the speaking rate variations across the source and target speakers. To address this issue, the idea of training two parallel networks with the use of speaker-independent representation was proposed. In this paper, we explore the unsuper...
Conference Paper
Full-text available
This study presents a significant extension to our recently proposed MMSE-GAN architecture (accepted in INTERSPEECH 2018 and ICASSP 2018) in the framework of Discover GAN (DiscoGAN) for the cross-domain whisper-to-speech conversion task.
Conference Paper
Full-text available
In non-parallel Voice Conversion (VC) with the Iterative combination of Nearest Neighbor search step and Conversion step Alignment (INCA) algorithm, the occurrence of one-to-many and many-to-one pairs in the training data will deteriorate the performance of the stand-alone VC system. The work on handling these pairs during the training is less e...
Conference Paper
Full-text available
Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. Obtaining corresponding speech frames from both the source and target speakers before learning the mapping function in the non-parallel VC is a key step in the standalone VC task. Obtaining such corresponding pairs is more challenging due to the fact that bo...
Conference Paper
The infant cry classification is a socially-relevant problem where the task is to classify the normal vs. pathological cry signals. Since the cry signals are very different from the speech signals in terms of temporal and spectral content, there is a need for better feature representation for infant cry signals. In this paper, we propose to use uns...
Conference Paper
The increased use of voice biometrics for various security applications motivated the authors to investigate different countermeasures for the hazard of spoofing attacks, where the attacker tries to imitate the genuine speaker. Replay is the most accessible spoofing attack. Past studies have ignored phase information for various speech processing...
Conference Paper
Full-text available
Advances in Automatic Speaker Verification (ASV) systems for voice biometrics come with the danger of spoofing attacks. The replay attack is the most accessible attack, where the attacker imitates the speaker's identity by replaying the pre-recorded speech samples of the target speaker. Most of the conventional features, such as Mel Frequenc...
Conference Paper
Full-text available
Among various types of spoofing attacks, replay poses a greater threat to the Automatic Speaker Verification (ASV) system. In our previous study, we found that the replay spoof detection is effective when the human auditory system is modeled by power law nonlinearity. In this paper, we design the replay spoof detection system using power function-based...
Conference Paper
Full-text available
In this paper, we present a brief survey of various approaches for recently introduced replay attack detection for Automatic Speaker Verification (ASV). The replay spoofing attack is the most challenging attack to detect, as only a few minutes of audio samples are required to replay a genuine speaker’s voice to get access to the ASV systems. Due to large...
Conference Paper
Full-text available
A Speech Enhancement (SE) system deals with improving the perceptual quality and preserving the speech intelligibility of the noisy mixture. The Time-Frequency (T-F) masking-based SE using a supervised learning algorithm, such as a Deep Neural Network (DNN), has outperformed the traditional SE techniques. However, the notable difference observed b...
Conference Paper
Full-text available
Replay poses a greater threat to the Automatic Speaker Verification (ASV) system than any other spoofing attack, as it requires neither specific expertise nor sophisticated equipment. In this paper, we propose a novel countermeasure by modeling the replayed speech as a convolution of genuine speech with additional impulse responses (due to th...
Conference Paper
Full-text available
The murmur produced by the speaker and captured by the Non-Audible Murmur (NAM) microphone, one of the Silent Speech Interface (SSI) techniques, suffers from speech quality degradation. This is due to the lack of radiation effect at the lips and the lowpass nature of the soft tissue, which attenuates the high frequency-related information. In this work, a novel...
Conference Paper
Full-text available
Replay attacks pose the most difficult challenge for the development of countermeasures for spoofed speech detection (SSD) systems. Earlier researchers mainly used vocal tract-based (segmental) information for replay detection. However, during replay, excitation source-based information also gets affected (in particular, degradation in source harmon...