Fig 2 - uploaded by Nirmesh Shah
Schematic representation of the LMNN technique: (a) before and (b) after applying LMNN. Adapted from [19]. Here, the target neighbors are the neighboring feature vectors that share the same label, whereas an impostor is a neighboring feature vector that has a different label. The goal of the LMNN technique is to minimize the number of impostors via a relative distance constraint. The objective function is given by [19]:
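To make the notions of target neighbor and impostor concrete, the following minimal Python sketch counts margin violations before and after a linear transform. It is an illustration only, not the objective function from [19]: the function names, the toy data, the unit margin, and the hand-picked matrix `learned` (standing in for a metric that LMNN would learn) are all assumptions introduced here.

```python
from math import dist  # Python 3.8+


def transform(L, x):
    """Apply the linear map L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def count_impostors(points, labels, L, k=1, margin=1.0):
    """Count (point, target neighbor, impostor) triples that violate the margin.

    For every point i, its k nearest same-label points (found in the
    original space) are treated as its target neighbors. A differently
    labeled point l counts as an impostor if, after the transform, its
    squared distance to i is less than that of target neighbor j plus
    `margin`.
    """
    z = [transform(L, p) for p in points]
    violations = 0
    for i, p in enumerate(points):
        same = [j for j in range(len(points)) if j != i and labels[j] == labels[i]]
        targets = sorted(same, key=lambda j: dist(p, points[j]))[:k]
        for j in targets:
            d_target = dist(z[i], z[j]) ** 2
            for l in range(len(points)):
                if labels[l] != labels[i] and dist(z[i], z[l]) ** 2 < d_target + margin:
                    violations += 1
    return violations


# Hypothetical toy data: the two classes differ only along the first axis.
points = [[0.0, 0.0], [0.0, 3.0], [1.5, 0.0], [1.5, 3.0]]
labels = [0, 0, 1, 1]

identity = [[1.0, 0.0], [0.0, 1.0]]  # plain Euclidean distance ("before")
learned = [[2.0, 0.0], [0.0, 0.2]]   # hand-picked stand-in for a learned metric ("after")

before = count_impostors(points, labels, identity)
after = count_impostors(points, labels, learned)
print(before, after)
```

With this toy data, stretching the discriminative axis and shrinking the other pulls each target neighbor closer and pushes the differently labeled points beyond the margin, which is the effect the figure depicts.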

Source publication
Conference Paper
Obtaining aligned spectral pairs from non-parallel data for a stand-alone Voice Conversion (VC) technique is a challenging research problem. An unsupervised alignment algorithm, namely, the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA), iteratively tries to align the spectral features by minimizing th...

Context in source publication

Context 1
... key idea behind the LMNN is illustrated in Figure 2 [19]. ...

Similar publications

Article
Road traffic monitoring is very important for intelligent transportation. The detection of traffic state based on acoustic information is a new research direction. A vehicle acoustic event classification algorithm based on a sparse autoencoder is proposed to analyze the traffic state. Firstly, the multidimensional Mel-cepstrum features and energy f...
Preprint
In this work, we propose an approach that features deep feature embedding learning and hierarchical classification with a triplet loss function for Acoustic Scene Classification (ASC). On the one hand, a deep convolutional neural network is first trained to learn a feature embedding from scene audio signals. Via the trained convolutional neural net...
Preprint
Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance,...
Preprint
Music genre classification is one of the trending topics in current Music Information Retrieval (MIR) research. Since genre does not depend on the audio profile alone, we also make use of the textual content provided as lyrics of the corresponding song. We implemented a CNN-based feature extractor for spectrograms in ord...

Citations

... Obtaining the aligned spectral features in non-parallel VC is more challenging because the source and target speakers have spoken different utterances. Among the various available alignment approaches, the most popular alignment techniques are based on Nearest Neighbor (NN), for example, the state-of-the-art Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) [7,8] and its variants [9][10][11]. However, a lower % Phonetic Accuracy (PA) has been reported after the NN-based alignment techniques [7]. ...
... If the estimated and the ground truth frame-level labels are found to be the same, it is counted as a hit; if not, it is counted as false. From this, % Phonetic Accuracy (PA) is defined as [11,31]: ...
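The hit-counting described above can be sketched in Python as follows. The definition itself is elided in the excerpt, so this sketch assumes that % PA is the number of hits divided by the total number of frames, times 100; the function name and the label sequences are hypothetical.

```python
def phonetic_accuracy(estimated, reference):
    """Return % PA, assumed here to be 100 * hits / total frames."""
    if len(estimated) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    # A frame is a hit when the estimated label equals the ground truth label.
    hits = sum(1 for e, r in zip(estimated, reference) if e == r)
    return 100.0 * hits / len(reference)


# Hypothetical frame-level phone labels for five frames.
reference = ["sil", "a", "a", "t", "t"]
estimated = ["sil", "a", "e", "t", "sil"]

pa = phonetic_accuracy(estimated, reference)
print(pa)
```

Here three of the five frames agree, so the sketch reports a PA of 60%.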
Conference Paper
Nearest Neighbor (NN)-based alignment techniques are popular in non-parallel Voice Conversion (VC). The performance of NN-based alignment improves with information about the phone boundaries. However, estimating the exact phone boundary is a challenging task. If the text corresponding to the utterance is available, a Hidden Markov Model (HMM) can be used to identify the phone boundaries. However, it requires a large amount of training data, which is difficult to collect in realistic VC scenarios. Hence, we propose to exploit a Spectral Transition Measure (STM)-based alignment technique that does not require a priori training data. The idea behind STM is that neurons in the auditory or visual cortex respond more strongly to transitional stimuli than to steady-state stimuli. The phone boundaries estimated using the STM algorithm are then applied to the NN technique to obtain the aligned spectral features of the source and target speakers. The proposed STM+NN alignment technique gives, on average, a 13.67% relative improvement in phonetic accuracy (PA) compared to the NN-based alignment technique. The improvement in %PA after alignment is positively reflected in better speech quality and speaker similarity of the converted voice (in particular, relative improvements of 13.63% and 13.26%, respectively).
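As a point of reference for the NN-based baseline mentioned in this abstract, a plain nearest-neighbor alignment can be sketched as below. This is a minimal illustration under stated assumptions: Euclidean distance is used, the function name `nn_align` and the two-dimensional toy frames are hypothetical, and the way the STM-estimated phone boundaries constrain the search is not described in the excerpt, so it is not modeled here.

```python
from math import dist  # Python 3.8+


def nn_align(source, target):
    """Pair every source frame with the index of its nearest target frame."""
    pairs = []
    for i, s in enumerate(source):
        # Exhaustive search over all target frames (Euclidean distance).
        j = min(range(len(target)), key=lambda t: dist(s, target[t]))
        pairs.append((i, j))
    return pairs


# Hypothetical 2-D "spectral" frames; real features would have more dimensions.
source = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
target = [[3.9, 4.2], [0.1, -0.1], [0.8, 1.1], [1.0, 1.3]]

pairs = nn_align(source, target)
print(pairs)
```

Because each source frame is matched independently, several source frames may map to the same target frame, and some target frames may never be selected.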
Thesis
Understanding how a particular speaker produces speech, and mimicking one's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity in a given speech utterance from a source speaker to a particular target speaker without changing the linguistic content. Building a stand-alone VC system consists of two stages, namely, training and testing. First, speaker-dependent features are extracted from both speakers' training data. These features are first time-aligned, and corresponding pairs are obtained. Then a mapping function is learned from these aligned feature pairs. Once the training step is done, during the testing stage, features are extracted from the source speaker's held-out data. These features are converted using the mapping function. The converted features are then passed through the vocoder, which produces the converted voice. Hence, there are primarily three components in building a stand-alone VC system, namely, the alignment step, the mapping function, and the speech analysis/synthesis framework. The major contributions of this thesis are towards identifying the limitations of existing techniques, improving them, and developing new approaches for the mapping and alignment stages of VC. In particular, a novel Amplitude Scaling (AS) method is proposed for frequency warping (FW)-based VC, which linearly transfers the amplitude of the frequency-warped spectrum using the knowledge of a Gaussian Mixture Model (GMM)-based converted spectrum without adding any spurious peaks. To overcome the issue of overfitting in Deep Neural Network (DNN)-based VC, the idea of pre-training is popular. However, this pre-training is time-consuming and requires a separate network to learn the parameters of the network.
Hence, whether this additional pre-training step could be avoided by using recent advances in deep learning is investigated in this thesis. The ability of the Generative Adversarial Network (GAN) to estimate the probability density function (pdf) for generating realistic samples corresponding to the given source speaker's utterance has resulted in a significant performance improvement in the area of VC. The key limitation of the vanilla GAN-based system is that it may generate samples that do not correspond to the given source speaker's utterance. To address this issue, a Minimum Mean Squared Error (MMSE) regularized GAN (i.e., MMSE-GAN) is proposed in this thesis. Obtaining corresponding feature pairs in the context of both parallel and non-parallel VC is a challenging task. In this thesis, the strengths and limitations of the different existing alignment strategies are identified, and new alignment strategies are proposed for both the parallel and non-parallel VC tasks. Wrongly aligned pairs will affect the learning of the mapping function, which in turn will deteriorate the quality of the converted voices. In order to remove such wrongly aligned pairs from the training data, an outlier removal-based pre-processing technique is proposed for parallel VC. In the case of non-parallel VC, a theoretical convergence proof is developed for the popular alignment technique, namely, the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA). In addition, the use of dynamic features along with static features to calculate the Nearest Neighbor (NN) aligned pairs in the existing INCA and Temporal context (TC) INCA is also proposed. Furthermore, a novel distance metric is learned for the NN-based search strategies, as Euclidean distance may not correlate well with the perceptual distance.
Moreover, a computationally simple Spectral Transition Measure (STM)-based phone alignment technique that does not require any a priori training data is also proposed for non-parallel VC. Both the parallel and the non-parallel alignment techniques will generate one-to-many and many-to-one feature pairs. These one-to-many and many-to-one pairs will affect the learning of the mapping function and result in muffling and oversmoothing effects in VC. Hence, an unsupervised Vocal Tract Length Normalization (VTLN) posteriorgram and a novel inter mixture weighted GMM posteriorgram are proposed as speaker-independent representations in the two-stage mapping network in order to remove the alignment step from the VC framework. In this thesis, an attempt has also been made to use the acoustic-to-articulatory inversion (AAI) technique for the quality assessment of voice-converted speech. Lastly, the proposed MMSE-GAN architecture is extended in the form of Discover GAN (i.e., MMSE DiscoGAN) for cross-domain VC applications (w.r.t. attributes of the speech production mechanism), namely, Non-Audible Murmur (NAM)-to-WHiSPer (NAM2WHSP) speech conversion and WHiSPer-to-SPeeCH (WHSP2SPCH) conversion. Finally, the thesis summarizes the overall work presented and the limitations of the various approaches, along with future research directions.
Article
Alignment is a key step before learning a mapping function between a source and a target speaker’s spectral features in various state-of-the-art parallel data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters of the mapping function from their true values and hence negatively affect the learning of the mapping function during the training phase of the VC task. To the best of the authors’ knowledge, the effect of outliers (and hence, their removal) on the quality of the converted voice has not been explored much in the VC literature. Recent research has shown the effectiveness of outlier removal as a pre-processing step in VC. In this paper, we extend this study with a detailed theory and analysis. The proposed method uses a score distance that is estimated using Robust Principal Component Analysis (ROBPCA) to detect the outliers. In particular, the outliers are determined using a fixed cut-off on the score distances, based on the degrees of freedom in a chi-squared distribution, which is speaker-pair independent. The fixed cut-off is due to the assumption that the score distances follow the normal (i.e., Gaussian) distribution. However, this is a weak statistical assumption even in cases where a large number of samples is available. Hence, in this paper, we propose to explore speaker-pair dependent cut-offs to detect the outliers. In addition, we have presented our results on two state-of-the-art databases, namely, CMU-ARCTIC and Voice Conversion Challenge (VCC) 2016, by developing various state-of-the-art methods in VC. In particular, we have presented the effectiveness of outlier removal on Gaussian Mixture Model (GMM), Artificial Neural Network (ANN), and Deep Neural Network (DNN)-based VC techniques.
Furthermore, we have presented subjective and objective evaluations using a 95% confidence interval for the statistical significance of the tests. We obtained an average 0.56% relative reduction in Mel Cepstral Distortion (MCD) with the proposed outlier removal approach as a pre-processing step. In particular, with the proposed speaker-pair dependent cut-off, we have observed relative improvements of 12.24% and 30.51% in speech quality, and absolute improvements of 39.7% and 4.27% in speaker similarity, for CMU-ARCTIC and VCC 2016, respectively.
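Since this abstract reports both relative and absolute changes, the distinction can be illustrated with a short Python sketch. The usual definitions are assumed (relative change as a percentage of the baseline, absolute change as a plain difference); the function names and all numbers below are hypothetical and are not taken from the paper.

```python
def relative_change(baseline, proposed):
    """Change as a percentage of the baseline value."""
    return 100.0 * (proposed - baseline) / baseline


def absolute_change(baseline, proposed):
    """Plain difference, in the same units as the inputs."""
    return proposed - baseline


# Hypothetical MCD values in dB (lower is better).
mcd_relative = relative_change(5.0, 4.9)

# Hypothetical speaker-similarity scores in percent (higher is better).
sim_absolute = absolute_change(50.0, 60.0)
sim_relative = relative_change(50.0, 60.0)

print(round(mcd_relative, 2), sim_absolute, sim_relative)
```

In this toy case the same similarity gain reads as 10 percentage points in absolute terms but 20% in relative terms, which is why the two kinds of figures should not be compared directly.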