About
Publications: 29
Reads: 4,291
Citations: 189
Publications (29)
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the s...
Several fast text-to-speech (TTS) models have been proposed for real-time processing, but there is room for improvement in speech quality. Meanwhile, there is a mismatch between the loss function for training and the mean opinion score (MOS) for evaluation, which may limit the speech quality of TTS models. In this work, we propose a method that can...
Recent work using phonetic posteriorgrams (PPGs) for non-parallel voice conversion has significantly increased the usability of voice conversion, since the source and target databases are no longer required to have matching content. In this approach, the PPGs are used as the linguistic bridge between source and target speaker features. However, this...
The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, class-based histog...
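As a rough illustration of the basic (not class-based) histogram equalization idea mentioned in this abstract, the sketch below maps each test feature value to the reference value that sits at the same empirical CDF position. It is a minimal sketch and not the authors' method: the function name `histogram_equalize`, the `(rank - 0.5) / N` order-statistics CDF estimate, and the nearest-quantile inverse mapping are all assumptions chosen for brevity.

```python
from bisect import bisect_right


def histogram_equalize(test_values, reference_values):
    """Map test values onto the reference distribution (hypothetical helper).

    Assumption: one feature dimension at a time, with the test CDF estimated
    from order statistics as (rank - 0.5) / N.
    """
    ref_sorted = sorted(reference_values)
    test_sorted = sorted(test_values)
    n_test = len(test_sorted)
    n_ref = len(ref_sorted)

    equalized = []
    for x in test_values:
        # Rank of x among the test values (number of test values <= x).
        rank = bisect_right(test_sorted, x)
        # Empirical test CDF at x.
        cdf = (rank - 0.5) / n_test
        # Inverse reference CDF: reference order statistic at that quantile.
        idx = min(n_ref - 1, max(0, int(cdf * n_ref)))
        equalized.append(ref_sorted[idx])
    return equalized


# A shifted and scaled test distribution is mapped back onto the reference.
reference = [0, 1, 2, 3, 4]
noisy_test = [10, 12, 14, 16, 18]
restored = histogram_equalize(noisy_test, reference)
```

Because the mapping depends only on ranks, any monotonic distortion of the test features is undone in this toy case. With very short test data the rank-based CDF estimate becomes unreliable, which is the kind of limitation several abstracts in this list address.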
The main advantage of statistical parametric speech synthesis is its flexibility in changing voice characteristics. A personalized text-to-speech (TTS) system can be implemented by combining a speech synthesis system and a voice transformation system, and it is widely used in many application areas. It is known that the fundamental frequency and...
In this letter, we propose a new probabilistic class histogram equalization technique for noise robust speech recognition. To cope with the sparse data problem, which is common in the case of short test data, the proposed histogram equalization technique employs the posterior mean estimator, a kind of Bayesian estimator, for the test CDF. Experiment...
In this paper, a new discriminative likelihood score weighting technique is proposed for speaker identification. The proposed method employs a discriminative weighting of frame-level log-likelihood scores with acoustic-phonetic classification in the Gaussian mixture model (GMM)-based speaker identification. Experiments performed on the Aurora noise...
Support vector machines (SVMs) have proven to be an effective approach to speaker verification. An appropriate selection of the kernel function is a key issue in SVM-based classification. In this letter, a new SVM-based speaker verification method utilizing weighted kernels in the Gaussian mixture model supervector space is proposed. The weigh...
In this letter, we propose a novel statistical voice activity detection (VAD) technique. The proposed technique employs probabilistically derived multiple acoustic models to effectively optimize the weights on frequency domain likelihood ratios with the discriminative training approach for more accurate voice activity detection. Experiments perform...
The role of the statistical model-based voice activity detector (SMVAD) is to detect speech regions from input signals using the statistical models of noise and noisy speech. The decision rule of SMVAD is based on the likelihood ratio test (LRT). The LRT-based decision rule may cause detection errors because of statistical properties of noise and s...
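The likelihood ratio test mentioned in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the detector described in the paper: it assumes the commonly used complex Gaussian model for noise and noisy speech spectra, takes the a priori SNR as a given input rather than estimating it, and uses a hypothetical threshold of 0.5. The names `log_likelihood_ratio` and `is_speech` are illustrative.

```python
import math


def log_likelihood_ratio(power, noise_var, a_priori_snr):
    """Log likelihood ratio for one frequency bin (assumed Gaussian model).

    power:        observed spectral power |X_k|^2
    noise_var:    noise variance for this bin
    a_priori_snr: speech-to-noise variance ratio, assumed known here
    """
    a_posteriori_snr = power / noise_var
    return (a_posteriori_snr * a_priori_snr / (1.0 + a_priori_snr)
            - math.log1p(a_priori_snr))


def is_speech(frame_power, noise_var, a_priori_snr, threshold=0.5):
    """Declare speech if the mean log likelihood ratio exceeds the threshold.

    The threshold value is a hypothetical choice for illustration only.
    """
    llrs = [
        log_likelihood_ratio(p, n, xi)
        for p, n, xi in zip(frame_power, noise_var, a_priori_snr)
    ]
    return sum(llrs) / len(llrs) > threshold


# A frame well above the noise floor versus a frame at the noise floor.
loud_frame = is_speech([10.0, 10.0], [1.0, 1.0], [9.0, 9.0])
quiet_frame = is_speech([1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

Averaging the per-bin log ratios gives every frequency bin equal weight; the discriminative weighting described in a neighboring abstract would replace that uniform average with trained weights.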
We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recogniti...
The selection of effective features is especially important in achieving highly accurate speech recognition. Although the mel-cepstrum is a popular and effective feature for speech recognition, it is still unclear whether the filterbank adopted in the mel-cepstrum always produces the optimal performance regardless of the phonetic environment of any sp...
In this paper, we introduce a new histogram equalization-based environmental model adaptation method for robust speech recognition in noisy environments. The proposed method adapts initially-trained acoustic mean models of a speech recognizer into the environmentally matched models. The covariance models are adapted by using utterance-level local c...
In this letter, a new environmental model adaptation method is proposed for robust speech recognition under noisy environments. The proposed method adapts initial acoustic models of a speech recognizer into environmentally matched models by utilizing the histogram equalization technique. Experiments performed on the Aurora noisy environment showed...
APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference. 4-7 October 2009. Sapporo, Japan. Poster session: Automatic Speech Recognition (6 October 2009). Statistical model-based voice activity detection (VAD) is an algorithm that is robust in noisy conditions and detects speech regions from the input signal b...
In this letter, we propose a new histogram equalization technique for feature compensation in speech recognition under noisy environments. The proposed approach combines a signal-to-noise-ratio-dependent feature reconstruction method and the class histogram equalization technique to effectively reduce the acoustic mismatch present in noisy speech f...
In this letter, we propose a new histogram equalization method to compensate for acoustic mismatches mainly caused by corruption of additive noise and channel distortion in speech recognition. The proposed method employs an improved test cumulative distribution function (CDF) by more accurately smoothing the conventional order statistics-based test...
Recently, many techniques have been proposed to improve speaker identification in noisy environments. Among these techniques, we consider the feature recombination technique for the multi-band approach in noise robust speaker identification. The conventional feature recombination technique is very effective in the band-limited noise condition, but...
A new class-based histogram equalization method is proposed for robust speech recognition. The proposed method aims at not only compensating for an acoustic mismatch between training and test environments but also reducing the two fundamental limitations of the conventional histogram equalization method, the discrepancy between the phonetic distrib...
In this letter, a probabilistic class histogram equalization method is proposed to compensate for an acoustic mismatch in noise robust speech recognition. The proposed method aims not only to compensate for the acoustic mismatch between training and test environments but also to reduce the limitations of the conventional histogram equalization. It...
In this letter, a new segment-level speech/nonspeech classification method based on the Poisson polling technique is proposed. The proposed method makes two modifications to the baseline Poisson polling method to further improve the classification accuracy. One of them is to employ Poisson mixture models to more accurately represent various segme...
A new class-based histogram equalization method is proposed for robust speech recognition. The proposed method aims not only at compensating for the acoustic mismatch between training and test environments, but also at reducing the discrepancy between the phonetic distributions of training and test speech data. The algorithm utilizes multiple class-spe...
Selecting good features is especially important to achieve high speech recognition accuracy. Although the mel-cepstrum is a popular and effective feature for speech recognition, it is still unclear whether the filter-bank in the mel-cepstrum is always optimal regardless of speech recognition environments or the characteristics of specific speech data....
We propose a new approach to improve the performance of speech recognizers by utilizing acoustic-phonetic knowledge sources. We use the unvoiced, voiced, and silence (UVS) group information of the input speech signal in the conventional speech recognizer. We extract the UVS information by using a recurrent neural network (RNN), generate a rule-bas...
We propose a new method of phoneme segmentation using MLP (multi-layer perceptron). The structure of the proposed segmenter consists of three parts: preprocessor, MLP-based phoneme segmenter, and postprocessor. The preprocessor utilizes a sequence of 44 order feature parameters for each frame of speech, based on the acoustic-phonetic knowledge. The...