-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
-
ABSTRACT: In this paper, we present a novel approach to relaxing the stereo-data constraint required by a series of algorithms for noise-robust speech recognition. As a demonstration with the SPLICE algorithm, we use HMM-based speech synthesis to generate pseudo-clean features that replace the ideal clean features from one of the stereo channels. Experimental results on the Aurora2 database show that the performance of our approach is comparable with that of SPLICE. Further improvements are achieved by appending a bias adaptation algorithm to handle unknown environments. Relative word error rate reductions of 66% and 24% are achieved over the baseline systems in the clean-training and multi-training conditions, respectively.
Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on; 04/2010 · 4.63 Impact Factor
-
INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010; 01/2010
-
ABSTRACT: This paper proposes a state duration modeling method using a full covariance matrix for HMM-based speech synthesis. In this method, a full covariance matrix, instead of the conventional diagonal covariance matrix, is adopted in the multi-dimensional Gaussian distribution to model the state durations of each context-dependent phoneme. At the synthesis stage, the state durations are predicted using the clustered context-dependent distributions with full covariance matrices. Experimental results show that speech synthesized using full-covariance state duration models is more natural than that of the conventional method when the speaking rate of the synthesized speech is changed.
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on; 05/2009 · 4.63 Impact Factor
-
ABSTRACT: This paper presents a method for modeling the dependency between F0 and spectral features in an HMM-based parametric speech synthesis system. In conventional systems these two features are modeled as two independent streams, which is inconsistent with the fact that the F0 and spectral parameters extracted for model training always interact. A piecewise linear transform is introduced in this paper to explicitly model the dependency of the spectrum on F0. The results of our experiments show that the proposed method is able to improve the accuracy of spectral parameter prediction if the F0 features are predicted based on a reliable voicing decision.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
ABSTRACT: Posterior probability is commonly used for pronunciation evaluation. This paper introduces pronunciation space models, which replace traditional phone-based acoustic models in calculating the posterior probability and make the calculated posterior probability more precise. Pronunciation space models are constructed using an unsupervised clustering method guided by human scores and phone-level posterior probabilities. With the correlation between machine scores and human scores as the performance measure, the method based on pronunciation space models proves effective for pronunciation evaluation in experiments on a database of Chinese spoken by Koreans: the correlation improves from 0.390 to 0.415 compared with the traditional method based on phone-based acoustic models.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
ABSTRACT: This paper presents a comparison and evaluation of conventional maximum likelihood estimation based adaptation and different discriminative adaptation criteria. The performance of different LR and MAP adaptation methods is compared, and the strategies of first applying LR and then MAP under both MLE and DT criteria are evaluated. The effect of the amount of available adaptation data is also examined in our experiments. The experimental results on the 863 and Tsinghua Mandarin evaluation tasks suggest that first applying MWCE-LR and then MWCE-MAP achieves the best performance.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
ABSTRACT: In text-independent speaker verification, an unsupervised mode can improve system performance. In traditional systems, the speaker model is updated when a test utterance has a score higher than a particular threshold; we call this unsupervised model training. In this paper, an unsupervised score normalization is proposed. A Gaussian for the target speaker scores and a Gaussian for the impostor scores are set up as priors; the parameters of the impostor score model are updated using the test score. The test score is then normalized by the new impostor score model. When unsupervised score normalization, unsupervised model training and factor analysis are adopted in the NIST 2006 SRE core test, the EER of the system is 4.29%.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
ABSTRACT: In this paper, appropriate confidence measures (CMs) are investigated for Mandarin command word recognition in both the so-called target region and the non-target region. Here the target region refers to the part of the speech recognized as the command word, while the non-target region refers to the part recognized as silence. The results show that exploiting extra information in the non-target region can effectively complement traditional CMs, which usually focus on the target region. Furthermore, when the non-target region is analyzed in a more theoretical way, with the Bayesian information criterion (BIC) employed to locate a more precise boundary in the non-target region, even greater improvement is achieved. In two different Mandarin telephone command word tasks, a relative reduction in equal error rate (EER) of more than 20% is obtained.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
ABSTRACT: Tonal evaluation of Chinese continuous speech plays an important role in Mandarin Chinese pronunciation tests. In this paper, we introduce the Multi-Space Distribution Hidden Markov Model based on prosodic words. The results show that the tonal syllable error rate can be reduced. For non-standard Mandarin Chinese speech, the correlation between computer scores and expert scores improved by more than 3.0% absolute compared with the baseline system without a tonal pronunciation test.
Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
-
IEEE Transactions on Audio, Speech & Language Processing. 01/2009; 17:1171-1185.
-
Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, ICME 2009, June 28 - July 2, 2009, New York City, NY, USA; 01/2009
-
ABSTRACT: This paper presents a novel discriminative training criterion, minimum word classification error (MWCE). By localizing the conventional string-level MCE loss function to the word level, a more direct measure of empirical word classification error is approximated and minimized. Because the word-level criterion better matches performance evaluation criteria such as WER, improved word recognition performance can be achieved. We evaluated the MWCE criterion in a unified DT framework and compared it with other commonly used criteria, including MCE, MMI, MWE, and MPE. Experiments on the TIMIT and WSJ0 evaluation tasks suggest that the word-level MWCE criterion achieves consistently better results than string-level MCE. MWCE even outperforms other substring-level criteria, including MWE and MPE, on these two tasks.
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008 · 4.63 Impact Factor
-
ABSTRACT: Due to the inconsistency between maximum likelihood (ML) based training and the synthesis application in HMM-based speech synthesis, a minimum generation error (MGE) criterion was proposed for HMM training. This paper extends the MGE criterion to model adaptation for HMM-based speech synthesis. We propose an MGE linear regression (MGELR) based model adaptation algorithm, in which the regression matrices used to transform source models to target models are optimized to minimize the generation errors for the input speech data uttered by the target speaker. The proposed MGELR approach was compared with maximum likelihood linear regression (MLLR) based model adaptation. Experimental results indicate that the generation errors were reduced after MGELR-based model adaptation. In the subjective listening test, the discrimination and the quality of the speech synthesized using MGELR were better than those obtained using MLLR.
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008 · 4.63 Impact Factor
-
ABSTRACT: This paper presents a minimum unit selection error (MUSE) training method for an HMM-based unit selection speech synthesis system, which selects the optimal phone-sized unit sequence from the speech database by maximizing the combined likelihood of a group of trained HMMs. Under the MUSE criterion, the weights and distribution parameters of these HMMs are estimated to minimize the number of units that differ between the selected phone sequences and the natural phone sequences for the training sentences. The optimization is realized by discriminative training using the generalized probabilistic descent (GPD) algorithm. The results of our experiment show that the proposed method improves on the performance of the baseline system, in which model weights are set manually and distribution parameters are trained under the maximum likelihood criterion.
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008 · 4.63 Impact Factor
-
ABSTRACT: Recently, we proposed a novel optimization algorithm called constrained line search (CLS) to train the Gaussian mean vectors of HMMs in the MMI sense. In this paper, we extend and reformulate it in a more general framework. The new CLS can optimize any discriminative objective function, including MMI, MCE, and MPE/MWE. In addition, closed-form solutions for updating all Gaussian mixture parameters, including means, covariances and mixture weights, are obtained. We investigate the new CLS on several benchmark speech recognition databases, including TIDIGITS, Switchboard mini-train and the Switchboard full h5train00 set. Experimental results show that the new CLS optimization method outperforms the conventional EBW method in both performance and convergence behavior.
Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on; 01/2008
-
ABSTRACT: We extend our previous work on soft margin estimation (SME) to large vocabulary continuous speech recognition in two respects. The first is to use the extended Baum-Welch method in place of the conventional generalized probabilistic descent algorithm for optimization. The second is to compare SME with minimum classification error (MCE) training under the same implementation details, in order to show that it is indeed the margin component in the objective function, together with margin-based utterance and frame selection, that contributes to the success of SME. Tested on the 5k-word Wall Street Journal task, all the SME methods work better than MCE. The best SME approach achieves a relative word error rate reduction of about 19% over our best baseline performance. This enhancement can only be demonstrated because of our use of a margin-based objective function and the extended Baum-Welch parameter optimization method.
Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on; 01/2008
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, March 30 - April 4, 2008, Caesars Palace, Las Vegas, Nevada, USA; 01/2008
-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, March 30 - April 4, 2008, Caesars Palace, Las Vegas, Nevada, USA; 01/2008