ABSTRACT: This paper proposes a unit-selection and waveform-concatenation speech synthesis system based on synthetic speech naturalness evaluation. A Support Vector Machine (SVM) and Log Likelihood Ratio (LLR) based naturalness evaluation system for synthetic speech was introduced in our previous work. In this paper, the evaluation system is improved in three aspects. Finally, a unit-selection and waveform-concatenation speech synthesis system is built on the basis of the naturalness evaluation system, with the optimum unit sequence chosen by re-scoring the N-best paths. Subjective listening tests show that the proposed evaluation-based speech synthesis system significantly outperforms the traditional unit-selection speech synthesis system.
ABSTRACT: This paper introduces the speech synthesis system developed by USTC for Blizzard Challenge 2008. Two synthetic voices are built from the released UK English database using the HMM-based unit selection synthesis method, a hybrid of statistical parametric synthesis and unit-selection techniques. In this method, the optimal sequence of phone-sized candidate units is selected from the database following statistical criteria derived from a set of trained HMMs for different acoustic features. The waveforms of the selected units are then concatenated to generate the synthesized speech. The evaluation results of Blizzard Challenge 2008 show that our system performs well on similarity, naturalness and intelligibility for both English voices.
ABSTRACT: In this paper, we present a novel approach to relaxing the stereo-data constraint required by a series of algorithms for noise-robust speech recognition. As a demonstration on the SPLICE algorithm, we use HMM-based speech synthesis to generate pseudo-clean features that replace the ideal clean features from one of the stereo channels. Experimental results on the Aurora2 database show that the performance of our approach is comparable to that of SPLICE. Further improvements are achieved by concatenating a bias adaptation algorithm to handle unknown environments. Relative word error rate reductions of 66% and 24% are achieved over the baseline systems in the clean-training and multi-training conditions, respectively.
ABSTRACT: This paper proposes a method to automatically detect errors in the synthetic speech of a unit-selection speech synthesis system using the log likelihood ratio and a support vector machine (SVM). For SVM training, a set of synthetic utterances is first generated by a given speech synthesis system, and their synthesis errors are labeled by manually annotating the segments that sound unnatural. Two context-dependent acoustic models are then trained on the natural and unnatural segments of the labeled synthetic speech, respectively. The log likelihood ratio of acoustic features between these two models is used to train the SVM classifier for error detection. Experimental results show that the proposed method is effective in detecting pitch-contour errors within a word for a Mandarin speech synthesis system, and that the SVM using log likelihood ratios between context-dependent acoustic models outperforms an SVM classifier trained directly on acoustic features.
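The detection pipeline described in this abstract can be sketched end to end. In the minimal illustration below, simple GMMs stand in for the paper's context-dependent acoustic models, and the features, dimensions and segment lengths are all invented for the example, not taken from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 4-dim "acoustic features": natural frames cluster near 0, unnatural near 2.
natural_frames = rng.normal(0.0, 1.0, size=(500, 4))
unnatural_frames = rng.normal(2.0, 1.0, size=(500, 4))

# One acoustic model per class (GMMs standing in for context-dependent models).
gmm_nat = GaussianMixture(n_components=2, random_state=0).fit(natural_frames)
gmm_unn = GaussianMixture(n_components=2, random_state=0).fit(unnatural_frames)

def llr_feature(segment):
    """Mean per-frame log likelihood ratio (natural vs. unnatural) of a segment."""
    return np.mean(gmm_nat.score_samples(segment) - gmm_unn.score_samples(segment))

# Segment-level training data for the SVM: 20-frame segments, label 1 = natural.
segments = [rng.normal(m, 1.0, size=(20, 4)) for m in [0.0] * 30 + [2.0] * 30]
labels = [1] * 30 + [0] * 30
X = np.array([[llr_feature(s)] for s in segments])
clf = SVC(kernel="rbf").fit(X, labels)

# An unnatural-sounding test segment should be flagged as an error (class 0).
test_seg = rng.normal(2.0, 1.0, size=(20, 4))
print(clf.predict([[llr_feature(test_seg)]]))
```

The key point the abstract makes is visible here: the SVM sees a single, well-calibrated LLR feature rather than raw acoustic frames, which is what makes the classifier's job easy.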
ABSTRACT: Factor analysis models the speaker and session variability in Gaussian mixture models and is widely used in text-independent speaker recognition. Two issues arise when the loading matrices of the eigenvoice and eigenchannel are estimated jointly: first, the diagonal speaker (residual) matrix takes no effect; second, the channel factors cannot grow large. In this paper, the eigenvoice and diagonal loading matrices are calculated serially, and different eigenchannel matrices are assembled into one large channel loading matrix. The proposed algorithm improves performance: on the NIST Speaker Recognition Evaluation (SRE) 2008 core test corpus, the equal error rates (EERs) of the five sub-sessions were 3.3%, 5.1%, 5.0%, 5.3%, and 5.0%.
Article · Oct 2009 · ACTA AUTOMATICA SINICA
ABSTRACT: This paper presents two new ideas for text-dependent mispronunciation detection. First, mispronunciation detection is formulated as a classification problem to integrate various predictive features. A Support Vector Machine (SVM) is used as the classifier, and the log-likelihood ratios between all the acoustic models and the model corresponding to the given text are employed as features for the classifier. Second, Pronunciation Space Models (PSMs) are proposed to enhance the discriminative capability of the acoustic models for pronunciation variations. In PSMs, each phone is modeled with several parallel acoustic models representing its pronunciation variations at different proficiency levels, and an unsupervised method is proposed for constructing the PSMs. Experiments on a database of more than 500,000 Mandarin syllables collected from 1335 Chinese speakers show that the proposed methods significantly outperform the traditional posterior-probability-based method. The overall recall rates for the 13 most frequently mispronounced phones increase from 17.2%, 7.6% and 0% to 58.3%, 44.3% and 29.5% at precision levels of 60%, 70% and 80%, respectively. The improvement is also demonstrated by a subjective experiment with 30 subjects, in which 53.3% of the subjects judge the proposed method better than the traditional one and 23.3% judge the two methods comparable.
Article · Oct 2009 · Speech Communication
[Show abstract][Hide abstract] ABSTRACT: This paper presents an investigation into ways of integrating articulatory features into hidden Markov model (HMM)-based parametric speech synthesis. In broad terms, this may be achieved by estimating the joint distribution of acoustic and articulatory features during training. This may in turn be used in conjunction with a maximum-likelihood criterion to produce acoustic synthesis parameters for generating speech. Within this broad approach, we explore several variations that are possible in the construction of an HMM-based synthesis system which allow articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. Performance is evaluated using the RMS error of generated acoustic parameters as well as formal listening tests. Our results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. Most significantly, however, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis systems more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of producing acoustic synthesis parameters.
Article · Aug 2009 · IEEE Transactions on Audio Speech and Language Processing
ABSTRACT: Gaussian mixture models (GMMs) have become one of the standard acoustic approaches for language identification, and the GMM-SVM variant has proven to work well by introducing a discriminative method into GMM-based acoustic systems. In these systems, intersession variability within a language is an important adverse factor that degrades performance. To tackle this problem, we propose a subspace analysis method, termed Intra-language Difference Subspace Estimation (IDSE), under the GMM-SVM framework. In IDSE, the difference vector is modeled with three components: extra-language difference, intra-language difference and noise difference. The intra-language and noise differences are then estimated and eliminated from the difference vector. Experiments on the NIST 2007 evaluation tasks show the effectiveness of the proposed method.
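The abstract does not spell out IDSE's exact estimator, but the general recipe of estimating a within-class (intra-language) subspace and projecting it out of GMM supervectors can be sketched with a NAP-style projection. The data, dimensions, and single-direction nuisance subspace below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_per_lang = 10, 50

# Toy supervectors for two languages sharing one session "nuisance" direction.
nuisance = rng.normal(size=dim)
nuisance /= np.linalg.norm(nuisance)
lang_means = [rng.normal(size=dim) for _ in range(2)]
X, y = [], []
for lbl, m in enumerate(lang_means):
    for _ in range(n_per_lang):
        X.append(m + 3.0 * rng.normal() * nuisance + 0.1 * rng.normal(size=dim))
        y.append(lbl)
X, y = np.array(X), np.array(y)

# Estimate the intra-language difference subspace from within-class scatter.
W = np.zeros((dim, dim))
for lbl in (0, 1):
    Xc = X[y == lbl] - X[y == lbl].mean(axis=0)
    W += Xc.T @ Xc
eigvals, eigvecs = np.linalg.eigh(W)
U = eigvecs[:, -1:]                 # top within-class direction(s)

# Project the estimated nuisance subspace out of every supervector.
P = np.eye(dim) - U @ U.T
X_clean = X @ P.T

# The removed direction should align with the planted nuisance direction.
print(abs(float(U[:, 0] @ nuisance)))
```

After the projection, the cleaned vectors `X_clean` would be fed to the GMM-SVM back end in place of the raw supervectors.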
ABSTRACT: This paper proposes a state duration modeling method using a full covariance matrix for HMM-based speech synthesis. In this method, a full covariance matrix, instead of the conventional diagonal one, is adopted in the multi-dimensional Gaussian distribution that models the state durations of each context-dependent phoneme. At the synthesis stage, the state durations are predicted using the clustered context-dependent distributions with full covariance matrices. Experimental results show that speech synthesized with full-covariance state duration models is more natural than that of the conventional method when the speaking rate of the synthesized speech is changed.
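A small numerical sketch of why the off-diagonal terms matter: with a joint Gaussian over the state durations of one phone model, conditioning on a slower first state shifts the predicted durations of the remaining states through the covariance, which a diagonal model cannot do. The means and covariance below are toy values, not from the paper:

```python
import numpy as np

# Mean duration (in frames) for the 5 states of one toy phone model.
mu = np.array([4.0, 6.0, 8.0, 6.0, 4.0])
# Positively correlated durations: a slow rendition is slow in every state.
Sigma = 0.5 * np.eye(5) + 0.8 * np.ones((5, 5))

# Condition on state 0 lasting 6 frames (slower than its mean of 4).
obs_idx, rest = [0], [1, 2, 3, 4]
S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
S_ro = Sigma[np.ix_(rest, obs_idx)]
cond_mean = mu[rest] + S_ro @ np.linalg.inv(S_oo) @ (np.array([6.0]) - mu[obs_idx])

# Every remaining state's predicted duration increases above its prior mean;
# with a diagonal covariance, cond_mean would simply equal mu[rest].
print(cond_mean)
```

This conditional-Gaussian computation is the standard mechanism by which a full covariance lets a speaking-rate change in one state propagate to the others.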
ABSTRACT: In text-independent speaker verification research, information from previous trials can be used to update the speaker models or the test scores dynamically. This process is defined as the unsupervised mode, which couples the trials with the speaker models and is very useful for real speaker recognition applications. In this paper, a score-based unsupervised adaptation is proposed alongside model-based unsupervised adaptation. In the score-based mode, a bi-Gaussian model is introduced as a prior score distribution, and the MAP (maximum a posteriori) method is adopted to adjust the parameters of the score normalization. In testing, unsupervised score adaptation and unsupervised model adaptation both improve performance. On the NIST SRE 2006 1conv4w-1conv4w corpus, the equal error rate (EER) of the proposed system is 4.3% and the minimum detection cost function (minDCF) is 0.021.
Article · May 2009 · ACTA AUTOMATICA SINICA
ABSTRACT: Posterior probability is widely used for pronunciation evaluation. This paper introduces pronunciation space models, in place of traditional phone-based acoustic models, for calculating posterior probability, which makes the calculated posterior probability more precise. The pronunciation space models are constructed by an unsupervised clustering method guided by human scores and phone-level posterior probability. Using the correlation between machine scores and human scores as the performance measure, the pronunciation-space-model-based method proves effective for pronunciation evaluation in experiments on a Chinese database spoken by Koreans, improving the correlation from 0.390 to 0.415 over the traditional method based on phone-based acoustic models.
ABSTRACT: This paper presents a method in which the dependency between F0 and spectral features is modeled for an HMM-based parametric speech synthesis system. In conventional systems these two features are modeled as two independent streams, which is inconsistent with the fact that the extracted F0 and spectral parameters used for model training always interact. A piecewise linear transform is introduced to explicitly model the dependency of the spectrum on F0. Our experimental results show that the proposed method improves the accuracy of spectral parameter prediction when the F0 features are predicted from a reliable voicing decision.
ABSTRACT: Tonal evaluation of continuous speech plays an important role in Mandarin Chinese pronunciation testing. In this paper, we introduce a Multi-Space Distribution Hidden Markov Model based on prosodic words. The results show that the tonal syllable error rate can be reduced. For non-standard Mandarin speech, the correlation between computer scores and expert scores improved by more than 3.0% absolute compared with the baseline system without the tonal pronunciation test.
ABSTRACT: In this paper, appropriate confidence measures (CMs) are investigated for Mandarin command word recognition in both the so-called target region and the non-target region. Here the target region refers to the recognized speech part of the command word, while the non-target region refers to the recognized silence part. We show that exploiting extra information in the non-target region can effectively complement the traditional CM, which usually focuses on the target region. Furthermore, when the non-target region is analyzed in a more principled way, with the Bayesian information criterion (BIC) employed to locate a more precise boundary, even more improvement is achieved. On two different Mandarin telephone command word tasks, more than 20% relative reduction in equal error rate (EER) is obtained.
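The BIC-based boundary search can be illustrated with the standard single-Gaussian segmentation recipe: compare the likelihood of modeling a region as one segment against two segments split at each candidate boundary, penalized by the extra parameters. The 1-D features and penalty weight here are illustrative, not the paper's exact setup:

```python
import numpy as np

def delta_bic(x, t):
    """Delta-BIC for splitting 1-D sequence x at index t (Gaussian models).
    Positive values favour two segments, i.e. a boundary at t."""
    n = len(x)
    def nll(seg):                    # negative log likelihood up to constants
        return 0.5 * len(seg) * np.log(np.var(seg) + 1e-10)
    penalty = 0.5 * 2 * np.log(n)    # 2 extra parameters: one mean, one variance
    return nll(x) - (nll(x[:t]) + nll(x[t:])) - penalty

rng = np.random.default_rng(2)
# Toy sequence with a statistical change at index 100.
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 80)])

# Scan candidate boundaries and keep the Delta-BIC maximiser.
cands = range(20, len(x) - 20)
best = max(cands, key=lambda t: delta_bic(x, t))
print(best)   # should land near the true change point at 100
```

In the paper's setting, the role of `x` would be played by frame-level features of the recognized non-target region, with the BIC maximiser giving the refined speech/silence boundary.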
ABSTRACT: In text-independent speaker verification, the unsupervised mode can improve system performance. In traditional systems, the speaker model is updated when a test utterance scores above a particular threshold; we call this unsupervised model training. In this paper, an unsupervised score normalization is proposed instead. A target speaker score Gaussian and an impostor score Gaussian are set up as priors, the parameters of the impostor score model are updated using each test score, and the test score is then normalized by the updated impostor score model. When unsupervised score normalization, unsupervised model training and factor analysis are all adopted on the NIST 2006 SRE core test, the EER of the system is 4.29%.
ABSTRACT: This paper presents a comparison and evaluation of conventional maximum likelihood estimation (MLE) based adaptation and different discriminative adaptation criteria. The performance of different linear regression (LR) and MAP adaptation methods is compared, and strategies that first apply LR and then MAP under both MLE and discriminative training (DT) criteria are evaluated. The effect of the amount of data available for adaptation is also examined in our experiments. Experimental results on the 863 and Tsinghua Mandarin evaluation tasks suggest that first applying MWCE-LR and then MWCE-MAP achieves the best performance.
ABSTRACT: This paper introduces USTC's speech synthesis system for Blizzard Challenge 2009. USTC participated in all English tasks, including the hub tasks and the spoke tasks. According to the conditions of the different tasks, different versions of HMM-based unit-selection systems are constructed on top of the USTC Blizzard Challenge 2008 system, and many new techniques are employed in their construction. Results of internal experiments comparing these techniques are shown and analyzed. The evaluation results of Blizzard Challenge 2009 show that our system achieves good naturalness, similarity and intelligibility of the synthetic speech.
ABSTRACT: To address the issues of maximum likelihood (ML) based HMM training for HMM-based speech synthesis, a minimum generation error (MGE) criterion was previously proposed. This paper applies the MGE criterion to model adaptation for HMM-based speech synthesis. We introduce an MGE linear regression (MGELR) based model adaptation algorithm, in which the transforms from source HMMs to target HMMs are optimized to minimize the generation errors on the adaptation data of the target speaker. The regression matrices for both the mean vector and the covariance matrix of each Gaussian distribution are re-estimated. The proposed MGELR approach was compared with maximum likelihood linear regression (MLLR) based model adaptation. Experimental results indicate that generation errors were reduced after MGELR-based model adaptation, and in the subjective listening test, the speaker similarity and the quality of the synthesized speech using MGELR were better than those using MLLR.
ABSTRACT: Due to the inconsistency between maximum likelihood (ML) based training and the synthesis application in HMM-based speech synthesis, a minimum generation error (MGE) criterion was previously proposed for HMM training. This paper applies the MGE criterion to model adaptation for HMM-based speech synthesis. We propose an MGE linear regression (MGELR) based model adaptation algorithm, in which the regression matrices used to transform source models into target models are optimized to minimize the generation errors on the speech data uttered by the target speaker. The proposed MGELR approach was compared with maximum likelihood linear regression (MLLR) based model adaptation. Experimental results indicate that generation errors were reduced after MGELR-based model adaptation, and in the subjective listening test, the discrimination and the quality of the synthesized speech using MGELR were better than those using MLLR.
ABSTRACT: This paper presents a novel discriminative training criterion, minimum word classification error (MWCE). By localizing the conventional string-level MCE loss function to the word level, a more direct measure of empirical word classification error is approximated and minimized. Because the word-level criterion better matches performance evaluation metrics such as word error rate (WER), improved word recognition performance can be achieved. We evaluated and compared the MWCE criterion in a unified discriminative training framework against other commonly used criteria, including MCE, MMI, MWE, and MPE. Experiments on the TIMIT and WSJ0 evaluation tasks suggest that the word-level MWCE criterion consistently achieves better results than string-level MCE, and MWCE even outperforms the other substring-level criteria, MWE and MPE, on these two tasks.