Ren-Hua Wang

University of Science and Technology of China, Hefei, Anhui, China


Publications (99) · 118.71 Total Impact

  • Source
    ABSTRACT: This paper proposes a unit-selection and waveform-concatenation speech synthesis system based on synthetic speech naturalness evaluation. A Support Vector Machine (SVM) and Log Likelihood Ratio (LLR) based synthetic speech naturalness evaluation system was introduced in our previous work. In this paper, the evaluation system is improved in three aspects. Finally, a unit-selection and waveform-concatenation speech synthesis system is built on top of the naturalness evaluation system: the optimum unit sequence is chosen by re-scoring the N-best paths. Subjective listening tests show that the proposed evaluation-based speech synthesis system significantly outperforms the traditional unit-selection speech synthesis system.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
  • Source
    Jun Du, Yu Hu, Li-Rong Dai, Ren-Hua Wang
    ABSTRACT: In this paper, we present a novel approach to relaxing the stereo-data constraint required by a series of algorithms for noise-robust speech recognition. As a demonstration with the SPLICE algorithm, we generate pseudo-clean features, using HMM-based speech synthesis, to replace the ideal clean features from one of the stereo channels. Experimental results on the Aurora2 database show that the performance of our approach is comparable with that of SPLICE. Further improvements are achieved by concatenating a bias adaptation algorithm to handle unknown environments. Relative word error rate reductions of 66% and 24% are achieved over the baseline systems in the clean-training and multi-training conditions, respectively.
    Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on; 04/2010
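The pseudo-clean features above stand in for the clean channel of stereo data in a SPLICE-style correction. As a minimal illustration (not the paper's HMM-synthesis step), SPLICE-style compensation learns one bias vector per feature-space region from paired noisy/clean frames and corrects each noisy frame with a posterior-weighted sum of biases; region labels and posteriors are assumed given here:

```python
import numpy as np

def train_splice_biases(noisy, clean, labels, n_regions):
    """Estimate one correction bias per feature-space region:
    r_k = mean over frames in region k of (clean - noisy)."""
    biases = np.zeros((n_regions, noisy.shape[1]))
    for k in range(n_regions):
        mask = labels == k
        if mask.any():
            biases[k] = (clean[mask] - noisy[mask]).mean(axis=0)
    return biases

def apply_splice(noisy, posteriors, biases):
    """Correct each noisy frame with a posterior-weighted sum of
    region biases: x_hat = y + sum_k p(k|y) * r_k."""
    return noisy + posteriors @ biases
```

With hard (one-hot) posteriors this reduces to adding the region's bias; soft posteriors from a front-end GMM interpolate between regions.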
  • Source
    INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010; 01/2010
  • Source
    ABSTRACT: This paper presents an investigation into ways of integrating articulatory features into hidden Markov model (HMM)-based parametric speech synthesis. In broad terms, this may be achieved by estimating the joint distribution of acoustic and articulatory features during training. This may in turn be used in conjunction with a maximum-likelihood criterion to produce acoustic synthesis parameters for generating speech. Within this broad approach, we explore several variations that are possible in the construction of an HMM-based synthesis system which allow articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. Performance is evaluated using the RMS error of generated acoustic parameters as well as formal listening tests. Our results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. Most significantly, however, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis systems more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of producing acoustic synthesis parameters.
    IEEE Transactions on Audio Speech and Language Processing 08/2009; 17:1171-1185. · 1.68 Impact Factor
  • Source
    ABSTRACT: This paper proposes a state duration modeling method using a full covariance matrix for HMM-based speech synthesis. In this method, a full covariance matrix, instead of the conventional diagonal one, is adopted in the multi-dimensional Gaussian distribution that models the state durations of each context-dependent phoneme. At the synthesis stage, state durations are predicted using the clustered context-dependent distributions with full covariance matrices. Experimental results show that speech synthesized with full-covariance state duration models is more natural than with the conventional method when the speaking rate of the synthesized speech is changed.
    Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on; 05/2009
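The benefit of the full covariance matrix can be seen in the Gaussian conditioning formula: with a diagonal covariance, fixing some state durations leaves the predictions for the others unchanged, whereas off-diagonal terms let them shift. A small sketch (illustrative of the conditioning step, not the paper's exact prediction procedure):

```python
import numpy as np

def conditional_durations(mu, sigma, obs_idx, obs_val):
    """Predict the remaining state durations given observed ones,
    using the Gaussian conditioning formula:
      mu_a|b = mu_a + S_ab S_bb^-1 (d_b - mu_b).
    With a diagonal covariance, S_ab = 0 and the prediction stays
    at mu_a; a full covariance lets observed durations shift it."""
    idx = np.arange(len(mu))
    rest = np.setdiff1d(idx, obs_idx)
    S_ab = sigma[np.ix_(rest, obs_idx)]
    S_bb = sigma[np.ix_(obs_idx, obs_idx)]
    return mu[rest] + S_ab @ np.linalg.solve(S_bb, obs_val - mu[obs_idx])
```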
  • Source
    ABSTRACT: This paper presents a comparison and evaluation of conventional maximum likelihood estimation (MLE) based adaptation and several discriminative adaptation criteria. The performance of different LR and MAP adaptation methods is compared, and strategies of first applying LR and then MAP under both MLE and discriminative training (DT) criteria are evaluated. The effect of the amount of data available for adaptation is also examined in our experiments. Results on the 863 and Tsinghua Mandarin evaluation tasks suggest that first applying MWCE-LR and then MWCE-MAP achieves the best performance.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
  • Source
    Wu Guo, Li-Rong Dai, Ren-Hua Wang
    ABSTRACT: In text-independent speaker verification, unsupervised operation can improve system performance. In traditional systems, the speaker model is updated when a test utterance scores above a particular threshold; we call this unsupervised model training. In this paper, an unsupervised score normalization is proposed. A target-speaker score Gaussian and an impostor score Gaussian are set up as priors; the parameters of the impostor score model are updated using each test score, and the test score is then normalized by the updated impostor score model. When unsupervised score normalization, unsupervised model training and factor analysis are adopted in the NIST 2006 SRE core test, the EER of the system is 4.29%.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
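A minimal sketch of the score-side idea, assuming the impostor score distribution is a single Gaussian whose mean is MAP-updated from incoming test scores (the relevance factor `tau` and the fixed variance are illustrative choices, not values from the paper):

```python
class UnsupervisedZNorm:
    """Z-norm with a MAP-updated impostor score mean.

    The prior mean mu0 (relevance tau) is pulled toward the running
    mean of observed test scores; each score is then normalized by
    the current impostor model."""

    def __init__(self, mu0, sigma0, tau=10.0):
        self.mu0, self.sigma, self.tau = mu0, sigma0, tau
        self.n, self.total = 0, 0.0

    def update(self, score):
        # MAP mean update: (tau * mu0 + sum of scores) / (tau + n)
        self.n += 1
        self.total += score

    def normalize(self, score):
        mu = (self.tau * self.mu0 + self.total) / (self.tau + self.n)
        return (score - mu) / self.sigma
```

With no updates the normalization falls back to the prior; as test scores accumulate, the impostor mean drifts toward their empirical mean.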
  • Source
    ABSTRACT: In this paper, appropriate confidence measures (CMs) are investigated for Mandarin command word recognition in both the so-called target region and the non-target region. Here the target region refers to the recognized speech part of the command word, while the non-target region refers to the recognized silence part. We show that exploiting extra information in the non-target region can effectively complement the traditional CM, which usually focuses on the target region. Furthermore, when the non-target region is analyzed in a more principled way, with the Bayesian information criterion (BIC) employed to locate a more precise boundary, even greater improvement is achieved. In two different Mandarin telephone command word tasks, a relative reduction in equal error rate (EER) of more than 20% is obtained.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
  • Source
    Yi-Qian Pan, Si Wei, Ren-Hua Wang
    ABSTRACT: Tonal evaluation of continuous speech plays an important role in Mandarin Chinese pronunciation testing. In this paper, we introduce a Multi-Space Distribution Hidden Markov Model based on prosodic words. The results show that the tonal syllable error rate can be reduced. For non-standard Mandarin speech, the correlation between computer scores and expert scores improved by more than 3.0% absolute compared with the baseline system without tonal pronunciation testing.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
  • Source
    Si Wei, Yi-Qian Pan, Guo-Ping Hu, Yu Hu, Ren-Hua Wang
    ABSTRACT: Posterior probability is widely used for pronunciation evaluation. This paper introduces pronunciation space models, in place of traditional phone-based acoustic models, to calculate a more precise posterior probability. Pronunciation space models are constructed by an unsupervised clustering method guided by human scores and phone-level posterior probability. Using the correlation between machine scores and human scores as the performance measure, the pronunciation-space-model-based method shows its effectiveness for pronunciation evaluation in experiments on a Chinese database spoken by Koreans, improving the correlation from 0.390 to 0.415 compared with the traditional method based on phone-based acoustic models.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
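Phone-level posterior probability, the quantity these models refine, is commonly computed as a frame-averaged log posterior of the target phone against all competing phone models (a GOP-style score). A generic sketch, assuming per-frame log-likelihoods are already available:

```python
import math

def phone_posterior_score(frame_loglikes, target):
    """Frame-averaged log posterior of the target phone:
      score = (1/T) * sum_t [ ll_t(target) - log sum_q exp(ll_t(q)) ]
    frame_loglikes: list of {phone: log-likelihood} dicts, one per frame.
    Scores near 0 indicate the target dominates; large negative
    scores suggest a mispronunciation."""
    total = 0.0
    for ll in frame_loglikes:
        m = max(ll.values())
        # log-sum-exp with max-shift for numerical stability
        denom = m + math.log(sum(math.exp(v - m) for v in ll.values()))
        total += ll[target] - denom
    return total / len(frame_loglikes)
```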
  • Source
    Zhen-Hua Ling, Wei Zhang, Ren-Hua Wang
    ABSTRACT: This paper presents a method for modeling the dependency between F0 and spectral features in an HMM-based parametric speech synthesis system. In conventional systems these two features are modeled as two independent streams, which is inconsistent with the fact that there is always interaction between the F0 and spectral parameters extracted for model training. A piecewise linear transform is introduced in this paper to explicitly model the dependency of the spectrum on F0. Our experimental results show that the proposed method improves the accuracy of spectral parameter prediction when the F0 features are predicted from a reliable voicing decision.
    Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on; 01/2009
  • Wu GUO, Yi-Jie LI, Li-Rong DAI, Ren-Hua WANG
    Acta Automatica Sinica. 01/2009; 35(9):1193-1198.
  • Yan Song, Li-Rong Dai, Ren-Hua Wang
    ABSTRACT: Gaussian mixture models (GMMs) have become one of the standard acoustic approaches for language identification. Furthermore, the GMM-SVM approach has proven to work well by introducing a discriminative method into GMM-based acoustic systems. In these systems, intersession variability within a language has become an important adverse factor degrading performance. To tackle this problem, we propose a subspace analysis method, termed Intra-language Difference Subspace Estimation (IDSE), under the GMM-SVM framework. In IDSE, the difference vector is modeled with three components: extra-language difference, intra-language difference and noise difference. The intra-language and noise differences are then estimated and eliminated from the difference vector. Experiments on the NIST 2007 evaluation tasks show the effectiveness of the proposed method.
    Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, ICME 2009, June 28 - July 2, 2009, New York City, NY, USA; 01/2009
  • Si Wei, Guoping Hu, Yu Hu, Ren-Hua Wang
    ABSTRACT: This paper presents two new ideas for text dependent mispronunciation detection. Firstly, mispronunciation detection is formulated as a classification problem to integrate various predictive features. A Support Vector Machine (SVM) is used as the classifier and the log-likelihood ratios between all the acoustic models and the model corresponding to the given text are employed as features for the classifier. Secondly, Pronunciation Space Models (PSMs) are proposed to enhance the discriminative capability of the acoustic models for pronunciation variations. In PSMs, each phone is modeled with several parallel acoustic models to represent pronunciation variations of that phone at different proficiency levels, and an unsupervised method is proposed for the construction of the PSMs. Experiments on a database consisting of more than 500,000 Mandarin syllables collected from 1335 Chinese speakers show that the proposed methods can significantly outperform the traditional posterior probability based method. The overall recall rates for the 13 most frequently mispronounced phones increase from 17.2%, 7.6% and 0% to 58.3%, 44.3% and 29.5% at three precision levels of 60%, 70% and 80%, respectively. The improvement is also demonstrated by a subjective experiment with 30 subjects, in which 53.3% of the subjects think the proposed method is better than the traditional one and 23.3% of them think that the two methods are comparable.
    Speech Communication. 01/2009;
  • Source
    ABSTRACT: This paper introduces USTC's speech synthesis system for the Blizzard Challenge 2009. USTC entered all English tasks, including the hub tasks and the spoke tasks. According to the conditions of the different tasks, different versions of HMM-based unit-selection systems were constructed based on the USTC Blizzard Challenge 2008 system, employing many new techniques. Results of internal experiments comparing these techniques are presented and analyzed. The Blizzard Challenge 2009 evaluation results show that our system performs well in the naturalness, similarity and intelligibility of the synthetic speech.
    01/2009;
  • ABSTRACT: In text-independent speaker verification research, information from previous trials can be used to update the speaker models or the test scores dynamically. This process, defined as the unsupervised mode, couples the trials with the speaker models and is very useful for real speaker recognition applications. In this paper, a score-based unsupervised adaptation is proposed alongside model-based unsupervised adaptation. In the score-based mode, a bi-Gaussian model is introduced as a prior score distribution, and the MAP (maximum a posteriori) method is adopted to adjust the parameters of the score normalization. In testing, unsupervised score adaptation and unsupervised model adaptation both improve performance. On the NIST SRE 2006 1conv4w-1conv4w corpus, the equal error rate (EER) of the proposed system is 4.3% and the minimum detection cost function (minDCF) is 0.021.
    ACTA AUTOMATICA SINICA 01/2009; 35(3):267-271.
  • Zhi-Jie Yan, Bo Zhu, Yu Hu, Ren-Hua Wang
    ABSTRACT: This paper presents a novel discriminative training criterion, minimum word classification error (MWCE). By localizing the conventional string-level MCE loss function to the word level, a more direct measure of empirical word classification error is approximated and minimized. Because the word-level criterion better matches performance evaluation metrics such as WER, improved word recognition performance can be achieved. We evaluated and compared the MWCE criterion in a unified discriminative training framework against other commonly used criteria, including MCE, MMI, MWE, and MPE. Experiments on the TIMIT and WSJ0 evaluation tasks suggest that the word-level MWCE criterion achieves consistently better results than string-level MCE; MWCE even outperforms the other substring-level criteria, MWE and MPE, on these two tasks.
    Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008
  • Source
    ABSTRACT: Due to the inconsistency between maximum likelihood (ML) based training and the synthesis application in HMM-based speech synthesis, a minimum generation error (MGE) criterion was previously proposed for HMM training. This paper applies the MGE criterion to model adaptation for HMM-based speech synthesis. We propose an MGE linear regression (MGELR) based model adaptation algorithm, in which the regression matrices used to transform source models into target models are optimized to minimize the generation errors for the input speech data uttered by the target speaker. The proposed MGELR approach was compared with maximum likelihood linear regression (MLLR) based model adaptation. Experimental results indicate that the generation errors were reduced after MGELR-based adaptation, and subjective listening tests show that the discrimination and quality of the speech synthesized with MGELR were better than with MLLR.
    Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008
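The shape of the MGELR objective can be illustrated with a closed-form stand-in: an affine transform of source model means fitted by least squares to minimize squared generation error against target-speaker data. The paper optimizes the true generation error of the full synthesis pipeline iteratively; this is only a sketch of the regression step:

```python
import numpy as np

def mgelr_transform(source_means, target_frames):
    """Fit an affine transform [A; b] minimizing the squared error
    ||target - (A @ mu + b)||^2 over paired (mean, target) data.
    Returns W of shape (dim+1, dim): first dim rows are A^T, last is b."""
    X = np.hstack([source_means, np.ones((len(source_means), 1))])
    W, *_ = np.linalg.lstsq(X, target_frames, rcond=None)
    return W

def apply_transform(W, mu):
    """Transform one source mean vector into the target space."""
    return np.hstack([mu, 1.0]) @ W
```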
  • Zhen-Hua Ling, Ren-Hua Wang
    ABSTRACT: This paper presents a minimum unit selection error (MUSE) training method for an HMM-based unit selection speech synthesis system, which selects the optimal phone-sized unit sequence from the speech database by maximizing the combined likelihood of a group of trained HMMs. Under the MUSE criterion, the weights and distribution parameters of these HMMs are estimated to minimize the number of units that differ between the selected phone sequences and the natural phone sequences of the training sentences. The optimization is realized by discriminative training using the generalized probabilistic descent (GPD) algorithm. Experimental results show that the proposed method improves on the baseline system, in which model weights are set manually and distribution parameters are trained under the maximum likelihood criterion.
    Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008
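One GPD step of the kind used for MUSE-style weight training can be sketched as follows: the misselection measure is the weighted cost of the correct unit minus that of a competing unit, a sigmoid smooths the 0/1 selection error, and the weights move down the gradient of the smoothed loss. The cost-feature vectors here are hypothetical, not from the paper:

```python
import math

def gpd_step(weights, feats_correct, feats_competitor, eps=0.1, alpha=1.0):
    """One generalized-probabilistic-descent step on selection weights.
    d = w . (f_correct - f_competitor) is the misselection measure;
    the smoothed loss l = sigmoid(alpha * d) is driven down so the
    correct unit's weighted cost falls below the competitor's."""
    d = sum(w * (fc - fw) for w, fc, fw in
            zip(weights, feats_correct, feats_competitor))
    l = 1.0 / (1.0 + math.exp(-alpha * d))
    grad_scale = alpha * l * (1.0 - l)   # d(sigmoid)/dd
    return [w - eps * grad_scale * (fc - fw)
            for w, fc, fw in zip(weights, feats_correct, feats_competitor)]
```

Iterating the step on training pairs lowers the smoothed selection-error count, which is the MUSE objective in miniature.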

Publication Stats

459 Citations
118.71 Total Impact Points

Institutions

  • 1999–2011
    • University of Science and Technology of China
      • Department of Electronic Engineering and Information Science
      Hefei, Anhui, China
  • 2008
    • Carnegie Mellon University
      • Language Technologies Institute
      Pittsburgh, PA, United States
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
  • 1996–2008
    • Hefei University of Technology
      Hefei, Anhui, China