Conference Paper

Hidden Semi-Markov Model Based Speech Synthesis

Conference: INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004
Source: DBLP


In the present paper, a hidden semi-Markov model (HSMM) based speech synthesis system is proposed. In the hidden Markov model (HMM) based speech synthesis system we have previously proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesize speech, the system constructs a sentence HMM corresponding to an arbitrarily given text, determines the state durations that maximize their probabilities, and then generates a speech parameter vector sequence for the resulting state sequence. However, there is an inconsistency: although speech is synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.
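
As a rough illustration of the duration-control step described above (a minimal Python sketch under our own naming, not code from the paper): with each state's duration modeled by a single Gaussian, the unconstrained maximum of each duration probability is simply the mean, and the well-known closed-form rule for hitting a target utterance length T shifts each duration by a common factor rho:

```python
import numpy as np

def determine_state_durations(means, variances, total_frames=None):
    """Choose state durations from per-state Gaussian duration models.

    Without a length constraint, each Gaussian's mode (its mean) maximizes
    the duration probability. Given a target utterance length T (in frames),
    the classic closed-form solution is
        d_k = m_k + rho * sigma_k^2,  rho = (T - sum(m_k)) / sum(sigma_k^2).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if total_frames is None:
        durations = means
    else:
        rho = (total_frames - means.sum()) / variances.sum()
        durations = means + rho * variances
    # Durations must be positive, whole frame counts.
    return np.maximum(1, np.round(durations).astype(int))

# Example: three states, target utterance length of 40 frames.
print(determine_state_durations([10.0, 8.0, 12.0], [4.0, 1.0, 9.0], total_frames=40))
# -> [13  9 18], which sums to 40
```

The speech parameter vector sequence is then generated for this fixed state sequence (in HTS, via the maximum-likelihood parameter generation algorithm).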

    • "For each of the three audio features, the models are clustered separately state-wise by means of decision-tree based context clustering using linguistically motivated questions on the phonetic, segmental , syllable, word and utterance levels. State durations are modeled explicitly rather than via state transition probabilities (HSMMs rather than HMMs [31]), and duration models are also clustered using a single decision-tree across all five states. The feature questions used for the clustering are based on the English question set in the EMIME system [26] with adaptations towards our German phone set. "
    ABSTRACT: This paper investigates joint speaker-dependent audiovisual Hidden Semi-Markov Models (HSMM) where the visual models produce a sequence of 3D motion tracking data that is used to animate a talking head and the acoustic models are used for speech synthesis. Different acoustic, visual, and joint audiovisual models for four different Austrian German speakers were trained and we show that the joint models perform better compared to other approaches in terms of synchronization quality of the synthesized visual speech. In addition, a detailed analysis of the acoustic and visual alignment is provided for the different models. Importantly, the joint audiovisual modeling does not decrease the acoustic synthetic speech quality compared to acoustic-only modeling so that there is a clear advantage in the common duration model of the joint audiovisual modeling approach that is used for synchronizing acoustic and visual parameter sequences. Finally, it provides a model that integrates the visual and acoustic speech dynamics.
    IEEE Journal of Selected Topics in Signal Processing 04/2014; 8(2):336-347. DOI:10.1109/JSTSP.2013.2281036
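
The clustering described in the snippet above is standard decision-tree state tying. As a minimal sketch (assumed names; real toolkits such as HTS use occupancy-weighted Gaussian statistics and a minimum-description-length stopping criterion), a single node split picks the yes/no context question with the largest log-likelihood gain:

```python
import numpy as np

def pooled_loglik(stats):
    """Log-likelihood of pooling the given models' data into one diagonal Gaussian.

    Each entry of stats is (frame_count, feature_sum, feature_sum_of_squares),
    with the sums taken per feature dimension.
    """
    n = sum(s[0] for s in stats)
    mean = sum(s[1] for s in stats) / n
    var = np.maximum(sum(s[2] for s in stats) / n - mean ** 2, 1e-6)
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1.0)

def best_question(models, questions):
    """models: context_name -> stats; questions: name -> predicate(context_name)."""
    parent = pooled_loglik(list(models.values()))
    best = None
    for qname, pred in questions.items():
        yes = [s for c, s in models.items() if pred(c)]
        no = [s for c, s in models.items() if not pred(c)]
        if not yes or not no:  # question must actually split the node
            continue
        gain = pooled_loglik(yes) + pooled_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (qname, gain)
    return best  # split recursively until the gain falls below a threshold
```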
    • "HSMMs are characterized by their ability to incorporate the explicit modelling of state durations not only in the synthesis phase as HMMs do, but also in the training phase of the HSMM-based speech synthesis systems improving the naturalness of synthetic speech [20] . In the following subsections four approaches for HSMM modelling of emotional speech are described. "
    ABSTRACT: This paper describes and evaluates four different HSMM (hidden semi-Markov model) training methods for HMM-based synthesis of emotional speech. The first method, emotion-dependent modelling, trains individual models for each emotion separately. In the second method, emotion adaptation modelling, a model is first trained on neutral speech and then adapted to each emotion in the database. The third method, the emotion-independent approach, is based on an average emotion model initially trained on data from all the emotions in the speech database; an adapted model is then built for each emotion. In the fourth method, emotion adaptive training, the average emotion model is trained with simultaneous normalization of the output and state duration distributions. To evaluate these training methods, a Modern Greek speech database consisting of four categories of emotional speech (anger, fear, joy, and sadness) was used. Finally, a subjective emotion recognition test was performed to measure and compare the ability of each of the four approaches to synthesize emotional speech. The evaluation results showed that emotion adaptive training achieved the highest emotion recognition rates among the four evaluated methods, across all four emotions in the database.
    03/2013; 05(04):23-29. DOI:10.5815/ijitcs.2013.04.03
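
The adaptation step in the second and third methods above maps an average (or neutral) model onto each emotion. Below is a minimal sketch of the affine mean transform that underlies MLLR-style adaptation, assuming per-state mean estimates are already available from the adaptation data; real HSMM adaptation (e.g. CSMAPLR) uses occupancy-weighted statistics and also transforms variances and the duration distributions:

```python
import numpy as np

def estimate_mean_transform(avg_means, target_means):
    """Least-squares fit of mu' = A mu + b from average-model means (N x D)
    to emotion-specific means (N x D); returns W = [A; b] of shape (D+1, D)."""
    X = np.hstack([avg_means, np.ones((len(avg_means), 1))])
    W, *_ = np.linalg.lstsq(X, target_means, rcond=None)
    return W

def apply_mean_transform(means, W):
    """Adapt every state mean with the shared affine transform."""
    X = np.hstack([means, np.ones((len(means), 1))])
    return X @ W
```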
    • "Fundamental frequency (F0) models were trained using HTS [3]. The F0 contour is a mixture of values in the voiced and unvoiced region of the speech signal. "
    ABSTRACT: This paper gives an overview of the UCD Blizzard Challenge 2011 entry. The entry is a unit selection synthesiser that uses hidden Markov models for prosodic modelling. The evaluation consisted of synthesising 2213 sentences from a high quality 15 hour dataset provided by Lessac Technologies. Results are analysed within the context of other systems and the future work for the system is discussed.
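
The voiced/unvoiced mixture mentioned in the snippet above is what HTS handles with multi-space probability distributions (MSD): a discrete space for unvoiced frames and a continuous Gaussian space over log-F0 for voiced ones. A minimal per-frame log-likelihood sketch (illustrative names, one Gaussian for the voiced space):

```python
import numpy as np

def msd_loglik(lf0, w_voiced, mean, var):
    """Log-likelihood of one frame under a two-space MSD over F0.

    lf0 is None for an unvoiced frame, else the frame's log-F0 value.
    w_voiced is the weight of the continuous (voiced) space.
    """
    if lf0 is None:                      # discrete, unvoiced space
        return np.log(1.0 - w_voiced)
    return (np.log(w_voiced)             # continuous, voiced space
            - 0.5 * (np.log(2 * np.pi * var) + (lf0 - mean) ** 2 / var))
```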