Conference Paper

Hidden semi-Markov model based speech synthesis.

Conference: INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004
Source: DBLP
  • Source
    ABSTRACT: This paper applies a dynamic sinusoidal synthesis model to statistical parametric speech synthesis (HTS). For this, we utilise regularised cepstral coefficients to represent both the static amplitude and dynamic slope of selected sinusoids for statistical modelling. During synthesis, a dynamic sinusoidal model is used to reconstruct speech. A preference test is conducted to compare the selection of different sinusoids for cepstral representation. Our results show that when integrated with HTS, a relatively small number of sinusoids selected according to a perceptual criterion can produce quality comparable to using all harmonics. A Mean Opinion Score (MOS) test shows that our proposed statistical system is preferred to one using mel-cepstra from pitch synchronous spectral analysis.
    INTERSPEECH 2014; 09/2014
  • Source
    ABSTRACT: The generation of realistic observation sequences from HMM state distributions has been successfully applied to the problem of speech synthesis. This approach features unprecedented qualities, from automated synthetic voice construction, voice conversion and language independence, to speaking style variability and emotional expression. However, output quality from these synthesis systems does not yet meet the standards set by state-of-the-art unit selection synthesis. This paper aims to provide insight into factors causing degradation of speech quality. An alternative voice coding scheme based on the sinusoidal model is investigated and a modified voice construction procedure outlined.
  • Source
    ABSTRACT: Research in speech synthesis has made great progress recently, perhaps motivated by its numerous applications, of which text-to-speech converters and dialog systems are examples. Several improvements have been reported in the technical literature related to existing state-of-the-art techniques as well as in the development of new ideas related to the alteration of voice characteristics, with their eventual application to different languages. Nevertheless, in spite of the attention that the speech synthesis field has been receiving, the technique which employs unit selection and concatenation of waveform segments remains the most popular approach among those available today. In this paper, we report how a synthesizer for the Brazilian Portuguese language was constructed according to a technique in which the speech waveform is generated through parameters directly determined from Hidden Markov Models. When compared with systems based on unit selection and concatenation, the proposed synthesizer presents the advantage of being trainable, with the utilization of contextual factors including information related to different levels of the following acoustic units: phones, syllables, words, phrases and utterances. Such information is brought into effect through a set of questions for context-clustering. Thus, both the spectral and the prosodic characteristics of the system are managed by decision trees generated for each one of the following parameters: mel-cepstral coefficients, fundamental frequency and state durations. As a typical characteristic of the technique based on Hidden Markov Models, synthesized speech with quality comparable to commercial applications built under the unit selection and concatenation approach can be obtained even from a database as small as eighteen minutes of speech.
    This was tested by a subjective comparison of samples from the synthesizer in question and other systems currently available for Brazilian Portuguese.
    Keywords: speech processing, text-to-speech (TTS) systems, speech synthesis, Hidden Markov Models (HMM).
    R. Maia is with National Institute of Information and Communications Technology (NiCT), and ATR Spoken Language Communication Laboratories (ATR-SLC), Kyoto, Japan. H. Zen, K. Tokuda and T. Kitamura are with