Chinese Prosody Structure Prediction Based on Conditional Random Fields.
DOI: 10.1109/ICNC.2009.44 Conference: Fifth International Conference on Natural Computation, ICNC 2009, Tianjin, China, 14-16 August 2009, 6 Volumes
In this paper, a novel statistical method based on Conditional Random Fields (CRF) is proposed for hierarchical prosody structure prediction, which is a key module in speech synthesis systems. We discuss in detail how to build prosody models for Mandarin Chinese using Conditional Random Fields, including corpus preparation, feature selection, feature template design, model training and evaluation. The new method is compared with the classical decision-tree-based method. The experimental results show that the CRF-based method can significantly improve overall performance with the same feature set.
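
As a rough illustration of what feature template design for a CRF tagger can look like, the sketch below expands a small context window around each lexical word into named features. The abstract does not list the features or templates the authors used, so everything here is an assumption made for illustration: the function name `token_features`, the choice of word identity, part-of-speech tag and word length as features, the one-word window, and the `<PAD>` marker.

```python
from typing import Dict, List, Tuple


def token_features(sent: List[Tuple[str, str]], i: int, window: int = 1) -> Dict[str, str]:
    """Build a feature dict for the word at position i.

    sent is a list of (word, pos_tag) pairs. Feature names and the
    window size are hypothetical, not taken from the paper.
    """
    feats = {"bias": "1"}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(sent):
            word, pos = sent[j]
            feats[f"w[{off}]"] = word              # word identity
            feats[f"pos[{off}]"] = pos             # part-of-speech tag
            feats[f"len[{off}]"] = str(len(word))  # length in characters
        else:
            feats[f"w[{off}]"] = "<PAD>"           # position outside the sentence
    # One bigram template: current POS combined with the next POS.
    if i + 1 < len(sent):
        feats["pos[0]|pos[1]"] = sent[i][1] + "|" + sent[i + 1][1]
    return feats


if __name__ == "__main__":
    example = [("我", "r"), ("喜欢", "v"), ("音乐", "n")]
    for idx in range(len(example)):
        print(token_features(example, idx))
```

Feature dicts of this shape, one per word, are the kind of input a typical CRF toolkit consumes along with one boundary label per word. Keeping the templates in a single function makes it straightforward to feed the same feature set to a different classifier for comparison.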
Available from: Lei Xie
- "Specifically, perception of prosodic boundaries is essential for listeners. In Chinese speech synthesis systems, typical prosody boundary labels consist of prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH), which construct a three-layer prosody structure tree, as shown in Fig. 1. The leaf nodes of tree structure are lexical words that can be derived from a lexical-based word segmentation module."
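
The three-layer structure described in this excerpt can be sketched as nested lists built from per-word boundary labels. This is a minimal sketch under stated assumptions, not the authors' representation: the function name `build_prosody_tree` is hypothetical, each word is assumed to carry the strongest boundary that follows it (`None` for no boundary), a higher-level boundary is assumed to close every lower level, and the end of the utterance is treated as an IPH boundary.

```python
from typing import List, Optional

# Boundary ranks: a higher-level boundary also closes every lower level.
RANK = {None: 0, "PW": 1, "PPH": 2, "IPH": 3}


def build_prosody_tree(words: List[str], boundaries: List[Optional[str]]) -> list:
    """Group lexical words into nested lists: IPH -> PPH -> PW -> words.

    boundaries[i] is the strongest boundary after words[i] (None = none).
    """
    if len(words) != len(boundaries):
        raise ValueError("words and boundaries must have the same length")
    tree, pph_list, pw_list, pw = [], [], [], []
    last = len(words) - 1
    for i, (word, label) in enumerate(zip(words, boundaries)):
        pw.append(word)
        # The end of the utterance closes all open levels.
        rank = 3 if i == last else RANK[label]
        if rank >= 1:  # close the current prosodic word
            pw_list.append(pw)
            pw = []
        if rank >= 2:  # close the current prosodic phrase
            pph_list.append(pw_list)
            pw_list = []
        if rank >= 3:  # close the current intonational phrase
            tree.append(pph_list)
            pph_list = []
    return tree


if __name__ == "__main__":
    words = ["w1", "w2", "w3", "w4", "w5"]
    labels = [None, "PW", "PPH", "PW", "IPH"]
    print(build_prosody_tree(words, labels))
```

Under these assumptions, predicting the tree reduces to predicting one label per lexical word, which is what allows boundary prediction to be treated as a sequence labeling task.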
ABSTRACT: Prosody affects the naturalness and intelligibility of speech. However, automatic prosody prediction from text for Chinese speech synthesis is still a great challenge, and the traditional conditional random fields (CRF) based method relies heavily on feature engineering. In this paper, we propose to use neural networks to predict prosodic boundary labels directly from Chinese characters without any feature engineering. Experimental results show that stacking feed-forward and bidirectional long short-term memory (BLSTM) recurrent network layers achieves superior performance over the CRF-based method. The embedding features learned from raw text further enhance the performance.
2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), Scottsdale, Arizona, USA; 12/2015