Fig 4 - uploaded by Rui Liu
A sample of the Mongolian text analysis process.


Source publication
Chapter
Full-text available
Recently, the Deep Neural Network (DNN), a feed-forward artificial neural network with many hidden layers, has opened a new research direction for speech synthesis. It can represent high-dimensional and correlated features efficiently and model highly complex mapping functions compactly. However, the research on DNN-based Mongolian speech synthesis...
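To make the mapping concrete, here is a minimal sketch of such a feed-forward DNN acoustic model in PyTorch; the layer sizes, feature dimensions, and activation are illustrative assumptions, not details from the paper:

```python
# Minimal sketch of a feed-forward DNN acoustic model for statistical
# parametric speech synthesis: frame-level linguistic features in,
# acoustic features out. All dimensions are assumptions.
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=425, acoustic_dim=187, hidden=1024, layers=4):
        super().__init__()
        blocks, in_dim = [], linguistic_dim
        for _ in range(layers):                         # stack of hidden layers
            blocks += [nn.Linear(in_dim, hidden), nn.Tanh()]
            in_dim = hidden
        blocks.append(nn.Linear(in_dim, acoustic_dim))  # linear output layer
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                               # x: (frames, linguistic_dim)
        return self.net(x)                              # (frames, acoustic_dim)

model = DNNAcousticModel()
frames = torch.randn(100, 425)                          # dummy frame-level input
print(model(frames).shape)                              # torch.Size([100, 187])
```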

Context in source publication

Context 1
... Taking the Mongolian sentence " " (meaning: I am a college student) as an example, Figure 4 shows the above process. ...

Similar publications

Conference Paper
Full-text available
Singing voice conversion is the task of converting a song sung by a source singer to the voice of a target singer. In this paper, we propose using a parallel-data-free, many-to-one voice conversion technique on singing voices. A phonetic posterior feature is first generated by decoding singing voices through a robust automatic speech recognition engine...
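A rough sketch of that pipeline follows, with placeholder modules standing in for the pretrained ASR engine and the conversion model; the phone-set size and feature dimensions are assumptions:

```python
# Sketch of a PPG-based, parallel-data-free voice conversion pipeline,
# as outlined in the abstract. Module internals are placeholders.
import torch
import torch.nn as nn

N_PHONES, MEL_DIM = 72, 80    # assumed phone-set size / mel dimension

asr_encoder = nn.Sequential(  # stand-in for a pretrained ASR acoustic model
    nn.Linear(MEL_DIM, 256), nn.ReLU(), nn.Linear(256, N_PHONES))
conversion = nn.Sequential(   # many-to-one mapping: PPG -> target features
    nn.Linear(N_PHONES, 256), nn.ReLU(), nn.Linear(256, MEL_DIM))

source_mels = torch.randn(1, 200, MEL_DIM)          # source singer frames
ppg = asr_encoder(source_mels).softmax(dim=-1)      # phonetic posteriorgram
target_mels = conversion(ppg)                       # target singer features
print(target_mels.shape)                            # torch.Size([1, 200, 80])
```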

Citations

... Rui Liu et al. introduced a new method to segment Mongolian words into stems and suffixes, which greatly improved the performance of the Mongolian rhyming phrase prediction system [30]. Shortly afterwards, Rui Liu proposed a DNN-based Mongolian speech synthesis system, which performs better than the traditional HMM [31]. In addition, he introduced the Bidirectional Long Short-Term Memory (BiLSTM) model to improve the phrase break prediction step in the traditional speech synthesis pipeline, making it more applicable to Mongolian [32]. ...
Article
Full-text available
Low-resource text-to-speech synthesis is a very promising research direction. Mongolian is the official language of the Inner Mongolia Autonomous Region and is spoken by more than 10 million people worldwide. As a representative low-resource language, Mongolian has a relative lack of open-source datasets for TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for related researchers. In this work, we invited three Mongolian announcers to record topic-rich speeches. Each announcer recorded 10 h of Mongolian speech, and the whole dataset totals 30 h. In addition, we built two baseline systems based on state-of-the-art neural architectures: a multi-speaker FastSpeech 2 model with a HiFi-GAN vocoder, and a fully end-to-end multi-speaker VITS model. On the FastSpeech 2 + HiFi-GAN system, the three speakers scored 4.0 or higher on both naturalness and speaker similarity evaluations; on the VITS model, they achieved scores of 4.5 or higher on both. The experimental results show that the published MnTTS2 dataset can be used to build robust Mongolian multi-speaker TTS models.
... Worldwide, there are about 6 million users. At the same time, Mongolian is also the main national language of China's Inner Mongolia Autonomous Region [1]. Therefore, the study of Mongolian-oriented speech synthesis technology is of great significance to the fields of education, transportation, and communication in minority areas. ...
Preprint
Recurrent Neural Networks (RNNs) have become the standard modeling technique for sequence data and are used in a number of novel text-to-speech models. However, training a TTS model that includes RNN components places high demands on GPU performance and takes a long time. In contrast, studies have shown that CNN-based sequence synthesis can greatly reduce training time in text-to-speech models while maintaining comparable performance, owing to its high parallelism. We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units). At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask. The final experimental results show that a TTS model using only CNN components can reduce training time compared to classic TTS models such as Tacotron while ensuring the quality of the synthesized speech.
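The masking augmentations named above can be sketched as follows (SpecAugment-style masking on a mel spectrogram; the mask widths and NumPy layout are assumptions, not the paper's settings):

```python
# Sketch of Frequency Mask and Time Mask augmentations on a mel
# spectrogram of shape (mel_channels, frames). Widths are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def freq_mask(spec, max_width=10):
    """Zero out a random band of mel channels."""
    f = rng.integers(0, max_width)
    f0 = rng.integers(0, spec.shape[0] - f)
    spec = spec.copy()
    spec[f0:f0 + f, :] = 0.0
    return spec

def time_mask(spec, max_width=20):
    """Zero out a random span of frames."""
    t = rng.integers(0, max_width)
    t0 = rng.integers(0, spec.shape[1] - t)
    spec = spec.copy()
    spec[:, t0:t0 + t] = 0.0
    return spec

mel = rng.standard_normal((80, 400))   # dummy mel spectrogram
augmented = time_mask(freq_mask(mel))  # apply both masks
```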
... The neural TTS solutions [1], [2] are examples. Unlike the conventional TTS pipeline [3]-[6], the neural solutions employ a network to learn the mapping directly from <text, wav> pairs [7]. For instance, Tacotron [1], Tacotron2 [2] and their variants [8], [9] are based on an encoder-decoder architecture with an attention mechanism [10]. ...
... In a preliminary study, we evaluate the effect of the linear-decay [26] scheduled sampling strategy in model training. We find that the quality and clarity of speech deteriorate significantly when we linearly decay the sampling probability from 1 to 0. When the sampling probability is reduced to 0, training reduces to a free-running mode. While the free-running mode matches the inference process, it does not guarantee that the model learns to predict the real data well. ...
... TensorSpeech/TensorFlowTTS/blob/master/preprocess/ljspeech_preprocess.yaml Note that this finding was based on the linear decay strategy and 150 K training steps. In our preliminary experiments, using better decay strategies (such as the exponential decay in the original scheduled sampling paper) and training for more steps (such as 600 K steps), we could further improve the decoding performance after the sampling probability drops to 0. However, it is not the focus of this paper to study how to improve the scheduled sampling training strategy. ...
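A minimal sketch of the linear-decay schedule being discussed, assuming the 150 K-step horizon mentioned in the note:

```python
# Sketch of linear-decay scheduled sampling: the probability of feeding
# the ground-truth frame to the decoder falls linearly from 1 to 0.
import random

def sampling_prob(step, total_steps=150_000):
    """Probability of using the ground-truth frame at this step."""
    return max(0.0, 1.0 - step / total_steps)

def pick_decoder_input(step, ground_truth_frame, predicted_frame):
    # With probability p take the real frame (teacher forcing),
    # otherwise the model's own prediction (free running).
    p = sampling_prob(step)
    return ground_truth_frame if random.random() < p else predicted_frame

for step in (0, 75_000, 150_000):
    print(step, sampling_prob(step))   # 1.0, 0.5, 0.0
```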
Article
Full-text available
Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, which arises from the mismatch between the training and inference processes in autoregressive models, remains an issue. It often leads to performance degradation in the face of out-of-domain test data. To address this problem, we study a novel decoding knowledge transfer strategy and propose a multi-teacher knowledge distillation (MT-KD) network for the Tacotron2 TTS model. The idea is to pre-train two Tacotron2 TTS teacher models in teacher forcing and scheduled sampling modes, and to transfer the pre-trained knowledge to a student model that performs free-running decoding. We show that the MT-KD network provides an adequate platform for neural TTS training, where the student model learns to emulate the behaviors of the two teachers while minimizing the mismatch between training and run-time inference. Experiments on both Chinese and English data show that the MT-KD system consistently outperforms competitive baselines in terms of naturalness, robustness and expressiveness on in-domain and out-of-domain test data. Furthermore, we show that knowledge distillation outperforms adversarial learning and data augmentation in addressing the exposure bias problem.
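The MT-KD objective can be sketched as a weighted sum of a feature loss against the target and two distillation terms against the teachers; the loss types and weights below are assumptions, not the paper's exact formulation:

```python
# Sketch of a multi-teacher knowledge distillation loss: the student's
# free-running outputs are pulled toward the target and toward the
# outputs of the teacher-forcing and scheduled-sampling teachers.
import torch
import torch.nn.functional as F

def mt_kd_loss(student_out, tf_teacher_out, ss_teacher_out, target,
               w_tf=0.5, w_ss=0.5, w_feat=1.0):
    feat = F.l1_loss(student_out, target)            # feature loss
    kd_tf = F.mse_loss(student_out, tf_teacher_out)  # distill teacher 1
    kd_ss = F.mse_loss(student_out, ss_teacher_out)  # distill teacher 2
    return w_feat * feat + w_tf * kd_tf + w_ss * kd_ss

s = torch.randn(1, 200, 80, requires_grad=True)      # student mels
t1 = torch.randn(1, 200, 80)                         # TF teacher mels
t2 = torch.randn(1, 200, 80)                         # SS teacher mels
y = torch.randn(1, 200, 80)                          # ground truth
print(mt_kd_loss(s, t1, t2, y))
```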
... With the advent of deep learning, neural TTS has shown many advantages over the conventional TTS techniques [1]-[3]. For example, encoder-decoder architectures with attention mechanisms, such as Tacotron [4]-[7], have consistently achieved high voice quality. ...
... (1) What did he say to that? (2) Where would be the use? (3) Where is it? (4) The soldiers then? ...
Article
Full-text available
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that improves speech styling at the utterance level. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique requires no prosody annotations in the training data. Nor does it attempt to model prosody explicitly; rather, it encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. This study marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. It adopts a combination of two objective functions: 1) a frame-level reconstruction loss, calculated between the synthesized and target spectral features; 2) an utterance-level style reconstruction loss, calculated between the deep style features of synthesized and target speech. The style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms the state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function in Tacotron training for improved expressiveness. [Speech samples: https://ttslr.github.io/Expressive-TTS-Training-with-Frame-and-Style-Reconstruction-Loss/]
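A minimal sketch of the two-part objective, with a placeholder network standing in for the deep style feature extractor; the pooling and loss weighting are assumptions:

```python
# Sketch of a frame-level reconstruction loss plus an utterance-level
# style (perceptual) loss computed on pooled deep style features.
import torch
import torch.nn as nn
import torch.nn.functional as F

style_extractor = nn.Sequential(   # stand-in for a deep style encoder
    nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))

def total_loss(pred_mels, target_mels, style_weight=1.0):
    frame_loss = F.l1_loss(pred_mels, target_mels)   # frame level
    # Utterance-level style features: pool deep features over time.
    s_pred = style_extractor(pred_mels).mean(dim=1)
    s_tgt = style_extractor(target_mels).mean(dim=1)
    style_loss = F.mse_loss(s_pred, s_tgt)           # perceptual term
    return frame_loss + style_weight * style_loss

pred, tgt = torch.randn(1, 200, 80), torch.randn(1, 200, 80)
print(total_loss(pred, tgt))
```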
... In the recent past, concatenative (Hunt & Black, 1996; Merritt, Clark, Wu, Yamagishi, & King, 2016) and statistical parametric speech synthesis (Liu, Bao, Gao, & Wang, 2017; Tokuda et al., 2013; Zen & Sak, 2015; Zen, Senior, & Schuster, 2013) systems were the mainstream techniques. We note ...
Article
A non-autoregressive architecture for neural text-to-speech (TTS) allows for parallel implementation and thus reduces inference time compared with its autoregressive counterpart. However, such a system architecture does not explicitly model the temporal dependency of the acoustic signal, as it generates individual acoustic frames independently. The lack of temporal modeling often adversely impacts speech continuity and thus voice quality. In this paper, we propose a novel neural TTS model denoted FastTalker. We study two strategies for high-quality speech synthesis at low computational cost. First, we add a shallow autoregressive acoustic decoder on top of the non-autoregressive context decoder to retrieve the temporal information of the acoustic signal. Second, we implement group autoregression to further accelerate inference in the autoregressive acoustic decoder. The group-based autoregressive acoustic decoder generates acoustic features as a sequence of groups instead of frames, each group having multiple consecutive frames; within a group, the acoustic features are generated in parallel. With shallow and group autoregression, FastTalker retrieves the temporal information of the acoustic signal while keeping the fast-decoding property. The proposed FastTalker achieves a good balance between speech quality and inference speed. Experiments show that, in terms of voice quality and naturalness, FastTalker significantly outperforms the non-autoregressive FastSpeech baseline and is on par with the autoregressive baselines. It also shows a considerable inference speedup over Tacotron2 and Transformer TTS. [Speech samples: https://ttslr.github.io/FastTalker/] Our deepest gratitude goes to the AE and the anonymous reviewers for their careful work!
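The group autoregression idea can be sketched as follows; the group size, group count, and the stand-in linear decoder are assumptions:

```python
# Sketch of group autoregression: acoustic frames are generated group
# by group, with all frames inside a group produced in parallel.
import torch
import torch.nn as nn

GROUP, MEL_DIM, N_GROUPS = 4, 80, 50     # assumed group size / counts

decoder = nn.Linear(GROUP * MEL_DIM, GROUP * MEL_DIM)  # stand-in decoder

prev_group = torch.zeros(1, GROUP * MEL_DIM)   # "go" group
outputs = []
for _ in range(N_GROUPS):                      # autoregress over groups;
    next_group = decoder(prev_group)           # frames within a group
    outputs.append(next_group.view(1, GROUP, MEL_DIM))  # are parallel
    prev_group = next_group

mels = torch.cat(outputs, dim=1)
print(mels.shape)                              # torch.Size([1, 200, 80])
```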
... We now evaluate the effectiveness of the proposed enhanced word embedding techniques in phrase break prediction, and their contributions to a DNN-based Mongolian TTS system [76]. ...
... Mongolian speech data: We conducted experiments on a well-formulated Mongolian TTS database with rich phonetic and prosodic content [62], [76]. The database contains Mongolian daily expressions recorded by a female native speaker. ...
... We conducted listening tests to further evaluate the contributions of phrase break prediction to speech synthesis quality. We compare four DNN-based Mongolian TTS systems [76], which differ only in terms of prosodic break inputs. ...
Article
Full-text available
Prosodic phrasing is an important factor affecting naturalness and intelligibility in text-to-speech synthesis. Studies show that deep learning techniques improve prosodic phrasing when large text and speech corpora are available. However, for low-resource languages such as Mongolian, prosodic phrasing remains a challenge for various reasons. First, databases suitable for system training are limited; second, prosody-informing word composition knowledge has not been used in prosodic phrase modeling. To address these problems, we propose a feature augmentation method in conjunction with a self-attention neural classifier. We augment the input text with morphological and phonological decompositions of words to enhance the text encoder, and we study the use of a self-attention classifier, which makes use of the global context of a sentence, as the decoder for phrase break prediction. Both objective and subjective evaluations validate the effectiveness of the proposed phrase break prediction framework, which consistently improves voice quality in a Mongolian text-to-speech synthesis system.
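A minimal sketch of the feature-augmented classifier, assuming toy vocabulary sizes and a generic Transformer encoder in place of the paper's exact architecture:

```python
# Sketch of feature augmentation for phrase break prediction: word
# embeddings are concatenated with morphological and phonological
# decomposition embeddings before a self-attention classifier predicts
# break / no-break per word. All sizes are assumptions.
import torch
import torch.nn as nn

class PhraseBreakPredictor(nn.Module):
    def __init__(self, n_words=5000, n_morph=500, n_phon=100, dim=64):
        super().__init__()
        self.word = nn.Embedding(n_words, dim)
        self.morph = nn.Embedding(n_morph, dim)
        self.phon = nn.Embedding(n_phon, dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=3 * dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.out = nn.Linear(3 * dim, 2)        # break / no-break

    def forward(self, words, morphs, phons):
        x = torch.cat([self.word(words), self.morph(morphs),
                       self.phon(phons)], dim=-1)   # augmented input
        return self.out(self.encoder(x))            # per-word logits

model = PhraseBreakPredictor()
w = torch.randint(0, 5000, (1, 12))
m = torch.randint(0, 500, (1, 12))
p = torch.randint(0, 100, (1, 12))
print(model(w, m, p).shape)                  # torch.Size([1, 12, 2])
```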
... With the advent of deep learning, end-to-end TTS has shown many advantages over the conventional TTS techniques [1][2][3]. For example, Tacotron-based approaches [4][5][6][7] with an encoder-decoder architecture and attention mechanism have been shown to achieve remarkable performance. ...
Conference Paper
Full-text available
While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference processes, which results in unpredictable performance on out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this serves as the teacher model. We then train another Tacotron2-based model as a student model, whose decoder takes the predicted speech frames as input, similar to how the decoder works during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, which is called knowledge distillation. Experiments show that our proposed training scheme consistently improves voice quality on out-of-domain test data in both Chinese and English systems.
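The two decoding modes and the added distillation term can be sketched as follows; for brevity a single stand-in decoder step is shared between teacher and student, whereas the paper trains two separate Tacotron2 models:

```python
# Sketch of teacher-student training: the teacher decodes with
# ground-truth frames (teacher forcing), the student with its own
# predictions (free running), and a distillation loss pulls the
# student's outputs toward the teacher's.
import torch
import torch.nn as nn
import torch.nn.functional as F

step = nn.Linear(80, 80)                       # stand-in decoder step

def decode(frames, free_running=False, n_steps=20):
    out, prev = [], torch.zeros(1, 80)
    for t in range(n_steps):
        pred = step(prev)
        out.append(pred)
        prev = pred if free_running else frames[:, t]  # input choice
    return torch.stack(out, dim=1)

target = torch.randn(1, 20, 80)
teacher_out = decode(target, free_running=False)       # teacher forcing
student_out = decode(target, free_running=True)        # free running
loss = F.l1_loss(student_out, target) \
     + F.mse_loss(student_out, teacher_out.detach())   # distillation
print(loss)
```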
... In this evaluation, a set of 20 sentences was randomly selected from the test set, and synthesized speech was generated through the DNN-based Mongolian TTS system [4] using both the proposed front-end and the original front-end [4]. 20 subjects were asked to choose the better of each pair of synthesized speech samples. ...
... Preference: subjective evaluation results of the DNN-based Mongolian TTS using the proposed front-end and the original front-end in [4]. ...
Chapter
Full-text available
In the context of text-to-speech (TTS) systems, the front-end is a critical step for extracting linguistic features from the given input text. In this paper, we propose a Mongolian TTS front-end that jointly trains grapheme-to-phoneme conversion (G2P) and phrase break prediction (PB). We use a bidirectional long short-term memory (LSTM) network on the encoder side and build two decoders for G2P and PB that share the same encoder. Meanwhile, we feed the source input features and encoder hidden states together into the decoder, aiming to shorten the distance between the source and target sequences and learn the alignment information better. More importantly, to obtain a robust representation for Mongolian words, which are agglutinative in nature and lack a sufficient training corpus, we design specific multi-view input features. Our subjective and objective experiments demonstrate the effectiveness of this proposal.
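A minimal sketch of the shared-encoder, two-decoder layout, with the attention-based decoders simplified to per-token classification heads; the vocabulary, phone set, and dimensions are assumptions:

```python
# Sketch of a joint TTS front-end: one BiLSTM encoder shared by a
# grapheme-to-phoneme (G2P) head and a phrase break prediction (PB)
# head. All sizes are assumptions.
import torch
import torch.nn as nn

class JointFrontEnd(nn.Module):
    def __init__(self, vocab=1000, n_phones=72, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True,
                               bidirectional=True)     # shared encoder
        self.g2p_head = nn.Linear(2 * dim, n_phones)   # G2P decoder
        self.pb_head = nn.Linear(2 * dim, 2)           # PB decoder

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.g2p_head(h), self.pb_head(h)

model = JointFrontEnd()
tokens = torch.randint(0, 1000, (1, 15))
g2p_logits, pb_logits = model(tokens)
print(g2p_logits.shape, pb_logits.shape)   # (1, 15, 72) (1, 15, 2)
```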
... Furthermore, the morphological vector (mi) and acoustic vector (ti, si, ei) both play a very good role in the performance of the two tasks. Specifically, in G2P, the introduction of the acoustic ... Figure 2: MOS of the DNN-based Mongolian TTS using the proposed front-end and [27]. ...
... To evaluate the naturalness of the Mongolian speech synthesized with the proposed front-end component under the DNN-based Mongolian TTS [27], a subjective listening test was conducted. The naturalness of the synthesized speech was assessed by the mean opinion score (MOS) test method. ...
Conference Paper
In the context of text-to-speech (TTS) systems, the front-end is a critical step for extracting linguistic features from the given input text. In this paper, we propose a Mongolian TTS front-end with an encoder-decoder neural network model using a bridge method and multi-view features, which jointly trains grapheme-to-phoneme conversion (G2P) and phrase break prediction (PB). This is the first attempt to model G2P and PB in a unified framework. For the encoder-decoder model, we use a bidirectional long short-term memory (LSTM) network on the encoder side and build two decoders for G2P and PB that share the same encoder. For the bridge method on the encoder side, we feed the source input features and encoder hidden states together into the decoder, aiming to shorten the distance between the source and target sequences and learn the alignment information better. To obtain a robust representation for Mongolian words, which are agglutinative in nature and lack a sufficient training corpus, we design specific multi-view input features, in which we learn vector representations of the morphological (stem & suffix) and acoustic sequences corresponding to single Mongolian words. Our subjective and objective experiments demonstrate the effectiveness of this proposal.
... Zhao applied the HMM to Mongolian speech synthesis in 2014 [17]. Then Liu introduced DNNs to Mongolian TTS in 2017 [18]. However, limited by the available annotated data, neither of these two methods can produce natural-sounding, or even clear, synthesized speech, which does not satisfy the requirements of real applications. ...
... Because of the annotation cost, as far as we know, there is no large-scale speech synthesis corpus for the Mongolian language. The largest Mongolian speech synthesis corpora are reported in [17] and [18]. Through private communication with their common corresponding author, we learned that the current largest speech synthesis corpus contains about 5 hours of Mongolian recording data and has been used to train their DNN-based and HMM-based TTS systems. ...
... We compared the proposed end-to-end Mongolian speech synthesis system with the HMM-based [17] and DNN-based [18] Mongolian TTS systems. The detailed model configurations are described as follows. ...
Conference Paper
Full-text available
Speech synthesis, or text-to-speech (TTS), generates a speech waveform for given text. To build a satisfactory TTS system, a large natural speech corpus is required. In the traditional approach, the corpus must be accompanied by precise annotations. However, annotation is difficult and costly. Recently, end-to-end speech synthesis methods have been proposed, which eliminate the requirement for annotation. The end-to-end methods make the development of a TTS system less costly and easier. We used the state-of-the-art end-to-end Tacotron model for the Mongolian TTS task. With much more unannotated speech data (about 17 hours), the new system beats the previous best Mongolian TTS system, which was trained on a small amount of annotated data (about 5 hours), by a large margin. The new mean opinion score (MOS) is 3.65, versus 2.08 for the old system. The proposed system becomes the first Mongolian TTS system that can be utilized in real applications.