Contexts in source publication
Context 1
... evaluation results are shown in Table 1, where the synthetic speech of English and Mandarin is rated separately. It can be seen that our model achieves the same performance as AlignTTS and outperforms Tacotron 2 by 0.09 in synthesizing English. ...
Citations
... There have been several approaches proposed for style transfer, and they can be classified into: 1) Coarse-Grained Style Transfer (CST) techniques [5] [6] and 2) Fine-Grained Style Transfer (FST) techniques [7] [8] [9] [10] [11]. While CST techniques focus on capturing sentence-level style features such as emotion, which can be transferred across sentences of different text [12], FST techniques focus on capturing style features such as rhythm, emphasis, melody, and loudness, which cannot necessarily be transferred between sentences of different text [13]. Both CST and FST techniques obtain latent representations from either a mel-spectrogram or hand-crafted features. ...
... There are some other reasons that can cause robustness issues, such as the test domain not being well covered by the training domain. Research that scales to unseen domains can alleviate this issue, for example by increasing the amount and diversity of the training data [127] or adopting relative position encoding to support long sequences unseen in training [17,423]. ...
... Language/speaker level: multi-lingual/multi-speaker TTS [438,241,38]; paragraph level: long-form reading [11,389,370]; utterance level: timbre/prosody/noise [303,377,139,315,202,39]; word/syllable level: fine-grained information [319,113,44,329]; character/phoneme level [318,423,319,44,39,185]; frame level [184,155,48,427]. Information type: we can categorize the works according to the type of information being modeled: 1) explicit information, where labels for the variation information can be obtained explicitly, and 2) implicit information, where the variation information can only be obtained implicitly. ...
... 4) Word/syllable level [319,113,44,329], which can model the fine-grained style/prosody information that cannot be covered by utterance level information. 5) Character/phoneme level [318,423,319,44,39,185], such as duration, pitch or prosody information. 6) Frame level [184,155,48,427], the most fine-grained information. ...
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, as well as several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.
... We also add to the model an acoustic feature predictor (AFP) to predict per-phone acoustic feature values, given the encoder outputs [12,13]. This enables the model to produce natural synthesised speech from text alone, without requiring any additional inputs, whilst offering the option of control when desired. ...
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: F0, energy, and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can significantly further increase naturalness.
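The conditioning mechanism described in this abstract can be pictured as attaching the per-phone acoustic values to the encoder outputs before decoding. The sketch below is only an illustration of that idea; the module name ProsodyConditioning, the dimensions, and the single linear projection are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: conditioning a TTS decoder on per-phone acoustic
# correlates of prosody (F0, energy, duration). Names and shapes are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn

class ProsodyConditioning(nn.Module):
    def __init__(self, enc_dim=256, n_features=3):
        super().__init__()
        # Project [encoder output ; F0, energy, duration] back to enc_dim.
        self.proj = nn.Linear(enc_dim + n_features, enc_dim)

    def forward(self, encoder_out, acoustic_feats):
        # encoder_out:    (batch, n_phones, enc_dim)
        # acoustic_feats: (batch, n_phones, 3) -- externally provided,
        #                 predicted from text, or predicted then edited.
        return self.proj(torch.cat([encoder_out, acoustic_feats], dim=-1))

cond = ProsodyConditioning()
enc = torch.randn(2, 17, 256)    # dummy encoder outputs
feats = torch.randn(2, 17, 3)    # per-phone F0, energy, duration
decoder_in = cond(enc, feats)    # (2, 17, 256)
```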
... provide the phoneme-level acoustic conditions. In order to extract phoneme-level information from speech, we first average the speech frames corresponding to the same phoneme according to the alignment between the phoneme and mel-spectrogram sequences (shown in Figure 2a), converting the length of the speech frame sequence into the length of the phoneme sequence, similar to Sun et al. (2020) and Zeng et al. (2020). During inference, we use a separate phoneme-level acoustic predictor (shown in Figure 2d), which is built upon the original phoneme encoder, to predict the phoneme-level vectors. ...
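The phoneme-level averaging step described here can be illustrated in a few lines of NumPy: frames are grouped by the per-phoneme durations obtained from the alignment and averaged. The function name and shapes below are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of phoneme-level averaging: collapse mel-spectrogram frames
# belonging to the same phoneme into one vector using the phoneme durations
# obtained from forced alignment. Variable names are illustrative.
import numpy as np

def average_frames_per_phoneme(mel, durations):
    """mel: (n_frames, n_mels); durations: frames per phoneme, summing to n_frames."""
    assert sum(durations) == mel.shape[0]
    out, start = [], 0
    for d in durations:
        out.append(mel[start:start + d].mean(axis=0))
        start += d
    return np.stack(out)                 # (n_phonemes, n_mels)

mel = np.random.randn(100, 80)           # dummy mel-spectrogram
durs = [10, 25, 30, 20, 15]              # e.g., from an MFA alignment
phone_level = average_frames_per_phoneme(mel, durs)   # (5, 80)
```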
... Using speech encoders to extract a single vector or a sequence of vectors to represent the characteristics of a speech sequence has been adopted in previous works (Jia et al., 2018; Cooper et al., 2020; Zeng et al., 2020). These representations are usually leveraged to improve the speaker timbre or prosody of the TTS model, or to improve the controllability of the model. ...
... The phoneme-level acoustic encoder (Figure 2c) and predictor (Figure 2d) share the same structure, which consists of 2 convolutional layers with a filter size of 256 and a kernel size of 3, and a linear layer that compresses the hidden representation to a dimension of 4 (we choose a dimension of 4 according to our preliminary study, which is also consistent with previous works such as Zeng et al. (2020)). We use MFA (McAuliffe et al., 2017) to extract the alignment between the phoneme and mel-spectrogram sequences, which is used to prepare the input of the phoneme-level acoustic encoder. ...
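The layer sizes quoted in this snippet (two convolutions with 256 filters and kernel size 3, followed by a linear projection to dimension 4) can be sketched as follows; the activations, padding, and the 80-dimensional mel input are assumptions added on top of what the snippet states.

```python
# Sketch of the described phoneme-level acoustic encoder/predictor layout:
# two 1-D convolutions (256 filters, kernel size 3) and a linear layer that
# compresses each phoneme's hidden vector to 4 dimensions. Details such as
# activations and normalization are assumptions.
import torch
import torch.nn as nn

class PhonemeLevelAcousticEncoder(nn.Module):
    def __init__(self, in_dim=80, hidden=256, out_dim=4, kernel=3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )
        self.linear = nn.Linear(hidden, out_dim)

    def forward(self, x):
        # x: (batch, n_phones, in_dim) -- phoneme-averaged mel features
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.linear(h)            # (batch, n_phones, 4)

enc = PhonemeLevelAcousticEncoder()
print(enc(torch.randn(2, 17, 80)).shape)  # torch.Size([2, 17, 4])
```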
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using only a small amount of speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that can be very different from the source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) using little adaptation data, e.g., 20 sentences (about 1 minute of speech). Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K speaker-specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.
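A minimal sketch of the conditional layer normalization idea mentioned in this abstract, assuming the layer-norm scale and bias are generated from a speaker embedding so that only these small generators and the speaker embedding need to be adapted per speaker. Layer sizes and the use of simple linear generators are assumptions, not the AdaSpeech implementation.

```python
# Hypothetical sketch of conditional layer normalization: the affine scale
# and bias are produced from the speaker embedding rather than being fixed
# learned parameters, keeping the per-speaker adaptation footprint small.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden=256, spk_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_scale = nn.Linear(spk_dim, hidden)   # speaker-specific scale
        self.to_bias = nn.Linear(spk_dim, hidden)    # speaker-specific bias

    def forward(self, x, spk_emb):
        # x: (batch, time, hidden); spk_emb: (batch, spk_dim)
        scale = self.to_scale(spk_emb).unsqueeze(1)
        bias = self.to_bias(spk_emb).unsqueeze(1)
        return self.norm(x) * scale + bias

cln = ConditionalLayerNorm()
out = cln(torch.randn(2, 50, 256), torch.randn(2, 256))   # (2, 50, 256)
```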
... Besides the conversion of timbre, conversions can also be conducted in various domains such as prosody, pitch, rhythm, or other non-linguistic domains. Representation learning methods for these speech factors have already been proposed and applied in many research fields in speech processing [2,3,4,5,6,7]. However, directly applying the speech representations extracted by these methods in VC may cause unexpected conversions of other speech factors, as the representations are not necessarily orthogonal. Therefore, disentangling the representations of the various intermingled informative factors in the speech signal is crucial to achieve highly controllable VC [8]. ...
Factorizing speech into disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC only factorize speech into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors rely on simple disentanglement mechanisms such as random resampling and ad hoc bottleneck layer size adjustment, which, however, makes it hard to ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm, and pitch are extracted, and further disentangled by an adversarial network inspired by BERT. The adversarial network is used to minimize the correlations between the speech representations, by randomly masking one of the representations and predicting it from the others. A word prediction network is also adopted to learn a more informative content representation. Experimental results show that the proposed speech representation learning framework significantly improves the robustness of VC on multiple factors, increasing the conversion rate from 48.2% to 57.1% and exceeding the state-of-the-art method by 31.2% in ABX preference.
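The mask-and-predict objective described in this abstract can be sketched roughly as follows: one of the four representations is masked at random and a predictor tries to reconstruct it from the other three, while the encoders would be trained adversarially to defeat the predictor. The toy dimensions, MLP predictor, and MSE loss are assumptions, not the paper's architecture.

```python
# Rough sketch of the masking-and-prediction idea: randomly mask one of the
# four representations (content, timbre, rhythm, pitch) and predict it from
# the remaining three; the predictor minimizes this loss, while the encoders
# would be updated to maximize it, reducing correlation between factors.
import random
import torch
import torch.nn as nn

dim = 64
predictor = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

def masked_prediction_loss(reps):
    """reps: list of four (batch, dim) representations."""
    i = random.randrange(len(reps))                  # pick one to mask
    target = reps[i]
    others = torch.cat([r for j, r in enumerate(reps) if j != i], dim=-1)
    pred = predictor(others)
    return nn.functional.mse_loss(pred, target)

reps = [torch.randn(8, dim) for _ in range(4)]       # dummy content/timbre/rhythm/pitch
loss = masked_prediction_loss(reps)
```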
... They reached improved BLEU scores in machine translation, but their approach (and refinements by Huang et al. (2019)) is hard to parallelize, which is unattractive in a world driven by parallel computing. Zeng et al. (2020) used relative attention in speech synthesis, letting each query interact with separate matrix transformations for each key vector, depending on their relative-distance offset. Raffel et al. ...
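A compact sketch of relative-position attention in the spirit of this description: besides the usual content term, each query also attends to a learned embedding indexed by its (clipped) relative offset to the key. The embedding table size, clipping window, and single-head formulation are illustrative assumptions rather than the cited implementation.

```python
# Sketch of relative-position attention: attention logits are the sum of a
# content term (q · k) and a position term (q · relative-offset embedding).
import torch
import torch.nn as nn

d, max_rel = 64, 8
rel_emb = nn.Embedding(2 * max_rel + 1, d)   # one vector per relative offset

def relative_attention(q, k, v):
    # q, k, v: (batch, length, d)
    T = q.size(1)
    offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]
    offsets = offsets.clamp(-max_rel, max_rel) + max_rel          # (T, T) indices
    content = torch.matmul(q, k.transpose(-1, -2))                # q · k term
    position = torch.einsum("btd,tsd->bts", q, rel_emb(offsets))  # q · rel term
    attn = torch.softmax((content + position) / d ** 0.5, dim=-1)
    return torch.matmul(attn, v)

out = relative_attention(*[torch.randn(2, 10, d) for _ in range(3)])  # (2, 10, 64)
```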
... Since small changes in voice quality or latency can have a large impact on customer experience, it is necessary for speech product design to synthesize high-quality speech efficiently, especially for smart speakers and dialogue robots. Currently, the general speech synthesis pipeline is composed of two components [1,2,3,4,5,6,7,8,9]: (1) a speech feature prediction model that transforms the text into time-aligned speech features, for example a mel-spectrum, and (2) a neural vocoder that generates the raw waveform from the speech features. Our work focuses on the second component, efficiently generating high-quality speech from the mel-spectrum. ...
Recent neural vocoders usually use a WaveNet-like network to capture the long-term dependencies of the waveform, but a large number of parameters are required to obtain good modeling capability. In this paper, an efficient network, named location-variable convolution, is proposed to model the dependencies of waveforms. Different from the unified convolution kernels used in WaveNet to capture the dependencies of arbitrary waveforms, location-variable convolution utilizes a kernel predictor to generate multiple sets of convolution kernels based on the mel-spectrum, where each set of convolution kernels is used to perform convolution operations on its associated waveform intervals. Combining WaveGlow and location-variable convolutions, an efficient vocoder, named MelGlow, is designed. Experiments on the LJSpeech dataset show that MelGlow achieves better performance than WaveGlow at small model sizes, which verifies the effectiveness and potential optimization space of location-variable convolutions.
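The location-variable convolution idea can be sketched in simplified form: a kernel predictor maps each mel frame to a small convolution kernel, which is applied only to the waveform interval associated with that frame. The single-channel kernels, linear kernel predictor, and hop/kernel sizes below are illustrative assumptions; the actual MelGlow design combines this mechanism with WaveGlow.

```python
# Simplified sketch of a location-variable convolution: per-frame kernels are
# predicted from the mel-spectrum and each kernel convolves only its own
# waveform interval (one hop of samples per mel frame).
import torch
import torch.nn as nn
import torch.nn.functional as F

hop, kernel, n_mels = 256, 3, 80
kernel_predictor = nn.Linear(n_mels, kernel)       # one 1-channel kernel per frame

def location_variable_conv(wav, mel):
    # wav: (batch, 1, n_frames * hop); mel: (batch, n_frames, n_mels)
    kernels = kernel_predictor(mel)                # (batch, n_frames, kernel)
    out = []
    for t in range(mel.size(1)):                   # one interval per mel frame
        seg = F.pad(wav[:, :, t * hop:(t + 1) * hop], (kernel // 2, kernel // 2))
        w = kernels[:, t].unsqueeze(1)             # (batch, 1, kernel)
        # grouped conv applies each example's predicted kernel to its own segment
        out.append(F.conv1d(seg.reshape(1, -1, seg.size(-1)), w, groups=wav.size(0)))
    return torch.cat(out, dim=-1).view(wav.size(0), 1, -1)

wav = torch.randn(2, 1, 4 * hop)
mel = torch.randn(2, 4, n_mels)
out = location_variable_conv(wav, mel)             # (2, 1, 4 * hop)
```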
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of the models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area and to offer guidance to interested researchers and future endeavors in this field.