Contexts in source publication

Context 1
... with the system of [22], our system is significantly better at preserving the source content. The comparison is shown in Table 3. ...
Context 2
... with the system of [22], our system is significantly better at preserving the source content. The comparison is shown in Table 3. ...

Similar publications

Article
Full-text available
This paper presents the theoretical foundations of a proposed method for determining the fraction of acceptor molecules participating in triplet-triplet energy transfer, out of their total number, in frozen solutions of organic compounds. The computational model includes energy-donor molecules with a rate constant of triplet-excitation deactivation exceeding the corresp...
Article
Full-text available
We consider the M-neighbor approximation in the problem of one-qubit pure state transfer along the N-node zigzag and alternating spin chains governed by the XXZ-Hamiltonian with the dipole-dipole interaction. We show that M > 1 always holds, i.e., the nearest-neighbor approximation is not applicable to such an interaction. Moreover, only all-node interaction...

Citations

... Several studies adopt multi-level prosodic features to enhance TTS performance. For example, phone-level content-style disentanglement [45] and multi-resolution VAEs [46] have been successfully applied to generate multi-scale style embeddings, which are then integrated into TTS models based on VAEs [47], VITS [48], [49], or diffusion models [50]. Research efforts [51], [52] have further refined this approach by modeling hierarchical prosody at both the phoneme and word levels. ...
Preprint
Full-text available
Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
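As an illustration of the hierarchical emotion distribution (ED) idea described above, the following sketch (not the authors' code) shows one plausible way to assemble a quantifiable ED embedding at the phoneme, word, and utterance levels; the `score_emotions` helper is a hypothetical stand-in for a pre-trained intensity scorer such as an SER or relative-attribute ranker.

```python
# Minimal sketch (not the authors' code): assembling a hierarchical
# emotion-distribution (ED) embedding at phoneme, word and utterance levels.
# `score_emotions` is a hypothetical stand-in for a pre-trained scorer.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def score_emotions(frames: np.ndarray) -> np.ndarray:
    """Hypothetical scorer: map frame-level acoustic features to a
    soft emotion-intensity distribution (one value per emotion)."""
    logits = frames.mean(axis=0) @ np.random.randn(frames.shape[1], len(EMOTIONS))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def hierarchical_ed(frames, phone_bounds, word_bounds):
    """Return one ED vector per phoneme: [phoneme ED | word ED | utterance ED]."""
    utt_ed = score_emotions(frames)
    word_ed = [score_emotions(frames[s:e]) for s, e in word_bounds]
    phone_vecs = []
    for s, e in phone_bounds:
        ph_ed = score_emotions(frames[s:e])
        # find the word segment that contains this phoneme
        w = next(i for i, (ws, we) in enumerate(word_bounds) if ws <= s < we)
        phone_vecs.append(np.concatenate([ph_ed, word_ed[w], utt_ed]))
    return np.stack(phone_vecs)          # (num_phonemes, 3 * num_emotions)

# toy usage: 100 frames of 80-dim features, 4 phonemes grouped into 2 words
frames = np.random.randn(100, 80)
ed = hierarchical_ed(frames, [(0, 30), (30, 50), (50, 80), (80, 100)],
                     [(0, 50), (50, 100)])
print(ed.shape)  # (4, 12)
```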
... Emotion styles can be controlled by modifying different prosodic cues. Current studies [78], [79] mainly focus on designing the prosody embedding as a control vector that is derived from a representation learning framework. For example, style tokens [36] are designed to represent high-level styles such as speaker style, pitch range and speaking rate. ...
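To make the style-token idea concrete, here is a minimal sketch (an assumption about the general mechanism, not the cited implementation): a reference embedding attends over a small bank of learnable style tokens, and the attention-weighted sum serves as the style embedding; choosing or re-weighting tokens at inference controls the style.

```python
# Minimal sketch (assumption, not the cited implementation) of the style-token
# idea: a reference embedding attends over a bank of learnable style tokens and
# the weighted sum becomes the style embedding.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gst_style_embedding(ref_embed, token_bank, key_proj, query_proj):
    """Single-head attention over the token bank (multi-head in practice)."""
    q = ref_embed @ query_proj                    # (d_attn,)
    k = token_bank @ key_proj                     # (num_tokens, d_attn)
    weights = softmax(k @ q / np.sqrt(q.shape[0]))
    return weights @ token_bank, weights          # style embedding, token weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 256))               # 10 learnable style tokens
ref    = rng.normal(size=128)                     # reference-encoder output
style, w = gst_style_embedding(ref, tokens,
                               rng.normal(size=(256, 64)),
                               rng.normal(size=(128, 64)))
# at inference, choosing or re-weighting specific tokens controls the style
```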
Article
Full-text available
Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
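A minimal sketch of the relative-difference idea, under the assumption that it is realized as a relative-attribute ranking function (the paper's exact formulation may differ): a linear ranker is trained on ordered pairs so that emotional samples score higher than neutral ones, and the resulting score can then be scaled when defining an emotion attribute vector at run-time.

```python
# Minimal sketch (assumptions, not the paper's exact formulation): learning a
# relative-attribute ranking function r(x) = w.x so that emotional samples
# score higher than neutral ones.
import numpy as np

def train_ranker(X_high, X_low, epochs=200, lr=1e-2, margin=1.0):
    """Pairwise hinge-loss ranker: enforce w.x_high - w.x_low >= margin."""
    w = np.zeros(X_high.shape[1])
    for _ in range(epochs):
        for xh in X_high:
            for xl in X_low:
                if w @ xh - w @ xl < margin:       # violated ordering pair
                    w += lr * (xh - xl)            # sub-gradient step
    return w

# toy data: 20-dim acoustic embeddings for "happy" vs. "neutral" utterances
rng = np.random.default_rng(0)
happy   = rng.normal(1.0, 1.0, size=(30, 20))
neutral = rng.normal(0.0, 1.0, size=(30, 20))
w = train_ranker(happy, neutral)

# run-time: scale the attribute score to request a weaker/stronger rendering
intensity = (w @ happy[0]) * 0.5   # e.g., half of the reference intensity
```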
... The most widely used speaker encoder method in zs-TTS is the reference encoder-based approach. The reference encoder takes a reference speech sample as input and outputs a speaker embedding as a single vector [7,14,15] or a sequence of fine-grained vectors [16,17,18]. These approaches typically do not treat timbre and cadence separately but concentrate on extracting speaker characteristics or prosody in an entangled form. ...
... By choosing specific tokens, the model could control the style of synthesized speech. Other studies [118,119,136,137] mainly replace the global style embedding with fine-grained prosody embedding. Some other studies based on variational autoencoders [138] show the effectiveness of controlling the speech style by learning, scaling, or combining disentangled representations [139,140]. ...
Thesis
Full-text available
Speech generation aims to synthesize human-like voices from the input of text or speech. Current speech generation techniques can generate high-quality, natural-sounding speech but do not convey emotional context in human-human interaction. This thesis is focused on modelling and synthesizing realistic emotions for speech generation. Despite considerable research efforts, there exist several open issues, such as the limited generalizability and controllability of the generated emotions, restricting the scope of applications of current studies. The aim of this thesis is to advance the state-of-the-art to overcome the limitations of existing approaches, facilitate more natural human-computer interaction, and bring us one step closer to achieving emotional intelligence. There are two types of approaches to generating various emotional effects: one is directly associating text with emotional prosody ("emotional text-to-speech"); the other is injecting the desired emotion into a speech signal ("emotional voice conversion"). This thesis is devoted to the advancement of knowledge for these two approaches. Firstly, we study emotional voice conversion (EVC) frameworks for seen and unseen speakers and emotions. We first propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion. A variational auto-encoding (VAE)-based encoder-decoder structure is proposed to learn the spectrum and prosody mappings, respectively. The prosody conversion is also modelled with the continuous wavelet transform (CWT) to learn the temporal dependencies. We further study the use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional styles to a new utterance. Secondly, we study sequence-to-sequence (seq2seq) emotion modelling for emotional voice conversion. We propose a novel training strategy leveraging text-to-speech without the need for parallel data. To the best of our knowledge, this is the first work on seq2seq emotional voice conversion that needs only a limited amount of emotional speech data for training. Moreover, the proposed framework can perform many-to-many emotional voice conversion and conduct spectral and duration mapping at the same time. Thirdly, we formulate an emotion intensity modelling technique and propose an emotion intensity control mechanism based on relative attributes. We show that our proposed mechanism outperforms other competing control methods in speech quality and emotion intensity control. We also propose style pre-training and perceptual losses from a pre-trained SER to improve the emotional intelligibility of converted emotional speech. To the best of our knowledge, the proposed framework provides the first fine-grained, effective emotion intensity control in emotional voice conversion. Lastly, we study a novel research problem of mixed emotion synthesis for emotional speech generation. This study is the first to model and synthesize mixed emotions for both emotional text-to-speech and emotional voice conversion. We formulate emotional styles as an attribute and explicitly model the degree of relevance between different emotions through a ranking-based support vector machine. By manually adjusting the relevance, we are able to synthesize mixed emotions and further control the emotion rendering at run-time. We redesign current evaluation metrics to validate our idea.
We further investigate the ability of our proposed framework to synthesize the mixed effects of two conflicting emotions, such as Happy and Angry. We also demonstrate the potential ability to build an emotion transition system with our proposals.
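As a small illustration of the CWT-based prosody modelling mentioned in the thesis, the sketch below (wavelet choice and scales are assumptions, not the thesis code) decomposes a log-F0 contour into multiple temporal scales with PyWavelets.

```python
# Minimal sketch (assumptions about wavelet choice and scales, not the thesis
# code): decomposing a log-F0 contour with the continuous wavelet transform so
# that prosody can be modelled at several temporal scales.
import numpy as np
import pywt

def cwt_f0(logf0, num_scales=10):
    """Return (num_scales, T) wavelet coefficients of a log-F0 contour."""
    logf0 = logf0 - logf0.mean()                 # zero-mean before the CWT
    scales = np.arange(1, num_scales + 1)        # assumed scale set
    coefs, _ = pywt.cwt(logf0, scales, "mexh")   # Mexican-hat wavelet
    return coefs

# toy usage: a 200-frame log-F0 contour
contour = np.log(np.abs(np.random.default_rng(0).normal(180, 30, 200)))
coefs = cwt_f0(contour)
print(coefs.shape)   # (10, 200)
```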
... Considering that the style and semantic information of an utterance are closely related, text embeddings derived from pre-trained language models (PLMs), e.g., Bidirectional Encoder Representations from Transformers (BERT) [29], have been incorporated into TTS models to improve the expressiveness of the synthesized speech [30]-[32]. Fine-grained style representations that model local prosodic variations in speech, such as at the word level [33]-[35] and phoneme level [36], [37], are also considered in some works. ...
... Instead of modeling the global style of speech at the sentence level, some studies focus on local prosodic characteristics [25]-[27], [36]. [25], [26] introduce the idea of utilizing an additional reference attention mechanism to align the extracted style embedding sequence with the phoneme sequence. ...
... [25], [26] introduce the idea of utilizing an additional reference attention mechanism to align the extracted style embedding sequence with the phoneme sequence. Other studies, such as [27], [36], achieve the same goal using forced alignment. Moreover, a recent study considering the multi-scale nature of speech style [24] also draws our attention. ...
Preprint
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from the information within the current sentence, whereas context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2-based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in context and predict style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we analyze the context information and multi-scale style representations, which have not been discussed in previous work.
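A minimal sketch of one way the multi-scale extraction could work (an assumption, not the MSStyleTTS implementation): a frame-level reference encoding is pooled at global, sentence, and subword granularities to obtain style embeddings at the three levels.

```python
# Minimal sketch (assumption about one realization, not the MSStyleTTS code):
# pooling a frame-level reference encoding at global, sentence and subword
# granularities to obtain style embeddings at three levels.
import numpy as np

def pool(frames, spans):
    """Average-pool frame features over each (start, end) span."""
    return np.stack([frames[s:e].mean(axis=0) for s, e in spans])

def multi_scale_styles(frames, sentence_spans, subword_spans):
    global_style   = frames.mean(axis=0, keepdims=True)    # (1, d)
    sentence_style = pool(frames, sentence_spans)           # (num_sentences, d)
    subword_style  = pool(frames, subword_spans)            # (num_subwords, d)
    return global_style, sentence_style, subword_style

# toy usage: 300 frames of a 128-dim reference encoding, 2 sentences, 6 subwords
frames = np.random.randn(300, 128)
g, s, w = multi_scale_styles(frames, [(0, 140), (140, 300)],
                             [(0, 50), (50, 100), (100, 140),
                              (140, 200), (200, 250), (250, 300)])
```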
... Fine-grained representations extracted with a variational reference encoder combined with speaker embeddings [40], as well as collaborative and adversarial learning [41], are successful in disentangling speaker and style content. Finally, apart from manipulating the aforementioned prosodic features, the naturalness of synthesized speech can be improved by focusing on higher-level characteristics such as stress or intonation. ...
Preprint
Full-text available
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
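The clustering step can be illustrated with the following sketch (feature preparation and plain k-means are assumptions; the paper additionally uses balanced and speaker-independent clustering): phoneme-level F0 and duration are normalized and discretized into label sequences that a prosody encoder could consume.

```python
# Minimal sketch (assumptions about feature preparation, not the authors'
# pipeline): discretizing phoneme-level F0 and duration into prosodic labels
# with unsupervised clustering, so each phoneme gets (f0_label, dur_label).
import numpy as np
from sklearn.cluster import KMeans

def prosodic_labels(f0, dur, n_f0=5, n_dur=5):
    # per-speaker F0 normalization would go here in a multispeaker setting
    f0_z  = (f0 - f0.mean()) / (f0.std() + 1e-8)
    dur_z = (dur - dur.mean()) / (dur.std() + 1e-8)
    f0_lab  = KMeans(n_clusters=n_f0,  n_init=10, random_state=0).fit_predict(f0_z[:, None])
    dur_lab = KMeans(n_clusters=n_dur, n_init=10, random_state=0).fit_predict(dur_z[:, None])
    return f0_lab, dur_lab

# toy usage: mean F0 (Hz) and duration (frames) for 200 phonemes
rng = np.random.default_rng(0)
f0_labels, dur_labels = prosodic_labels(rng.normal(180, 40, 200),
                                        rng.integers(3, 25, 200).astype(float))
# these discrete label sequences are what a prosody encoder would consume
```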
... However, HCE still suffers from the absence of local-scale style modeling (e.g., intonation, rhythm, stress). To model and control local prosodic variations in speech, some previous works attempt to predict finer-grained speaking styles from text, such as at the word level [16,17] and phoneme level [18,19]. It is more widely accepted that the style expressions of human speech are multi-scale in nature [20,21], where the global-scale style is usually observed as emotion and the local scale is closer to prosodic variation [22,23]. ...
... Emotion styles can be controlled by modifying different prosodic cues. Current studies [78], [79] mainly focus on designing the prosody embedding as a control vector that is derived from a representation learning framework. For example, style tokens [36] are designed to represent high-level styles such as speaker style, pitch range and speaking rate. ...
Preprint
Full-text available
Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
... By choosing specific tokens, the model could control the style of synthesized speech. Other studies [67], [68], [69], [70] mainly replace the global style embedding with fine-grained prosody embedding. Some other studies based on Variational Autoencoders (VAE) [71] show the effectiveness of controlling the speech style by learning, scaling, or combining disentangled representations [72], [73]. ...
Article
Full-text available
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of the emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate an emotion classification loss and an emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
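To illustrate how the auxiliary objectives described above might enter training, here is a minimal sketch (loss weights and exact terms are assumptions, not the authors' configuration) combining a reconstruction loss with an emotion classification loss and an emotion embedding similarity loss.

```python
# Minimal sketch (weights and exact terms are assumptions) of how an EVC
# training objective could combine reconstruction, emotion classification and
# emotion-embedding similarity losses.
import numpy as np

def cross_entropy(logits, target_idx):
    e = np.exp(logits - logits.max())
    return -np.log(e[target_idx] / e.sum())

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def evc_loss(mel_pred, mel_ref, emo_logits, emo_target,
             emb_pred, emb_ref, w_cls=0.1, w_sim=0.1):
    recon = np.mean(np.abs(mel_pred - mel_ref))          # L1 reconstruction
    cls   = cross_entropy(emo_logits, emo_target)        # emotion classification
    sim   = 1.0 - cosine_sim(emb_pred, emb_ref)          # embedding similarity
    return recon + w_cls * cls + w_sim * sim

# toy usage with random stand-in tensors
rng = np.random.default_rng(0)
loss = evc_loss(rng.normal(size=(80, 100)), rng.normal(size=(80, 100)),
                rng.normal(size=4), 2, rng.normal(size=64), rng.normal(size=64))
```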
... However, HCE still suffers from the absence of local-scale style modeling (e.g., intonation, rhythm, stress). To model and control local prosodic variations in speech, some previous works attempt to predict finer-grained speaking styles from text, such as at the word level [15] and phoneme level [16]. It is more widely accepted that the style expressions of human speech are multi-scale in nature [17,18], where the global-scale style is usually observed as emotion and the local scale is closer to prosodic variation [19,20]. ...
Preprint
Previous works on expressive speech synthesis focus on modelling the mono-scale style embedding from the current sentence or context, but the multi-scale nature of speaking style in human speech is neglected. In this paper, we propose a multi-scale speaking style modelling method to capture and predict multi-scale speaking style for improving the naturalness and expressiveness of synthetic speech. A multi-scale extractor is proposed to extract speaking style embeddings at three different levels from the ground-truth speech, and explicitly guide the training of a multi-scale style predictor based on hierarchical context information. Both objective and subjective evaluations on a Mandarin audiobooks dataset demonstrate that our proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.