Figure 2 - uploaded by Branislav Gerazov
Example Praat annotation of a French utterance: "Son bagou pourrait faciliter la communauté." containing functional contours: declaration (DC), dependency to the left/right (DG/DD), and cliticisation (XX). 

Contexts in source publication

Context 1
... the SFC are only a small part of the PySFC ecosystem, with the majority of code going into the necessary tools for working with the data, and to a lesser extent into the plotting functionalities. Currently, PySFC supports the proprietary SFC fpro file format as well as standard Praat TextGrid annotations. An example Praat annotation is shown in Fig. 2. The levels that are important for the PySFC are the phonetic (PHON) and syllable (SYLL) interval tiers, which are used for determining the vocalic nuclei and RU boundaries, and the linguistic function tiers. The latter are point tiers in which each function's scope is marked by a start and end point (":FF") and the anchor RU is ...
Context 2
... and RU boundaries, and the linguistic function tiers. The latter are point tiers in which each function's scope is marked by a start and end point (":FF") and the anchor RU is marked with a landmark (":" plus the function type, e.g. ":DC"). If the function has no left or right context, the landmark itself delimits the scope, e.g. for DC. In Fig. 2, there is a PHRASE tier that holds the attitude, NIV1 with an overlapped dependency to the left (DG) and a dependency to the right (DD), and NIV2 with three clitic (XX) contours. The PySFC decomposition of the example annotated here is shown later in Fig. ...
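The point-tier convention described above can be sketched in code. This is a hypothetical illustration, not PySFC's actual parser: it assumes each contour on a tier is encoded as an optional ":FF" start point, one landmark (":" plus the function type), and an optional ":FF" end point, with no points shared between consecutive contours.

```python
def parse_function_tier(points):
    """Parse one linguistic function point tier into contour scopes.

    points: ordered (time, label) pairs, where ":FF" delimits a scope
    and ":" + function type (e.g. ":DC") marks the anchor RU.
    Returns a list of dicts with the function type, anchor time, and
    (start, end) scope. A landmark with no surrounding ":FF" points
    delimits the scope by itself, as for DC above.
    """
    contours, i = [], 0
    while i < len(points):
        start = None
        if points[i][1] == ":FF":           # scope opens before the anchor
            start, i = points[i][0], i + 1
        anchor, label = points[i]           # landmark point
        i += 1
        end = anchor                        # default: landmark delimits scope
        if i < len(points) and points[i][1] == ":FF":
            end, i = points[i][0], i + 1    # scope closes after the anchor
        contours.append({"type": label.lstrip(":"),
                         "anchor": anchor,
                         "scope": (anchor if start is None else start, end)})
    return contours
```

For example, a tier holding a DG contour spanning 0.10-1.20 s with its anchor at 0.80 s, followed by a bare DC landmark at 1.90 s, yields two contours with scopes (0.10, 1.20) and (1.90, 1.90).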

Citations

... Modelling speech intonation and associated F0 contours is a challenging task that has been faced in the past decades for a variety of speech applications: text-to-speech, voice identity conversion, and speech emotion conversion, among others. The representation of such F0 variations is a challenging task for at least two main reasons: first, the F0 sequence corresponding to a speech signal is discontinuous by nature: F0 values are only defined over speech segments that are voiced, and undefined otherwise; second, the F0 varies over multiple time scales associated with pre-defined linguistic units (e.g., syllable, phrase) or with latent units [3]. Accordingly, a number of models have been proposed to model F0 variations: 1) Basically, as a linear sequence of F0 values defined at each time step, either from discontinuous raw F0 values or from continuous interpolated F0 values over voiced instants. ...
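The second representation mentioned above, a continuous contour interpolated over voiced instants, can be sketched as follows. This is a minimal illustration, assuming a frame-wise F0 array in which unvoiced frames are marked with 0:

```python
import numpy as np

def interpolate_f0(f0, unvoiced_value=0.0):
    """Make a discontinuous F0 track continuous by linear interpolation
    over unvoiced frames.

    f0: 1-D array of frame-wise F0 values, with `unvoiced_value` marking
    unvoiced frames. Edges are held at the nearest voiced value.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 != unvoiced_value
    if not voiced.any():
        return f0.copy()                    # nothing to interpolate from
    idx = np.arange(len(f0))
    # np.interp fills between voiced frames and clamps at the edges
    return np.interp(idx, idx[voiced], f0[voiced])
```

For instance, `interpolate_f0([0, 100, 0, 0, 130, 0])` fills the unvoiced gap linearly (110, 120) and extends the edge values outward.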
Conference Paper
Full-text available
Voice interfaces are becoming wildly popular and driving demand for more advanced speech synthesis and voice transformation systems. Current text-to-speech methods produce realistic sounding voices, but they lack the emotional expressivity that listeners expect, given the context of the interaction and the phrase being spoken. Emotional voice conversion is a research domain concerned with generating expressive speech from neutral synthesised speech or natural human voice. This research investigated the effectiveness of using a sequence-to-sequence (seq2seq) encoder-decoder based model to transform the intonation of a human voice from neutral to expressive speech, with some preliminary introduction of linguistic conditioning. A subjective experiment conducted on the task of speech emotion recognition by listeners successfully demonstrated the effectiveness of the proposed sequence-to-sequence models to produce convincing voice emotion transformations. In particular, conditioning the model on the position of the syllable in the phrase significantly improved recognition rates.
... The syllable pitch contour submodel differs from conventional pitch decomposition models, such as the superposition of functional contours (SFC) model [63] and its new version, the Python implementation of the SFC (PySFC) [64]. The submodel uses both linguistic features and prosodic tags as its AFs and thus involves underlying forms of both the syntactic structure of text and the prosodic structure of utterance. The automatic labeling of break performed by the modified PLM algorithm provides prosodic structure information, which is used to solve the problem of mismatch between syntactic and prosodic structures. ...
Article
In this paper, a hierarchical prosody model (HPM)-based method for Mandarin spontaneous speech is proposed. First, an HPM is designed for describing relations among acoustic features of utterances, linguistic features of texts, and prosodic tags representing the underlying hierarchical prosodic structures of utterances. Subsequently, a sequential optimization algorithm is employed to train the HPM based on a large conversational speech corpus, the Mandarin Conversational Dialogue Corpus (MCDC), which features orthographic transcriptions and prosodic event annotations. In this unsupervised training method, all utterances of the MCDC are labeled with two types of prosodic tags, namely, break and prosodic states, automatically and simultaneously. After training, the HPM parameters are examined to identify critical prosodic properties of Mandarin spontaneous speech, which are then compared with their counterparts in the read-speech HPM. The prosodic tags on the studied utterances enable mapping of various prosodic events onto the hierarchical prosodic structures of the utterances. Prosodic analyses of some disfluent events are conducted using the prosodic tags affixed to the MCDC. Finally, an application of the HPM to assist in Mandarin spontaneous-speech recognition is discussed. Significant relative error rate reductions of 9.0%, 9.2%, 15.6%, and 7.3% are obtained for base-syllable, character, tone, and word recognition, respectively.
... The example shows the extracted elementary contours for the annotated linguistic functions: declaration (DC), dependency to the left/right (DG/DD), and cliticisation (DV, XX). Decomposition was done using the PySFC system, and the figures are taken from (Gerazov and Bailly, 2018). ...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generators (CGs). The Weighted SFC (WSFC) model is an extension to the SFC that can capture the prominence of each functional contour in the final prosody. Finally, the recently proposed Variational Prosody Model (VPM) is able, in addition, to capture a part of the functional contours' variance. Its variational CGs (VCGs) use the linguistic context input to map out a prosodic latent space for each contour. Here we propose an extension on the VPM based on variance embedding and recurrent neural network contour generators (VRCGs). This approach decouples the prosodic latent space from the length of the contour's scope, thus it can now be readily explored even for longer contours.
... The built-in TTS system (named COMPOST) was first designed by Alissali et al [AB93] at GIPSA-lab. It controls several processing stages (text preprocessing, morphological analyzer, part-of-speech tagger, letter-to-sound pronunciation, prosody generation and corpus- [GB18]; [GBX18]). ...
Thesis
A socially assistive robot (SAR) is meant to engage people in situated interaction such as monitoring physical exercise, neuropsychological rehabilitation or cognitive training. While the interactive behavioral policies of such systems are mainly hand-scripted, we discuss here key features of the training of multimodal interactive behaviors in the framework of the SOMBRERO project. In our work, we used learning by demonstration in order to provide the robot with adequate skills for performing collaborative tasks in human centered environments. There are three main steps of learning interaction by demonstration: we should (1) collect representative interactive behaviors from human coaches; (2) build comprehensive models of these overt behaviors while taking into account a priori knowledge (task and user model, etc.); and then (3) provide the target robot with appropriate gesture controllers to execute the desired behaviors. Multimodal HRI (Human-Robot Interaction) models are mostly inspired by Human-Human interaction (HHI) behaviors. Transferring HHI behaviors to HRI models faces several issues: (1) adapting the human behaviors to the robot's interactive capabilities with regards to its physical limitations and impoverished perception, action and reasoning capabilities; (2) the drastic changes of human partner behaviors in front of robots or virtual agents; (3) the modeling of joint interactive behaviors; (4) the validation of the robotic behaviors by human partners until they are perceived as adequate and meaningful. In this thesis, we study and make progress over those four challenges. In particular, we solve the two first issues (transfer from HHI to HRI) by adapting the scenario and using immersive teleoperation. In addition, we use Recurrent Neural Networks to model multimodal interactive behaviors (such as speech, gaze, arm movements, head motion, backchannels) that surpass traditional methods (Hidden Markov Model, Dynamic Bayesian Network, etc.)
in both accuracy and coordination between the modalities. We also build and evaluate a proof-of-concept autonomous robot to perform the tasks.
... An example SFC decomposition of the intonation of a French utterance is shown in the left plot in Fig. 3, where we can see a declaration contour overlapped with one left dependency between the verbal group and its subject, one right dependency between the verbal phrase and its direct object, and two clitic contours cueing articles. The decomposition was performed using the PySFC 2 prosody analysis system [11]. ...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural network contour generators using analysis-by-synthesis. Each generator is responsible for computing multiparametric contours that encode one given linguistic, paralinguistic and non-linguistic information on a variable scope of rhythmic units. The contributions of all generators' outputs are then overlapped and added to produce the prosody of the utterance. We propose an extension of the contour generators that allows them to model the prominence of the elementary contours based on contextual information. WSFC jointly learns the patterns of the elementary multiparametric functional contours and their weights dependent on the contours' contexts. The experimental results show that the proposed weighted SFC (WSFC) model can successfully capture contour prominence and thus improve SFC modelling performance. The WSFC is also shown to be effective at modelling the impact of attitudes on the prominence of functional contours cuing syntactic relations in French, and that of emphasis on the prominence of tone contours in Chinese.
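The overlap-and-add step described in the abstract above can be sketched directly. This is a simplified illustration, not the SFC/WSFC implementation: the neural network contour generators are out of scope, so elementary contours are given as plain per-unit value lists, and `weight` stands in for the WSFC prominence weights.

```python
import numpy as np

def superpose(n_units, contributions):
    """Overlap-and-add elementary functional contours into one utterance
    contour.

    n_units: number of rhythmic units in the utterance.
    contributions: list of (start_unit, contour, weight) triples, where
    `contour` holds one value per rhythmic unit in the function's scope.
    """
    total = np.zeros(n_units)
    for start, contour, weight in contributions:
        contour = np.asarray(contour, dtype=float)
        # add this contour's weighted contribution over its scope
        total[start:start + len(contour)] += weight * contour
    return total
```

For example, a declaration contour over all four units overlapped with a half-weighted two-unit contour starting at unit 1 simply sums wherever the scopes overlap.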
... It has been successfully used to model different linguistic levels, including: attitudes [6], grammatical dependencies [7], cliticisation [5], focus [8], as well as tones in Mandarin [9]. The experiments were carried out using PySFC, which is a prosody analysis system based on the SFC model, implemented in the scientific Python ecosystem [10]. The system has been licensed as free software and is available on GitHub 1 . ...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the effect, on the modelling of prosody, of including the neighbouring syllables in the scope of the tone functional contours in spoken Chinese. Our results show that significant improvements in modelling tone functional contours are obtained by including the right syllable in the scope, but not the left one. We thus show that there is a larger carry-over effect for Chinese tones in contrast to an anticipatory one. This finding is in line with the established state-of-the-art.