About
80 Publications
7,458 Reads
337 Citations
Introduction
Current institution
Additional affiliations
October 2005 - present
Publications (80)
Deep learning-based speech synthesis has significantly improved realistic audio deepfakes. Despite advanced techniques such as self-supervised learning (SSL) and datasets, current state-of-the-art (SOTA) detection systems fail in out-of-domain scenarios due to the inability to generalize. This work explores the generalization problem through compre...
Research on textual style transfer has observed that the concept of style can vary across domains. This research examines the encoding of style across the sentiment and formality domains and observes that formality appears to be more globally encoded, and sentiment more locally encoded. The work also shows how the encoding of a style can inform the...
Organic dysphonia can lead to vocal impairments. Recording patients’ impaired voice could allow them to use voice cloning systems. Voice cloning, being the process of producing speech matching a target speaker voice, given textual input and an audio sample from the speaker, can be used in such a context. However, dysphonic patients may only produce...
Audiobook readers play with their voices to emphasize some text passages, highlight discourse changes or significant events, or in order to make listening easier and entertaining. A dialog is a central passage in audiobooks where the reader applies significant voice transformation, mainly prosodic modifications, to realize character properties and...
Organic dysphonia can lead to vocal impairments. Recording patients’ impaired voice could allow them to use voice cloning systems. In the domain of speech synthesis, voice cloning is the process of producing speech matching a target speaker voice, given textual input and an audio sample from the speaker. It can achieve high-quality speech with only...
Unsupervised textual style transfer presupposes that style is a coherent and consistent concept and that style transfer approaches will generalise consistently across different domains of style. This paper explores whether this presupposition is appropriate for different types of style. We explore this question by comparing the performance and late...
Using TTS systems helps to reduce the cost of audiobook generation. This paper investigates the idea of mixing synthetic and recorded natural speech signals to control the trade-off between the overall quality of an audiobook and its production cost. First, fully synthetic signals and mixed synthetic and natural signals are compared perceptually u...
The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and recording script selection for an expressive TTS task is generally treated as an optimization problem in order to achieve a rich and parsimonious corpus. In order to vocalize a given boo...
Hybrid TTS systems generally try to optimise their cost function with the voice provided to generate the best signal. The voice is based on a speech corpus usually designed for a specific purpose. In this paper, we consider that the voice creation is realized through a corpus design step under reduction constraints. During this stage, a recording s...
In this study, we propose an approach for script selection in order to design TTS speech corpora. A Deep Convolutional Neural Network (DCNN) is used to project linguistic information to an embedding space. The embedded representation of the corpus is then fed to a selection process to extract a subset of utterances which offers a good linguistic co...
Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or provide unsupervised speech segment descriptions through embeddings. In this paper, we present four models with two of them enabling us to extract phone-level embeddings for unit selection speech synthesis. Three o...
This paper presents a new corpus, called EMO&LY (EMOtion and AnomaLY), composed of speech and facial video records of subjects that contains controlled anomalies. As far as we know, to study the problem of anomaly detection in discourse by using machine learning classification techniques, no such corpus exists or is available to the community. In E...
In the field of expressive speech synthesis, a lot of work has been conducted on suprasegmental prosodic features, while little has been done on pronunciation variants. However, prosody is highly related to the sequence of phonemes to be expressed. This article raises two issues in the generation of emotional pronunciations for TTS systems. The first i...
Currently, much work on expressive speech focuses on acoustic models and prosody variations. However, in expressive Text-to-Speech (TTS) systems, prosody generation strongly relies on the sequence of phonemes to be expressed and also on the words underlying these phonemes. Consequently, linguistic and phonetic cues play a significant role in the percept...
This paper presents an evaluation of three different anomaly detection methods over different feature sets. The three anomaly detectors are based respectively on a Gaussian Mixture Model (GMM), a One-Class SVM and an Isolation Forest. The considered feature sets are built from personality evaluation and audio signal. Personality evaluations are extracted f...
This paper presents the design of an anomaly detector based on three different sets of features, one corresponding to some prosodic descriptors and two extracted from Big Five traits. Big Five traits correspond to a simple but efficient representation of a human personality. They are extracted from a manual annotation while prosodic features are ex...
https://www.isca-speech.org/archive/pdfs/interspeech_2017/fayet17_interspeech.pdf
To bring more expressiveness into text-to-speech systems, this paper presents a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and prosodic features, and in using a pro...
Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. Hence, the TTS quality drops when phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is d...
Text-to-speech (TTS) systems are built on speech corpora which are labeled with carefully checked and segmented phonemes. However, phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are usually inconsistent with those from the corpus, thus leading to poor quality synthetic speech signals. To solve this problem...
In this paper, the rhythmic patterns observed in natural and synthesized speech are compared for three literary forms (rhymes, poems, and fairy tales). The aim of the comparison is to evaluate how rhythm could be improved in synthesized speech, which could allow adapting it to specific styles or genres. The study is based on the analysis of a corpu...
This paper compares the rhythmic features obtained in natural and in synthesized speech along three dimensions: the speech type (synthesized vs natural speech), the literary genre (rhymes, poems vs story telling), and the communication setting (speech addressed to children vs addressed to adults). The study is based on the analysis of duration patt...
Pronunciation adaptation consists in predicting pronunciation variants of words and utterances based on their standard pronunciation and a target style. This is a key issue in text-to-speech as those variants bring expressiveness to synthetic speech, especially when considering a spontaneous style. This paper presents a new pronunciation adaptation...
This article compares, along three dimensions, certain rhythmic features observed in natural and synthetic speech: the speech type (synthetic vs. natural speech), the literary genre (nursery rhymes, poems vs. stories), and the communicative setting (child-directed vs. adult-directed speech). The study is based on the analysis of...
Unit selection speech synthesis systems generally rely on target and concatenation costs for selecting a best unit sequence. These costs, though often considering contextual features, mainly include local distances that are accumulated afterwards. In this paper, we describe a new duration target cost that takes a whole sequence into account. It aim...
Subjective evaluation is a crucial problem in the speech processing community and especially for the speech synthesis field, no matter what system is used. Indeed, when trying to assess the effectiveness of a proposed method, researchers usually conduct subjective evaluations by randomly choosing a small set of samples, from the same domain, taken...
TTS voice building generally relies on a script extracted from a large text corpus while optimizing the coverage of linguistic and phonological events supposedly related to voice acoustic quality. Previous work has shown differences in objective measures between smartly reduced and random corpora, but not when subjective evaluations are performed....
This paper presents a preliminary study whose main aim is to characterize four distinct speaking styles according to a limited set of prosodic features, including the length of prosodic phrases (AP and IP), the distribution of stressed syllables, pitch register span, the duration of silent pauses, etc. The analysis was performed using semi-automati...
Speech synthesis systems usually use the Viterbi algorithm as a basis for unit selection, although it is not the only possible choice. In this paper, we study a speech synthesis system relying on the A* algorithm, which is a general pathfinding strategy developing a graph rather than a lattice. Using state-of-the-art techniques, we propose and analyz...
In this paper, we present an approach that allows a TTS system to dictate texts to primary school pupils while conforming to the prosodic features of this speaking style. The approach relies on the elaboration of a preprocessing prosodic module that avoids developing a specific system for such a limited task. The proposal is based on two...
Expressive speech processing is an important scientific problem, as expressivity introduces a lot of variability into speech. This variability degrades the performance of speech applications. Variations are reflected in the linguistic, phonological and acoustic sides of speech. However, our main interest is in phonology, more precisely the...
The aim of this paper is to present an algorithm that automatically segments a text into prosodic chunks for a dictation by conforming to the rules and procedures used in real settings to dictate a text to primary school children. A better understanding and modeling of these rules and procedures is crucial to develop robust automatic tools that could...
This paper presents a software library, namely ROOTS for Rich Object Oriented Transcription System, that helps to describe spoken messages in a coherent manner linking sequences of items on numerous levels (linguistic, phonological, or acoustic). The proposed representation is incremental and can thus describe any or all parts of an utterance. In o...
In the speech processing field, stylization of the fundamental frequency F0 has been the subject of numerous studies. Models proposed in the literature rely on knowledge stemming from phonology and linguistics. We propose an approach that deals with the issue of F0 curve stylization requiring as few linguistic assumptions as possibl...
Evaluation of prosody transformation systems is an important issue. First, the existing evaluation methodologies focus on parallel evaluation of systems and are not applicable to compare parallel and non-parallel systems. Secondly, these methodologies do not guarantee the independence from other features such as the segmental component. In particul...
The work presented in this thesis lies within the scope of prosody conversion and more particularly fundamental frequency conversion, which is considered a prominent factor in prosody processing. This document deals with the different steps necessary to build such a conversion system: stylization, clustering and conversion of melodic contour...
In a voice transformation context, prosody transformation using parallel corpora is quite unrealistic as such corpora are difficult and also expensive to build. Based on this observation, we propose an approach for transforming prosody using non-parallel corpora thanks to the MLLR adaptation strategy. This methodology is applied to the joint tran...
In a voice transformation context, prosody transformation using parallel corpora is quite unrealistic as such corpora are difficult and also expensive to build. Based on this observation, we propose an approach for transforming prosody using non-parallel corpora thanks to the MLLR adaptation strategy. This methodology is applied to the joint tran...
This article describes a new unsupervised methodology to learn F0 classes using HMM models on a syllable basis. An F0 class is represented by an HMM with three emitting states. The clustering algorithm relies on an iterative Gaussian splitting and EM retraining process. First, a single class is learnt on a training corpus (8000 syllables) and it is...
This article describes a new approach to estimate F0 curves using B-spline and Spline models characterized by a knot sequence and associated control points. The free parameters of the model are the number of knots and their location. The free-knot placement, which is an NP-hard problem, is done using a global MLE (Maximum Likelihood Estimation) wit...
This article describes a new approach to estimate F0 curves using a B-Spline model characterized by a knot sequence and associated control points. The free parameters of the model are the number of knots and their location. The free-knot placement, which is an NP-hard problem, is done using a global MLE within a simulated-annealing strategy. The...
This article describes a new approach to estimate F0 curves using a B-Spline model characterized by a knot sequence and associated control points. The free parameters of the model are the number of knots and their location. The free-knot placement, which is an NP-hard problem, is done using a global MLE within a simulated-annealing strategy. The o...
This article describes an F0 curve estimation scheme based on a B-spline model. We compare this model with a more classical spline representation. The free parameters of both models are the number of knots and their location. An optimal location is proposed using a simulated annealing strategy. Experiments on real F0 curves confirm the adequacy and g...