
Thierry Dutoit
- Professor
- University of Mons
About
- 341 Publications
- 105,649 Reads
- 8,098 Citations
Publications (341)
Spatial audio and 3-dimensional sound rendering techniques play a pivotal role in immersive audio experiences. Head-Related Transfer Functions (HRTFs) are acoustic filters which represent how sound interacts with an individual's unique head and ear anatomy. The use of HRTFs compliant with the subject's anatomical traits is crucial to en...
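For readers unfamiliar with HRTF-based rendering, the basic operation behind it is a pair of convolutions. A minimal sketch follows, assuming hrir_left and hrir_right are head-related impulse responses taken from some measured set (hypothetical placeholders, not this paper's data or method):

```python
# Minimal binaural rendering sketch: convolve a mono signal with a pair of
# head-related impulse responses (HRIRs) for one source direction.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Return a (n_samples, 2) stereo array from a mono signal and two HRIRs."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((n, 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out
```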
This paper presents a novel guitar dataset made of richly annotated guitarist improvisations. The annotations gather notes, playing techniques, instrument tuning, and audio effect configurations, as well as transcriptions of post-improvisation interviews. The dataset gathers ten hours of improvisations and around five hours of interviews. These acco...
In this paper, we study the controllability of an Expressive TTS system trained on a dataset for continuous control. The dataset is the Blizzard 2013 dataset, based on audiobooks read by a female speaker and containing great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The...
In this paper, we present open-source¹ tools that facilitate the use of controllable TTS systems in experiments, towards the democratization of TTS systems across domains. ICE-Talk is a web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for...
ICE-Talk is an open source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover it is implemented as a module that can be used as part of a Human-Agent interaction.
Despite the growing interest in expressive speech synthesis, synthesis of nonverbal expressions is an under-explored area. In this paper we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from...
This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basi...
Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across fou...
This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other well-known state-of-the-art methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptiv...
In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glotta...
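The ZZT principle described above can be roughly illustrated as follows (a sketch only, not the authors' implementation, which additionally requires GCI-synchronous windowing):

```python
# Sketch of the ZZT decomposition: the z-transform of a windowed speech frame
# is factored into its roots, split into a causal set (inside the unit circle)
# and an anticausal set (outside), the latter carrying the glottal open phase.
import numpy as np

def zzt_split(frame):
    """Return (causal_roots, anticausal_roots) of the frame's z-transform."""
    # Roots of x[0]*z^(N-1) + x[1]*z^(N-2) + ... + x[N-1]; assumes x[0] != 0.
    roots = np.roots(frame)
    causal = roots[np.abs(roots) < 1.0]
    anticausal = roots[np.abs(roots) >= 1.0]
    return causal, anticausal

def spectrum_from_roots(roots, n_freq=512):
    """Magnitude spectrum (up to a gain) of the factor built from these roots."""
    w = np.exp(1j * np.pi * np.arange(n_freq) / n_freq)  # upper unit half-circle
    mag = np.ones(n_freq)
    for r in roots:
        mag *= np.abs(w - r)
    return mag
```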
An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various speech processing techniques. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step for ensuring that such techniques behave correctly. This paper proposes...
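One crude polarity criterion, shown purely as an illustration (the paper's actual approach is more elaborate), assumes that the skewness of voiced speech tends to flip sign with polarity:

```python
# Illustrative polarity heuristic: sign of the waveform skewness.
# This is an assumed proxy, not the method proposed in the paper above.
import numpy as np
from scipy.stats import skew

def estimate_polarity(speech):
    """Return +1 or -1 depending on the sign of the signal's skewness."""
    return 1 if skew(np.asarray(speech, dtype=float)) >= 0 else -1
```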
It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes a...
In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools for allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of a...
In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay...
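Group delay, the phase-derived representation such features build on, can be computed without explicit phase unwrapping via a standard identity; a minimal sketch:

```python
# Group delay tau(w) = -d(phase)/dw, computed without unwrapping via
# tau(w) = (Xr*Yr + Xi*Yi) / |X|^2, where Y is the FFT of n*x[n].
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-10):
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```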
This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an extensive subjective test. The influence of the...
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the d...
This paper investigates the differences occurring in the excitation for different voice qualities. Its goal is two-fold. First, a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly, rules of modi...
This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in...
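As a hedged sketch of such a detection pipeline, MFCCs and an SVM stand in here for the paper's much larger feature set and its actual classifier choice (librosa and scikit-learn assumed available):

```python
# Illustrative audio-event detection pipeline: frame-aggregated MFCC features
# fed to a binary classifier (1 = cough, 0 = other sound).
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # one feature vector per segment

# X: (n_segments, 13) feature matrix, y: binary labels
# clf = SVC(kernel="rbf").fit(X, y)
# clf.predict(mfcc_features("clip.wav")[None, :])
```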
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relev...
In this paper, we review the datasets of emotional speech publicly available and their usability for state-of-the-art speech synthesis. This is conditioned by several characteristics of these datasets: the quality of the recordings, the quantity of the data and the emotional content contained in the data. We then present a dataset that was...
During the last few years, spoken language technologies have seen major improvements thanks to Deep Learning. However, Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. Particularly, modeling the variability in speech of different speakers, different styles or different emotions with little data remai...
Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on th...
This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are us...
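The codebook idea can be sketched minimally: residual frames normalized in length and energy are stored at training time, and the closest entry to a target frame is selected at synthesis time (Euclidean distance is an illustrative choice here, not necessarily the paper's selection criterion):

```python
# Sketch of a codebook of pitch-synchronous residual frames.
import numpy as np

def normalize(frame, target_len=256):
    """Resample a residual frame to a fixed length and unit energy."""
    x = np.interp(np.linspace(0, 1, target_len),
                  np.linspace(0, 1, len(frame)), frame)
    return x / (np.linalg.norm(x) + 1e-10)

def select_excitation(codebook, target_frame):
    """codebook: (n_entries, target_len) array of normalized residual frames."""
    t = normalize(target_frame, codebook.shape[1])
    dists = np.linalg.norm(codebook - t, axis=1)
    return codebook[np.argmin(dists)]
```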
The development of a system for the automatic, objective and reliable detection of cough events is a need that the medical literature has underlined for years. The benefit of such a tool is clear, as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this ta...
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In...
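The separation principle behind this line of work can be sketched as follows (illustrative only: the careful GCI-synchronous windowing stressed above, and linear-phase handling, are omitted):

```python
# Complex-cepstrum separation sketch: the anticausal (maximum-phase) part of
# the frame, linked to the glottal open phase, lives at negative quefrencies.
import numpy as np

def complex_cepstrum(frame, n_fft=4096):
    X = np.fft.fft(frame, n_fft)
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    return np.real(np.fft.ifft(log_X))

def anticausal_part(frame, n_fft=4096):
    c = complex_cepstrum(frame, n_fft)
    c_anti = np.zeros_like(c)
    c_anti[0] = c[0]
    c_anti[n_fft // 2:] = c[n_fft // 2:]  # negative quefrencies
    # Back to the time domain (the component appears circularly wrapped
    # at the end of the buffer).
    return np.real(np.fft.ifft(np.exp(np.fft.fft(c_anti))))
```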
The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annot...
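As a point of reference, the simplest family of techniques compared in such studies is autocorrelation-based; a baseline sketch (not one of the paper's evaluated systems), with a search range widened for singing voice:

```python
# Baseline autocorrelation pitch estimator for a single voiced frame.
import numpy as np

def autocorr_f0(frame, sr, f0_min=60.0, f0_max=1500.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / f0_max)
    lag_max = min(int(sr / f0_min), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag  # estimated F0 in Hz
```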
Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual....
The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM)...
The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art G...
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of...
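The first step can be sketched directly: a mean-based signal obtained by sliding a Blackman window whose length is tied to the mean pitch period (the 1.75*T0 factor is assumed here), and whose extrema delimit the expected intervals:

```python
# Mean-based signal for GCI interval detection (sketch of the first step).
import numpy as np

def mean_based_signal(speech, sr, mean_f0=120.0):
    T0 = sr / mean_f0                  # mean pitch period in samples
    N = int(round(1.75 * T0 / 2))      # half window length (assumed factor)
    w = np.blackman(2 * N + 1)
    w /= w.sum()
    return np.convolve(speech, w, mode="same")
```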
Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels...
As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an ove...
Skeleton-based human action recognition has recently drawn increasing attention thanks to the availability of low-cost motion capture devices, and accessibility of large-scale 3D skeleton datasets. One of the key challenges in action recognition lies in the high dimensionality of the captured data. In recent works, researchers draw inspiration from...
Among the various cues that help us understand and interact with our surroundings, depth is of particular importance. It allows us to move in space and grab objects to complete different tasks. Therefore, depth prediction has been an active research field for decades and many algorithms have been proposed to retrieve depth. Some imitate human visio...
Every spring, European forest soundscapes fill up with the drums and calls of woodpeckers as they draw territories and pair up. Each drum or call is species-specific and easily picked up by a trained ear. In this study, we worked toward automating this process and thus toward making the continuous acoustic monitoring of woodpeckers practical. We re...
As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an overvie...
The field of Text-to-Speech has experienced huge improvements in recent years, benefiting from deep learning techniques. Producing realistic speech is now possible. As a consequence, research on the control of expressiveness, allowing speech to be generated in different styles or manners, has attracted increasing attention lately. Systems able t...
In this paper a methodology to recognize actions based on RGB videos is proposed which takes advantage of the recent breakthroughs made in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on the transformation of skeletal motion data into 2D images. In this work, a solution is proposed requiri...
In this paper, we present our work on building a database of Nonverbal Conversation Expressions (NCE). In this study, these NCE consist of smiles, laughs, head and eyebrow movements. We describe our annotation scheme and explain our choices. We finally give inter-rater agreement results on a small part of the dataset.
This work focuses on laughter intensity level, the way it is perceived and suggests ways to estimate it automatically. In the first part of this paper, we present a laughter intensity database which is collected through online perception tests. Participants are asked to rate the overall intensity of laughs. Presented laughs are either audio only or...
In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, using both real...
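The simulation step referred to above is simple to sketch: mix clean annotated speech with noise at a chosen signal-to-noise ratio.

```python
# Mix clean speech with noise at a target SNR (in dB) for data augmentation.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    noise = np.resize(noise, clean.shape)           # loop/crop noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```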
During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) as a feature extract...
Motion capture allows accurate recording of human motion, with applications in many fields, including entertainment, medicine, sports science and human computer interaction. A common difficulty with this technology is the occurrence of missing data, due to occlusions, or recording conditions. Various models have been proposed to estimate missing da...
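One family of such estimation models is low-rank matrix completion over the frames-by-coordinates matrix; a hedged sketch of the principle (not the specific models evaluated in the paper):

```python
# Iterative low-rank completion: fill missing entries with a truncated-SVD
# reconstruction, keeping observed entries fixed, and repeat until stable.
import numpy as np

def lowrank_fill(data, mask, rank=5, n_iter=100):
    """data: (frames, dims); mask: True where a value was actually observed."""
    X = np.where(mask, data, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, data, X_hat)  # keep observed, update missing
    return X
```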
Matlab code.
Matlab script implementing the proposed method. Note that the maintained version of the code can be found in the GitHub repository: https://github.com/numediart/MocapRecovery.
In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes, so it could be suitable to build synthesis and voice transformation systems with the potential to...
In this article, we present a large 3D motion capture dataset of Taijiquan martial art gestures (n = 2200 samples) that includes 13 classes (relative to Taijiquan techniques) executed by 12 participants of various skill levels. Participants' levels were ranked by three experts on a scale of 0-10. The dataset was captured using two motion capture s...
This paper proposes a model for visual laughter generation by means of speaker-dependent training of Hidden Markov Models (HMMs). It is composed of the following parts: 1) facial and 2) head motions are modeled with separate HMMs, while 3) eye-blinks are added as a post-processing step on the generated eyelid trajectories.
The models are trai...
Dealing with speech corrupted by noise and reverberation is still an issue for automatic speech recognition. To address this, a solution that can be combined with multi-style learning consists of using multi-task learning, where the acoustic model is trained to solve one main task and at least one auxiliary task simultaneously. In noisy and reverbe...
Dealing with noise deteriorating the speech is still a major problem for automatic speech recognition. An interesting approach to tackle this problem consists of using multi-task learning. In this case, an efficient auxiliary task is clean-speech generation. This auxiliary task is trained in addition to the main speech recognition task and its goal...
In this paper we present the AmuS database, which contains about three hours of amused speech recorded from two male subjects and one female subject, in two languages: French and English. We review previous work on smiled speech and speech-laughs. We describe acoustic analysis on part of our database, and a perception test compari...
Among two-dimensional phase unwrapping algorithms, the quality-guided strategy offers one of the best trade-offs between speed and accuracy. It assigns a quality value, also called reliability, to each pixel and removes the wraps on a path that goes from the highest- to the lowest-quality region. By doing so, challenging regions are unwrapped last, a...
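The core of a quality-guided unwrapper fits in a few lines: a max-heap keyed on quality drives a flood fill, and each new pixel is unwrapped against an already-unwrapped neighbour (a sketch of the general strategy, not of this paper's specific contribution):

```python
# Quality-guided 2D phase unwrapping: process pixels from high to low quality,
# removing the nearest 2*pi multiple relative to an unwrapped neighbour.
import heapq
import numpy as np

def quality_guided_unwrap(phase, quality):
    h, w = phase.shape
    out = phase.astype(float).copy()
    done = np.zeros((h, w), dtype=bool)
    start = np.unravel_index(np.argmax(quality), (h, w))
    done[start] = True
    heap = [(-quality[start], start)]
    while heap:
        _, (i, j) = heapq.heappop(heap)
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w and not done[ni, nj]:
                d = phase[ni, nj] - out[i, j]
                out[ni, nj] = out[i, j] + d - 2 * np.pi * np.round(d / (2 * np.pi))
                done[ni, nj] = True
                heapq.heappush(heap, (-quality[ni, nj], (ni, nj)))
    return out
```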
In this paper, we present our work on analysis and classification of smiled vowels, chuckling (or shaking) vowels and laughter syllables. This work is part of a larger framework that aims at assessing the level of amusement in speech using the audio modality only. Indeed all of these three categories occur in amused speech and are considered to con...
In the recent domain of motion capture and analysis, a new challenge has been the automatic evaluation of skill in gestures. Many methods have been proposed for gesture evaluation based on feature extraction, skill modeling and gesture comparison. However, movements can be influenced by many factors other than skill, including morphology. All these...
In recent years, 3D skeleton-based action recognition has become a popular technique of action classification, thanks to development and availability of cheaper depth sensors. State-of-the-art methods generally represent motion sequences as high dimensional trajectories followed by a time warping technique. These trajectories are used to train a cl...
I-Vectors have been successfully applied in the speaker identification community in order to characterize the speaker and its acoustic environment. Recently, i-vectors have also shown their usefulness in automatic speech recognition, when concatenated to standard acoustic features. Instead of directly feeding the acoustic model with i-vectors, we...
Overfitting is a commonly met issue in automatic speech recognition and is especially impacting when the amount of training data is limited. In order to address this problem, this article investigates acoustic modeling through Multi-Task Learning, with two speaker-related auxiliary tasks. Multi-Task Learning is a regularization method which aims at...
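A minimal PyTorch sketch of the shared-trunk/two-head structure that Multi-Task Learning relies on (illustrative architecture and loss weight, not the paper's exact configuration):

```python
# Multi-task acoustic model sketch: shared trunk, a main state-classification
# head and an auxiliary speaker-classification head.
import torch
import torch.nn as nn

class MTLAcousticModel(nn.Module):
    def __init__(self, n_in, n_hidden, n_states, n_speakers):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(),
                                   nn.Linear(n_hidden, n_hidden), nn.ReLU())
        self.main_head = nn.Linear(n_hidden, n_states)    # main task
        self.aux_head = nn.Linear(n_hidden, n_speakers)   # auxiliary task

    def forward(self, x):
        z = self.trunk(x)
        return self.main_head(z), self.aux_head(z)

# Training combines both losses, e.g. (0.3 is an assumed weight):
# loss = ce(main_logits, state_targets) + 0.3 * ce(aux_logits, speaker_ids)
```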
In a previous work we developed an HMM-based TTS system for a Basque dialect spoken in southern France. We observed that French words, frequent in daily conversations, were not pronounced properly by the TTS system because the training corpus contained very few instances of some French phones. This paper reports our attempt to improve the pronuncia...
In this work, we experiment with the use of smiling and laughter in order to help create more natural and efficient listening agents. We present preliminary results on a system which predicts smile and laughter sequences in one dialogue participant based on observations of the other participant's behavior. This system also predicts the level of int...
Sound designers select the sounds they use among massive collections of recordings. They usually rely on text-based queries to narrow down a subset from these collections when looking for specific content. However, when it comes to unknown collections, this approach can fail to precisely retrieve files according to their content. We investigate an...
Drumming sounds are substantial clues when searching audio recordings for the presence of woodpeckers. Woodpeckers use drumming for territory defence and mate attraction to such an extent that some species have no vocalisations for these functions. This implies that drumming bears species markers. This hypothesis stands at the root of our project t...
In order to address the commonly met issue of overfitting in speech recognition, this article investigates Multi-Task Learning, when the auxiliary task focuses on speaker classification. Overfitting occurs when the amount of training data is limited, leading to an overly sensitive acoustic model. Multi-Task Learning is a method, among many other regul...
Affect bursts are short, isolated and non-verbal expressions of affect expressed vocally or facially. In this paper we present an attempt at synthesizing audio affect bursts on several levels of arousal. This work concerns 3 different types of affect bursts: disgust, startle and surprised expressions. Data are first gathered for each of these affec...
It has been shown that adding expressivity and emotional expressions to an agent's communication systems would improve the interaction quality between this agent and a human user. In this paper we present a multimodal database of affect bursts, which are very short non-verbal expressions with facial, vocal, and gestural components that are highly s...
This paper provides a short summary of the importance of taking into account laughter and smile expressions in Human-Computer Interaction systems. Based on the literature, we mention some important characteristics of these expressions in our daily social interactions. We describe some of our own contributions and ongoing work to this field.
We propose InspectorWidget, an open-source application to track and analyze users' behaviors in interactive software. The key contributions of our application are: 1) it works with closed applications that do not provide source code nor scripting capabilities; 2) it covers the whole pipeline of software analysis from logging input events to visual s...
Generalization is a common issue for automatic speech recognition. A successful method used to improve recognition results consists of training a single system to solve multiple related tasks in parallel. This overview investigates which auxiliary tasks are helpful for speech recognition when multi-task learning is applied on a deep learning based...
Laughter is everywhere. So much so that we often do not even notice it. First, laughter has a strong connection with humour. Most of us seek out laughter and people who make us laugh, and it is what we do when we gather together as groups relaxing and having a good time. But laughter also plays an important role in making sure we interact with each...
In this paper, we present our work on classification of smiled vowels, shaking vowels and laughter syllables. This work is part of a larger framework that aims at assessing the level of amusement in speech using only audio cues. Indeed all of these three categories occur in amused speech and are considered to express a different level of amusement....
Smile is not only a visual expression. When it occurs together with speech, it also alters its acoustic realization. Being able to synthesize speech altered by the expression of smile can hence be an important contributor for adding naturalness and expressiveness in interactive systems. In this work, we present a first attempt to develop a Hidden M...
In this paper, we present our work on speech-smile/shaking vowels classification. An efficient classification system would be a first step towards the estimation (from speech signals only) of amusement levels beyond smile, as indeed shaking vowels represent a transition from smile to laughter superimposed to speech. A database containing examples o...
In this paper, we address the problem of sensor-dependent gesture recognition thanks to an adaptation procedure. Capturing human movements by a motion capture (MoCap) system provides very accurate data. Unfortunately, such systems are very expensive, unlike recent depth sensors, like Microsoft Kinect, which are much cheaper, but provide lower data qua...
Modeling human attention has been arousing a lot of interest due to its numerous applications. The process that allows us to focus on the more important stimuli is defined as "attention". Seam carving is an approach to resize images or video sequences while preserving the semantic content. To define what is important, gradient was first used...
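The dynamic-programming core of seam carving, which attention-based variants reuse with a saliency map as the energy term, is a generic sketch worth spelling out:

```python
# Find the vertical seam of minimal cumulative energy. The energy map is
# gradient magnitude in the original method; saliency-based variants swap in
# an attention map instead.
import numpy as np

def min_vertical_seam(energy):
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for i in range(1, h):
        left = np.r_[np.inf, cost[i - 1, :-1]]   # cost[i-1, j-1]
        right = np.r_[cost[i - 1, 1:], np.inf]   # cost[i-1, j+1]
        cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
    seam = np.empty(h, dtype=int)
    seam[-1] = np.argmin(cost[-1])
    for i in range(h - 2, -1, -1):               # backtrack from the bottom
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam[i] = lo + np.argmin(cost[i, lo:hi])
    return seam  # one column index per row
```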
In this paper we propose synchronization rules between acoustic and visual laughter synthesis systems. Previous works have addressed separately the acoustic and visual laughter synthesis following an HMM-based approach. The need of synchronization rules comes from the constraint that in laughter, HMM-based synthesis cannot be performed using a unif...
In this work, we present a study dedicated to improve the speech-laugh synthesis quality. The impact of two factors is evaluated. The first factor is the addition of breath intake sounds after laughter bursts in speech. The second is the repetition of the word interrupted by laughs in the speech-laugh sentences. Several configurations are evaluated...
Saliency models are able to provide heatmaps highlighting areas in images which attract human gaze. Most of them are designed for still images but an increasing trend goes towards an extension to videos by adding dynamic features to the models. Nevertheless, only few are specifically designed to manage the temporal aspect. We propose a new model wh...
This paper presents a novel processing method for heart sound signal: the statistical time growing neural network (STGNN). The STGNN performs a robust classification by merging supervised and unsupervised statistical methods to overcome non-stationary behavior of the signal. By combining available preprocessing and segmentation techniques and the S...
This paper presents an HMM-based speech-smile synthesis system. In order to do that, databases of three speech styles were recorded. This system was used to study to what extent synthesized speech-smiles (defined as Duchenne smiles in our work) and spread-lips (speech modulated by spreading the lips) communicate amusement. Our evaluation results sh...
This paper presents an HMM-based synthesis approach for speech-laughs. The building stone of this project was the idea of the co-occurrence of smile and laughter bursts in varying proportions within amused speech utterances. A corpus with three complementary speaking styles was used to train the underlying HMM models: neutral speech, speech-smile,...
This paper provides a classification of calibration methods for cameras and projectors. From basic homography to complex geometric calibration methods, it aims at simplifying the choice of method for performing a calibration, given the complexity of the setup. The classical camera calibration methods are presented. A comparison gives th...
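The homography end of that spectrum reduces to the Direct Linear Transform; a minimal sketch (assuming at least four point correspondences):

```python
# DLT estimation of the homography H mapping src points to dst points.
import numpy as np

def homography_dlt(src, dst):
    """src, dst: (n, 2) arrays of matched points, n >= 4."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)        # null-space vector = last row of Vt
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```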
By detecting the presence or absence of vocal species, the study of soundscapes unveils information about ecosystems. The present work is an analysis of sound records collected at the beginning of July 2014 in the rural, publicly owned nature park of Chevetogne, Belgium. A continuous 24-hour window is focused on. The primary microphone is set up in...
The present invention relates to a method for coding the excitation signal of a target speech, comprising the steps of: extracting, from a set of training normalized residual frames, a set of relevant normalized residual frames, said training residual frames being extracted from a training speech, synchronized on Glottal Closure Instants (GCI), pitch an...
Sound designers source sounds in massive collections, heavily tagged by themselves and by sound librarians. For each query, once successive keywords have reached their limit in filtering down the results, hundreds of sounds are left to be reviewed. AudioMetro combines a new content-based information visualization technique with instant audio feedback to facilita...
SuperRare is a new object-oriented attention algorithm based on the notion of rarity: rare regions are worthy of attention. The main novelty of the model is to use superpixels of several sizes instead of simple pixels. This approach allows SuperRare to react efficiently to salient objects of any size. The model is validated on Jian Li's database,...