About
71 Publications
20,247 Reads
710 Citations (since 2017)
Introduction
Research Scientist and Principal Investigator in Audio, Speech, and Social Signal Processing.
Publications (71)
In this paper, we present an audio-visual emotion conversion method based on deep learning for a 3D talking head. The technology aims at retargeting neutral facial and speech expressions into emotional ones. The challenging issues are how to control the dynamics and variations of different expressions in both speech and the face. The controllability of facial ex...
Affective social multimedia computing is an emergent research topic for both affective computing and multimedia research communities. Social multimedia is fundamentally changing how we communicate, interact, and collaborate with other people in our daily lives. Social multimedia contains much affective information. Effective extraction of affective...
In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The...
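As a rough illustration of the multi-task objective described above, the sketch below combines a conventional acoustic regression loss with an adversarial loss. The generator, discriminator, and the weighting term lambda_adv are illustrative assumptions, not the architecture from the paper.

```python
# Hedged sketch: acoustic (MSE) loss plus GAN discriminative loss under a
# multi-task objective, as commonly done for GAN-based SPSS. Names and the
# lambda_adv weight are illustrative assumptions.
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def generator_loss(generator, discriminator, linguistic_feats, natural_acoustic, lambda_adv=0.1):
    generated = generator(linguistic_feats)            # predicted acoustic features
    acoustic_loss = mse(generated, natural_acoustic)   # traditional SPSS regression loss
    logits = discriminator(generated)                  # discriminator judges generated frames
    adv_loss = bce(logits, torch.ones_like(logits))    # try to fool the discriminator
    return acoustic_loss + lambda_adv * adv_loss       # multi-task combination
```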
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenge...
This paper presents the techniques used in our contribution to Emotion Recognition in the Wild 2016’s video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear & disgust) and neutral. Compared to earlier years’ movie-based datasets, this year’s test dataset introduced realit...
Voice conversion (VC) aims to make one speaker's (source) voice sound as if it were spoken by another speaker (target) without changing the language content. Most state-of-the-art voice conversion systems focus only on timbre conversion. However, speaker identity is also characterized by source-related cues such as fundamental frequency and energy as w...
Automatic detection of pathological voice is a challenging task in speech processing. Appropriate acoustic cues of voice can be used to differentiate between normal voices and pathological voices. We propose a method to represent each speech utterance using three types of speech signal representations (i.e., cross-correlation matrix, Gaussian distr...
To convert one speaker's voice to another's, the mapping of corresponding speech segments from the source speaker to the target speaker must first be obtained. In parallel voice conversion, the dynamic time warping (DTW) method is normally used to align the signals of the source and target voices. However, for conversion between non-parallel speech data, the DTW b...
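For reference, a minimal dynamic time warping alignment over parallel source/target feature frames might look like the following; the function and variable names are illustrative and not taken from the paper.

```python
# Hedged sketch: classic DTW alignment of parallel source/target feature
# sequences (e.g., spectral frames). Illustrative only.
import numpy as np

def dtw_align(src, tgt):
    """Return index pairs aligning src (N x D) frames to tgt (M x D) frames."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # frame-wise distances
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```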
Nonnegative matrix factorization (NMF) is a popular method for source separation. In this paper, an alternating direction method of multipliers (ADMM) for NMF is studied, which solves the NMF problem with the beta-divergence cost function. Our study shows that this algorithm outperforms state-of-the-art algorithms on synthetic data sets, bu...
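For context, a minimal NMF baseline with the standard multiplicative updates (the Euclidean, beta = 2 case) is sketched below; note this is the common baseline, not the ADMM algorithm studied in the paper.

```python
# Hedged sketch: standard multiplicative-update NMF for the Euclidean case
# (beta = 2). Not the ADMM algorithm from the paper; names are illustrative.
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9):
    """Factor a nonnegative matrix V (F x T) as W (F x rank) @ H (rank x T)."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H
```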
This paper presents a representation of the fundamental frequency (F0) based on the continuous wavelet transform (CWT) for prosody modeling in emotion conversion. Emotion conversion aims at converting speech from one emotional state to another. Specifically, we use the CWT to decompose F0 into a five-scale representation that corresponds to five temporal scales...
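A minimal sketch of such a five-scale CWT decomposition of a (log-)F0 contour, assuming PyWavelets and a Mexican hat mother wavelet with octave-spaced scales (illustrative choices, not details from the paper):

```python
# Hedged sketch: five-scale CWT decomposition of an interpolated log-F0 contour.
# The wavelet, scales, and helper name are illustrative assumptions.
import numpy as np
import pywt

def f0_five_scale_cwt(log_f0, base_scale=2.0):
    """log_f0: 1-D array of interpolated log-F0 values (one per frame)."""
    scales = base_scale * (2.0 ** np.arange(5))   # five octave-spaced scales
    coeffs, _ = pywt.cwt(log_f0, scales, 'mexh')  # shape: (5, len(log_f0))
    return coeffs                                  # one row per temporal scale
```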
This paper proposes a real-time variable-Q non-stationary Gabor transform (VQ-NSGT) system for speech pitch shifting. The system allows for variable-Q (VQ) time-frequency representations of speech with perfect reconstruction and computational efficiency. The proposed VQ-NSGT phase vocoder can be used for pitch shifting by simple frequency trans...
Non-negative matrix factorization (NMF) aims at finding non-negative representations of non-negative data. Among different NMF algorithms, the alternating direction method of multipliers (ADMM) is a popular one with superior performance. However, we find that ADMM shows instability and inferior performance on real-world data such as speech signals. In thi...
In this paper, we address the problem of phase retrieval to recover a signal from the magnitude of its Fourier transform. In many applications of phase retrieval, the signals encountered are naturally sparse. In this work, we consider the case where the signal is sparse under the assumption that few components are nonzero. We exploit further the sp...
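A minimal sketch of the general idea (alternating projections between the measured Fourier magnitudes and a k-sparsity constraint), which is illustrative only and not the specific algorithm of the paper:

```python
# Hedged sketch: Fienup-style alternating projections for phase retrieval with a
# hard sparsity constraint. Illustrative, not the paper's method.
import numpy as np

def sparse_phase_retrieval(fourier_mag, k, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(len(fourier_mag))        # random initialization
    for _ in range(n_iter):
        X = np.fft.fft(x)
        X = fourier_mag * np.exp(1j * np.angle(X))   # enforce measured magnitudes
        x = np.real(np.fft.ifft(X))
        idx = np.argsort(np.abs(x))[:-k]             # all but the k largest entries
        x[idx] = 0.0                                 # enforce k-sparsity
    return x
```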
In this paper, we propose linear pre- and post-filters for improving the perceptual speech quality of a vocoder based on the amplitude spectrum of the residual signal. The goal of the pre-filter in the analyzer is to make the noise in the spectrum inaudible. It attempts to hide the noise under the signal spectrum by exploiting human auditory masking. The post-filt...
Emotional facial expression transfer involves sequence-to-sequence mappings from a neutral facial expression to an emotional one, which is a well-known problem in computer graphics. In the graphics community, currently considered methods are typically linear (e.g., methods based on blendshape mapping) and the dynamical aspects of...
The Nyström method is an efficient technique for scaling kernel learning to very large data sets with millions of samples or more. Instead of computing the full kernel matrix, it approximates a kernel learning problem with a linear prediction problem. We propose an ensemble Nyström method for high-dimensional prediction of conflict level from speech. The experi...
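A minimal sketch of the basic (non-ensemble) Nyström approximation with an RBF kernel and randomly sampled landmarks, as an illustration of the underlying idea:

```python
# Hedged sketch: Nyström approximation K ≈ C W^+ C^T using m random landmarks.
# The kernel choice and sampling scheme are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom_approx(X, m=100, gamma=0.1, seed=0):
    """Approximate the n x n kernel matrix of X using m random landmark rows."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    C = rbf_kernel(X, landmarks, gamma)          # n x m cross-kernel block
    W = rbf_kernel(landmarks, landmarks, gamma)  # m x m landmark kernel block
    return C @ np.linalg.pinv(W) @ C.T           # low-rank approximation of K
```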
In this paper, we present a method to improve the classification recall of a deep Boltzmann machine (DBM) on the task of emotion recognition from speech. The task involves the binary classification of four emotion dimensions, namely arousal, expectancy, power, and valence. The method consists of dividing the features of the input data into separate...
Separating the leading voice from a music mixture remains challenging for automatic systems. Competing harmonics from the music accompaniment severely interfere with leading voice estimation. To properly extract the leading voice, separation algorithms based on source-filter modeling of the human voice and non-negative matrix factorization have been introduced....
High-quality natural singing is appealing to everyone. However, singing well is not trivial. Professional singers train and practise for years to acquire various singing skills, such as a sense of rhythm, precision in pitch, vocal control, and personal styling. We, the Institute for Infocomm Research (I2R) in Singapore, have developed a personaliz...
Pathological speech usually refers to voice disorders resulting from atypicalities in the voice and/or in the articulatory mechanisms due to disease, illness, or other physical problems in the speech production system. It may increase unhealthy social behavior and voice abuse, and dramatically affect patients' quality of life. Therefore, automati...
In this paper, we explore Dynamic Gaussian Processes (DGP) based learning techniques for voice conversion. In particular, we propose to use dynamic squared exponential GP with sparse partial least squares (SPLS) technique to model nonlinearities as well as to capture the dynamics in the source data. The concatenation of previous and next frames can...
In this paper, we compare the performance of support vector machine (SVM) and asymmetric SIMPLS classifiers in emotion recognition from naturalistic dialogues. These two classifiers are evaluated on the SEMAINE corpus that involves emotional binary classification tasks of four dimensions, namely, activation, expectation, power, and valence. The exp...
Voice disorders could increase unhealthy social behavior and voice abuse, and dramatically affect the patients' quality of life. Therefore, automatic intelligibility detection of pathological voices has an important role in the opportune treatment of pathological voices. This paper aims at designing an intelligibility detection system which is char...
The transformation of vocal characteristics aims at modifying the voice such that the intelligibility of aphonic voice is increased, or such that the voice of a speaker (source speaker) is perceived as if another speaker (target speaker) had uttered it. In this paper, the current state-of-the-art voice characteristics transformation methodolog...
This paper describes a Speaker State Classification System (SSCS) for the INTERSPEECH 2011 Speaker State Challenge. Our SSCS for the Intoxication and Sleepiness Sub-Challenges uses fusion of several individual sub-systems. We make use of the three standard feature sets per corpus given by the organizers. Modeling is based on our own developed classif...
This paper investigates the performance of objective speech and audio quality measures for the prediction of perceived sound quality of synthetic speech. A number of existing quality measures have been applied to synthetic speech generated by different speech synthesizers such as the LP synthesizer, HSM synthesizer, STRAIGHT synthesizer and several H...
This paper aims at evaluating the performance of a “Lombard effect model” for improving speech intelligibility over a telephone channel. It is well known that the naturalness and intelligibility of speech degrade rapidly in communication channels, such as phone networks or public address systems. To reduce the degradation, a “Lombard effect mimickin...
Seeing that speakers increase the intensity of their voice when speaking in loud noise (Lombard effect), this paper proposes a speech transformation approach to mimic this Lombard effect for improving the intelligibility of speech in noisy environments. The approach attempts to simulate the variations of duration, formant frequencies, formant ban...
This paper presents high-level strategies for controlling emotional speech morphing algorithms. Emotion morphing is realized by representing the acoustic features in a time-frequency plane that is warped and modified to generate natural morphed emotional speech. It is desirable to decompose these acoustic features into a multidimensional space a...
This paper describes a method to increase speech intelligibility when the speech signal is being transmitted over telephone lines. In order to detect all factors which affect speech intelligibility, we use the telephone simulation tool in the ITU-T Software Tools Library release 2005 (STL2005) to identify the most problematic telephone-channel deterioratio...
This paper describes I2R's submission to the Blizzard Challenge 2009. This is our second time participating in this challenge. In this paper, we will describe our main approach to building the required voices. We will introduce the procedure of database processing, the definitions of the acoustic, prosodic and linguistic parameters, the components...
In this paper, we use a stochastic fixed-point theorem to study the stochastic convergence properties (in the mean-square sense) of the cascaded LMS predictor, including conditions on the step size for convergence of the adaptive algorithm and the misadjustment. An analytic expression for the misadjustment is derived for Gaussian statistical signals and sh...
In a low bit-rate audio transform coding system, the quantization noise is spread over the entire analysis block in the time domain after the inverse transform in the decoder. This increases the likelihood of pre-echoes occurring prior to a transient event. In order to control the pre-echo level, a suitable quantizer should be applied in the...
In this paper, we use a stochastic fixed-point theorem to analyze the stochastic convergence properties of the cascaded RLS-LMS prediction filter in terms of conditions of convergence and the misadjustment. It is shown that the cascaded RLS-LMS prediction filter converges to almost the same optimal solution of the conventional RLS filter. The misad...
This paper describes the cascaded recursive least squares-least mean squares (RLS-LMS) predictor, which is part of the recently published MPEG-4 Audio Lossless Coding international standard. The predictor consists of cascaded stages of simple linear predictors, with the prediction error at the output of one stage passed to the next stage as the inpu...
It is well known that most adaptive filtering algorithms are developed based on the methods of least mean squares or of least squares. The popular adaptive algorithms such as the LMS, the RLS, and their variants have been developed for different applications. In this paper, we propose to use a maximum a posteriori (MAP) probability approach to estim...
This paper addresses the problem of adaptively optimizing a two-channel lossless finite-impulse-response (FIR) filter bank, which finds application in subband coding and wavelet signal analysis. Instead of using a gradient descent procedure (with its inherent problem of becoming trapped in local minima of a nonquadratic cost function), two eigenstructu...
We study the performance of an FIR cascade structure for adaptive linear prediction, in which each FIR filter stage is independently adapted using an LMS algorithm. The performance bound is derived for the cascade LMS predictor under some assumptions. We discover that it is possible for this bound to be better than that of the linear predictive cod...
An FIR cascade structure for adaptive linear prediction is studied in which each stage FIR filter is independently adapted using an LMS algorithm. Theoretical analysis shows that the cascade performs linear prediction by successive refinement, with each stage trying to obliterate the dominant mode of its input. Experimental results show tha...
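A minimal sketch of such a cascade, where each stage runs a normalized LMS predictor on the residual left by the previous stage (filter orders, step size, and the NLMS normalization are illustrative choices, not details from the papers):

```python
# Hedged sketch: cascade of adaptive linear predictor stages, each stage
# refining the residual of the previous one. Normalized LMS is used here for
# numerical robustness; parameters are illustrative.
import numpy as np

def cascade_nlms_predict(x, n_stages=3, order=8, mu=0.5, eps=1e-8):
    """Return the final prediction residual after passing x through the cascade."""
    residual = np.asarray(x, dtype=float)
    for _ in range(n_stages):
        w = np.zeros(order)
        out = np.zeros_like(residual)
        for n in range(order, len(residual)):
            past = residual[n - order:n][::-1]         # most recent samples first
            pred = w @ past                            # stage prediction of current sample
            err = residual[n] - pred                   # prediction error of this stage
            w += mu * err * past / (past @ past + eps) # normalized LMS weight update
            out[n] = err
        residual = out                                 # next stage refines the residual
    return residual
```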
In this paper, we study the issue of modeling high-sampling-rate audio for lossless audio coding. We propose a cascade LMS structure to successfully model all high-sampling-rate audio signals. This cascade-structure predictor not only performs better than its FIR linear prediction coding (LPC) counterpart in modeling general audio si...
We analyze the effect of both correlated and uncorrelated input signals on the performance of a cascade RLS-LMS based adaptive linear prediction algorithm for lossless compression of different resolution audio signals. We show that the sensitivity of the cascade RLS-LMS predictor to the input signal depends mainly on that of the RLS pre-whitener. T...
In this paper, a new approach to pitch detection for spontaneous speech in noisy environments is proposed. It uses a multi-rate lossless FIR filter to build a model which connects acoustic tubes and the wavelet transform. An "on-line" eigen-structure algorithm is used to control the parameters of the model based on the state description of the physical sys...
In this paper, an efficient implementation of the forward and inverse MDCT is proposed for even-length MDCTs. The algorithm uses the discrete cosine transform of type II (DCT-II) to compute the forward MDCT and its inverse (DCT-III) to compute the inverse MDCT. The lifting scheme is used to approximate the multiplications appearing in the MDCT lattice struc...
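For reference, a direct (non-fast) MDCT/IMDCT pair following the textbook definition is sketched below; it is useful for checking a fast DCT-II-based implementation but is not the lifting-based lattice structure from the paper, and scaling conventions vary across implementations.

```python
# Hedged sketch: direct reference MDCT/IMDCT (textbook definition), not the
# fast DCT-II/lifting algorithm from the paper.
import numpy as np

def mdct(x):
    """Forward MDCT: block x of length 2N -> N coefficients."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ x

def imdct(X):
    """Inverse MDCT: N coefficients -> 2N samples (overlap-add of adjacent
    blocks is still required for perfect reconstruction)."""
    N = len(X)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)
    return (1.0 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ X
```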
In this paper, four frame synchronisation algorithms for OFDM systems are examined for preamble-based time synchronisation. However, the performance of frame synchronisation depends on the pre-defined threshold. The performance of these algorithms is analysed in terms of the estimation mean and variance of the optimum frame position. The threshold can...
A novel echo embedding technique is proposed to overcome the inherent trade-off between inaudibility and robustness in conventional echo hiding. It makes use of a masking model to embed two closely located echoes, one by a positive pulse and one by a negative pulse, with high energy into host audio signals. Subjective listening tests show that the proposed method could i...
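A minimal sketch of embedding one bit per frame with a closely spaced positive/negative echo pair (delays, amplitude, and frame handling are illustrative assumptions; the masking-model-based shaping from the paper is not modeled here):

```python
# Hedged sketch: echo-hiding style embedding of one watermark bit per frame
# using a positive/negative echo pair. Parameters are illustrative.
import numpy as np

def embed_bit(frame, bit, alpha=0.3, d0=100, d1=150, gap=4):
    """Embed one bit into an audio frame; the echo delay encodes the bit value."""
    delay = d1 if bit else d0
    echoed = np.zeros_like(frame, dtype=float)
    echoed[delay:] += alpha * frame[:-delay]                  # positive echo
    echoed[delay + gap:] -= alpha * frame[:-(delay + gap)]    # closely located negative echo
    return frame + echoed
```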
An adaptive and content-based audio watermarking system based on echo hiding is presented. The proposed system aims to overcome the problems of audible echoes in a simple echo hiding approach. Audibility of echoes is reduced especially for problematic signals by the application of signal-based attenuation, psycho-acoustic model and perceptual filte...
In this paper, we present the implementation of the MPEG-4 advanced audio coding (MPEG-4 AAC) encoder on a floating-point DSP (ADSP-21060 SHARC) based hardware. We discuss how selective optimization of the encoder verification model (VM) software structure allows robust performance using limited resources, highlight some of the problems inherent in...
This paper presents a comparative study of adaptive wavelet design for a speech coder based on wavelet-type multiresolution transforms. It is concerned with the problem of choosing suitable transforms that are adapted to the given speech signal in the sense that they maximize the coding gain at each resolution level. Four adaptive algorithms are re...
This paper compares the eigenstructure and modulation algorithms, which are used for two-channel lossless FIR filter optimization. We study the effects of eigenvalue separation of the input covariance matrix and the step size on their convergence behavior. First, we show that the convergence rate of the two algorithms increases as the separation of eig...
We consider the problem of adaptively optimizing a two-channel lossless FIR filter bank, which finds application in subband coding or wavelet signal analysis. Instead of using a gradient descent procedure, with its inherent problem of possible convergence to local minima, we consider two eigenstructure algorithms. Both algorithms feature a priori bou...
We develop a new algorithm for multirate filter bank optimization, which finds application in subband coding or wavelet signal analysis. Although some impressive off-line algorithms have recently been developed for this purpose, the computation demand of such algorithms often renders them prohibitive for real-time applications. In this vein, adap...
In this article, we apply a result of Benveniste to analyze the asymptotic performance of the two algorithms proposed in [3] and carry out a comparative study. We obtain analytic expressions for the asymptotic variance of the lossless filter bank parameters and of the output (y2(n, Θ) - y2(n, Θ*)). L...
Projects (6)
This project aims at the study of vocal signals beyond the basic verbal message or speech. It includes accent, intonation, pitch, prosody, rhythm, modulation, and fluency, as well as non-vocal phenomena such as facial expressions, hand gestures, eye movements, and so on.
We develop innovative architectures and algorithms for emotion recognition and generation by exploring different modalities (e.g., speech, image, video, text, and physiological signals).