Dong-Yan Huang

Agency for Science, Technology and Research (A*STAR) · Institute for Infocomm Research (I2R)

About

71 Publications
20,247 Reads
710 Citations
Citations since 2017: 10 Research Items, 490 Citations

Publications (71)
Conference Paper
In this paper, we present an audio-visual emotion conversion method based on deep learning for a 3D talking head. The technology aims at retargeting neutral facial and speech expressions into emotional ones. The challenging issues are how to control the dynamics and variations of different expressions of both speech and the face. The controllability of facial ex...
Conference Paper
Affective social multimedia computing is an emergent research topic for both affective computing and multimedia research communities. Social multimedia is fundamentally changing how we communicate, interact, and collaborate with other people in our daily lives. Social multimedia contains much affective information. Effective extraction of affective...
Conference Paper
Full-text available
In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The...
Conference Paper
Full-text available
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenge...
Conference Paper
Full-text available
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2016 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Compared to earlier years’ movie-based datasets, this year’s test dataset introduced realit...
Conference Paper
Voice conversion (VC) aims to make one speaker (source) sound like another speaker (target) without changing the linguistic content. Most state-of-the-art voice conversion systems focus only on timbre conversion. However, the speaker identity is characterized by the source-related cues such as fundamental frequency and energy as w...
Conference Paper
Automatic detection of pathological voice is a challenging task in speech processing. Appropriate acoustic cues of voice can be used to differentiate between normal voices and pathological voices. We propose a method to represent each speech utterance using three types of speech signal representations (i.e., cross-correlation matrix, Gaussian distr...
Conference Paper
Full-text available
To convert one speaker's voice to another's, the mapping of the corresponding speech segments from source speaker to target speaker must be obtained first. In parallel voice conversion, the dynamic time warping (DTW) method is normally used to align the signals of the source and target voices. However, for conversion between non-parallel speech data, the DTW b...
Conference Paper
Full-text available
Nonnegative matrix factorization (NMF) is a popular method for source separation. In this paper, an alternating direction method of multipliers (ADMM) for NMF is studied, which deals with the NMF problem using the cost function of beta-divergence. Our study shows that this algorithm outperforms state-of-the-art algorithms on synthetic data sets, bu...
Conference Paper
Full-text available
This paper presents a representation of fundamental frequency (F0) using the continuous wavelet transform (CWT) for prosody modeling in emotion conversion. Emotion conversion aims at converting speech from one emotional state to another. Specifically, we use CWT to decompose F0 into a five-scale representation that corresponds to five temporal scales...
Conference Paper
Full-text available
This paper proposes a real-time variable-Q non-stationary Gabor transform (VQ-NSGT) system for speech pitch shifting. The system allows for time-frequency representations of speech on variable-Q (VQ) with perfect reconstruction and computational efficiency. The proposed VQ-NSGT phase vocoder can be used for pitch shifting by simple frequency trans...
Conference Paper
Full-text available
Non-negative matrix factorization (NMF) aims at finding non-negative representations of non-negative data. Among different NMF algorithms, alternating direction method of multipliers (ADMM) is a popular one with superior performance. However, we find that ADMM shows instability and inferior performance on real-world data like speech signals. In thi...
Conference Paper
Full-text available
In this paper, we address the problem of phase retrieval to recover a signal from the magnitude of its Fourier transform. In many applications of phase retrieval, the signals encountered are naturally sparse. In this work, we consider the case where the signal is sparse under the assumption that few components are nonzero. We exploit further the sp...
Conference Paper
Full-text available
In this paper, we propose a linear pre- and post-filter for improving the perceptual speech quality of a vocoder based on the amplitude spectrum of the residual signal. The goal of the pre-filter in the analyzer is to make the spectral noise inaudible. It attempts to hide the noise under the signal spectrum by exploiting human auditory masking. The post-filt...
Article
Full-text available
Emotional facial expression transfer involves sequence-to-sequence mappings from a neutral facial expression to another emotional facial expression, which is a well-known problem in computer graphics. In the graphics community, currently considered methods are typically linear (e.g., methods based on blendshape mapping) and the dynamical aspects of...
Article
Full-text available
The Nyström method is an efficient technique for scaling kernel learning to very large data sets with millions of samples or more. Instead of computing the full kernel matrix, it approximates a kernel learning problem with a linear prediction problem. We propose an ensemble Nyström method for high-dimensional prediction of conflict level from speech. The experi...
Article
Full-text available
In this paper, we present a method to improve the classification recall of a deep Boltzmann machine (DBM) on the task of emotion recognition from speech. The task involves the binary classification of four emotion dimensions: arousal, expectancy, power, and valence. The method consists of dividing the features of the input data into separate...
Conference Paper
Full-text available
Separating the leading voice from a music mixture remains challenging for automatic systems. Competing harmonics from the music accompaniment severely interfere with the leading voice estimation. To properly extract the leading voice, separation algorithms based on source-filter modeling of human voice and non-negative matrix factorization have been introduced....
Conference Paper
Full-text available
High-quality natural singing is appealing to everyone. However, singing well is not trivial. Professional singers train and practise over years to develop various singing skills, such as sense of rhythm, pitch precision, vocal control, and personal styling. We at the Institute for Infocomm Research (I2R) in Singapore have developed a personaliz...
Conference Paper
Full-text available
Pathological speech usually refers to voice disorders resulting from atypicalities in the voice and/or in the articulatory mechanisms due to disease, illness or other physical problems in the speech production system. It may increase unhealthy social behavior and voice abuse, and dramatically affect the patients' quality of life. Therefore, automati...
Conference Paper
Full-text available
In this paper, we explore Dynamic Gaussian Processes (DGP) based learning techniques for voice conversion. In particular, we propose to use dynamic squared exponential GP with sparse partial least squares (SPLS) technique to model nonlinearities as well as to capture the dynamics in the source data. The concatenation of previous and next frames can...
Conference Paper
Full-text available
In this paper, we compare the performance of support vector machine (SVM) and asymmetric SIMPLS classifiers in emotion recognition from naturalistic dialogues. These two classifiers are evaluated on the SEMAINE corpus that involves emotional binary classification tasks of four dimensions, namely, activation, expectation, power, and valence. The exp...
Article
Voice disorders could increase unhealthy social behavior and voice abuse, and dramatically affect the patients' quality of life. Therefore, automatic intelligibility detection of pathological voices has an important role in the opportune treatment of pathological voices. This paper aims at designing an intelligibility detection system which is char...
Article
Full-text available
The transformation of vocal characteristics aims at modifying a voice so that the intelligibility of an aphonic voice is increased, or so that the speech of one speaker (source speaker) is perceived as if another speaker (target speaker) had uttered it. In this paper, the current state-of-the-art voice characteristics transformation methodology...
Conference Paper
This paper describes a Speaker State Classification System (SSCS) for the INTERSPEECH 2011 Speaker State Challenge. Our SSC system for the Intoxication and Sleepiness Sub-Challenges uses fusion of several individual sub-systems. We make use of three standard feature sets per corpus provided by the organizers. Modeling is based on our own developed classif...
Article
This paper investigates the performance of objective speech and audio quality measures for the prediction of perceived sound quality of synthetic speech. A number of existing quality measures have been applied to synthetic speech generated by different speech synthesizers such as the LP synthesizer, HSM synthesizer, STRAIGHT synthesizer and several H...
Article
This paper aims at evaluating the performance of a “Lombard effect model” for improving speech intelligibility over a telephone channel. It is well known that the naturalness and intelligibility of speech degrade rapidly in communication channels, such as phone networks or public address systems. To reduce the degradation, a “Lombard effect mimickin...
Article
Full-text available
Seeing that speakers increase the intensity of their voice when speaking in loud noise (Lombard effect), this paper proposes a speech transformation approach to mimic this Lombard effect for improving the intelligibility of speech in noisy environments. The approach attempts to simulate the variations of duration, formant frequencies, formant ban...
Conference Paper
Full-text available
This paper presents high-level strategies for controlling emotional speech morphing algorithms. Emotion morphing is realized by representing the acoustic features in their time-frequency plane, which is warped and modified to generate natural morphed emotional speech. These acoustic features are desirable to be decomposed into multidimensional space a...
Conference Paper
This paper describes a method to increase speech intelligibility when the speech signal is being transmitted over telephone lines. In order to detect all factors which affect speech intelligibility, we use the telephone simulation tool in the ITU-T Software Tools Library release 2005 (STL2005) to identify the most problematic telephone-channel deterioration...
Article
Full-text available
This paper describes I2R's submission to the Blizzard Challenge 2009. This is our second time participating in this challenge. In this paper, we will describe our main approach to building the required voices. We will introduce the procedure of database processing, the definitions of the acoustic, prosodic and linguistic parameters, the components...
Conference Paper
In this paper, we use a stochastic fixed-point theorem to study the stochastic convergence properties (in the mean-square sense) of the cascaded LMS predictor, including conditions on the step size for the adaptive algorithm convergence and the misadjustment. An analytic expression for the misadjustment is derived for Gaussian statistical signals and sh...
Article
Full-text available
In a low bit-rate audio transform coding system, the quantization noise is spread out over the entire analysis block in the time domain after the inverse transformation in the decoder. This increases the likelihood of pre-echoes occurring prior to a transient event. In order to control the pre-echo level, a suitable quantizer should be applied in the...
Conference Paper
Full-text available
In this paper, we use a stochastic fixed-point theorem to analyze the stochastic convergence properties of the cascaded RLS-LMS prediction filter in terms of conditions of convergence and the misadjustment. It is shown that the cascaded RLS-LMS prediction filter converges to almost the same optimal solution as the conventional RLS filter. The misad...
Article
This paper describes the cascaded recursive least square-least mean square (RLS-LMS) prediction, which is part of the recently published MPEG-4 Audio Lossless Coding international standard. The predictor consists of cascaded stages of simple linear predictors, with the prediction error at the output of one stage passed to the next stage as the inpu...
Conference Paper
It is well known that most adaptive filtering algorithms are developed based on the methods of least mean squares or of least squares. Popular adaptive algorithms such as the LMS, the RLS and their variants have been developed for different applications. In this paper, we propose to use maximum a posteriori (MAP) probability approach to estim...
Article
Full-text available
This paper addresses the problem of adaptively optimizing a two-channel lossless finite-impulse-response (FIR) filter bank, which finds application in subband coding and wavelet signal analysis. Instead of using a gradient descent procedure, with its inherent problem of becoming trapped in local minima of a nonquadratic cost function, two eigenstructu...
Conference Paper
We study the performance of an FIR cascade structure for adaptive linear prediction, in which each FIR filter stage is independently adapted using an LMS algorithm. The performance bound is derived for the cascade LMS predictor under some assumptions. We discover that it is possible for this bound to be better than that of the linear predictive cod...
Conference Paper
Full-text available
An FIR cascade structure for adaptive linear prediction is studied in which each FIR filter stage is independently adapted using an LMS algorithm. Theoretical analysis shows that the cascade performs linear prediction by successive refinement, and each stage tries to obliterate the dominant mode of its input. Experimental results show tha...
Conference Paper
In this paper, we study the issue of modeling high-sampling-rate audio for lossless audio coding. We propose a cascade LMS structure to successfully model all high-sampling-rate audio signals. This cascade structure predictor, not only performs better than its counterpart FIR linear prediction coding (LPC) technique in modeling general audio si...
Conference Paper
Full-text available
We analyze the effect of both correlated and uncorrelated input signals on the performance of a cascade RLS-LMS based adaptive linear prediction algorithm for lossless compression of different resolution audio signals. We show that the sensitivity of the cascade RLS-LMS predictor to the input signal depends mainly on that of the RLS pre-whitener. T...
Conference Paper
In this paper, a new approach to pitch detection for spontaneous speech in noisy environments is proposed. It uses a multi-rate lossless FIR filter to build a model which connects acoustic tubes and the wavelet transform. An "on-line" eigen-structure algorithm is used to control the parameters of the model based on the state description of the physical sys...
Article
Full-text available
In this paper, an efficient implementation of the forward and inverse MDCT is proposed for even-length MDCT. The algorithm uses the discrete cosine transform of type II (DCT-II) to compute the forward MDCT and its inverse, DCT-III, to compute the inverse MDCT. The lifting scheme is used to approximate multiplications appearing in the MDCT lattice struc...
Conference Paper
In this paper, four frame synchronisation algorithms for OFDM systems are examined for preamble-based time synchronisation. However, the performance of the frame synchronisation depends on the pre-defined threshold. The performance of these algorithms is analysed in terms of estimation mean and variance of optimum frame position. The threshold can...
Conference Paper
A novel echo embedding technique is proposed to overcome the inherent trade-off between inaudibility and robustness in conventional echo hiding. It makes use of a masking model to embed two closely located echoes, with positive and negative pulses and high energy, into host audio signals. Subjective listening tests show that the proposed method could i...
Conference Paper
Full-text available
An adaptive and content-based audio watermarking system based on echo hiding is presented. The proposed system aims to overcome the problems of audible echoes in a simple echo hiding approach. Audibility of echoes is reduced especially for problematic signals by the application of signal-based attenuation, psycho-acoustic model and perceptual filte...
Conference Paper
In this paper, we present the implementation of the MPEG-4 advanced audio coding (MPEG-4 AAC) encoder on a floating-point DSP (ADSP-21060 SHARC) based hardware. We discuss how selective optimization of the encoder verification model (VM) software structure allows robust performance using limited resources, highlight some of the problems inherent in...
Conference Paper
Full-text available
This paper presents a comparative study of adaptive wavelet design for a speech coder based on wavelet-type multiresolution transforms. It is concerned with the problem of choosing suitable transforms that are adapted to the given speech signal in the sense that they maximize the coding gain at each resolution level. Four adaptive algorithms are re...
Conference Paper
Full-text available
This paper compares the eigenstructure and modulation algorithms, which are used for two-channel lossless FIR filter optimization. We study the effects of eigenvalue separation of the input covariance matrix and the step size on their convergence behavior. First, we show that the convergence rate of the two algorithms increases as the separation of eig...
Conference Paper
Full-text available
We consider the problem of adaptively optimizing a two-channel lossless FIR filter bank, which finds application in subband coding or wavelet signal analysis. Instead of using a gradient descent procedure, with its inherent problem of possible convergence to local minima, we consider two eigenstructure algorithms. Both algorithms feature a priori bou...
Article
Full-text available
We develop a new algorithm for multirate filter bank optimization, which finds application in subband coding or wavelet signal analysis. Although some impressive off-line algorithms have recently been developed for this purpose, the computation demand of such algorithms often renders them prohibitive for real-time applications. In this vein, adap...
Article
In this article, we apply a result of Benveniste to analyze the asymptotic performance of the two algorithms proposed in [3] and carry out a comparative study. We obtain analytical expressions for the asymptotic variance of the parameters of lossless filter banks and for that of the output (y2(n, Θ) - y2(n, Θ*)). ...

Projects (6)
Project
It aims at the study of vocal signals beyond the basic verbal message or speech. It includes accent, intonation, pitch, prosody, rhythm, modulation, and fluency and non-vocal phenomena such as facial expression, hand gestures, eye movements, and so on.
Project
We develop innovative architectures and algorithms for emotion recognition and generation by exploring different modalities (e.g., speech, image, video, text, and physiological signals)
Project
To recognize emotion from speech or generate emotional speech.