Xiaohai Tian
  • Ph.D.
  • Research Associate at National University of Singapore

About

80 Publications
22,047 Reads
1,504 Citations
Introduction
I’m interested in speech signal processing, in particular speech synthesis. My research topics include voice conversion, speech synthesis, singing synthesis and anti-spoofing.
Current institution
National University of Singapore
Current position
  • Research Associate
Additional affiliations
October 2018 - present
National University of Singapore
Position
  • Research Associate
Education
January 2014 - March 2019

Publications (80)
Conference Paper
Full-text available
This study explores the impact of using non-native speech data in acoustic model training for pronunciation assessment systems. The goal is to determine how introducing non-native data in acoustic model training can influence alignment accuracy and assessment performance. Acoustic models are trained using different combinations of native and non-na...
Preprint
Full-text available
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features. Deep neural networks are commonly trained to map fluency-related features to human scores. However, the effectiveness of deep learning-based models is constrained by the limited number of labeled training samples. To address this, we introduce a...
Preprint
Full-text available
Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., adding or concatenating the reference phone embedding and the actual pronunciation of the target phone to form the phone-level pronunciation quality representation. In this paper, we propose to...
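
The implicit conditioning described here amounts to a simple embedding combination. A minimal sketch in Python/PyTorch, where the tensor names and dimensions are illustrative assumptions rather than the paper's actual architecture:

import torch

def implicit_phone_conditioning(phone_emb, pron_emb):
    # phone_emb: (batch, d_phone) reference phone embedding
    # pron_emb: (batch, d_pron) representation of the actual pronunciation
    # Concatenation yields the phone-level pronunciation quality feature.
    return torch.cat([phone_emb, pron_emb], dim=-1)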
Preprint
Full-text available
A typical fluency scoring system relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech, either for the subsequent calculation of fluency-related features or for directly modeling speech fluency with an end-to-end approach. This paper describes a novel ASR-free approach for automatic fluency assessment usi...
Article
Full-text available
Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or reference speech at run-time. We propose a novel accent conversion framework without the need f...
Article
Full-text available
Cross-lingual voice conversion (XVC) transforms the speaker identity of a source speaker to that of a target speaker who speaks a different language. Due to the intrinsic differences between languages, the converted speech may carry an unwanted foreign accent. In this paper, we first investigate the intelligibility of the converted speech and confi...
Preprint
Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many AC approaches rely on source-target parallel speech data. We propose a novel accent conversion framework without the need for parallel data. Specifically, a text-to-speech (TTS) sys...
Preprint
Full-text available
Deep learning-based pronunciation scoring models rely heavily on the availability of annotated non-native data, which is costly to collect and hard to scale. To deal with the data scarcity problem, data augmentation is commonly used for model pretraining. In this paper, we propose phone-level mixup, a simple yet effective data augmentation metho...
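
Mixup interpolates both the inputs and the targets of two training examples; applied at the phone level, it blends phone-level features together with their scores. A sketch under assumed shapes (the Beta parameter and feature layout are illustrative, not the paper's settings):

import numpy as np

def phone_level_mixup(feats_a, score_a, feats_b, score_b, alpha=0.2):
    # Sample the mixing weight from a Beta distribution, as in standard mixup.
    lam = np.random.beta(alpha, alpha)
    # Interpolate the phone-level features and their pronunciation scores alike.
    mixed_feats = lam * feats_a + (1.0 - lam) * feats_b
    mixed_score = lam * score_a + (1.0 - lam) * score_b
    return mixed_feats, mixed_score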
Article
Full-text available
Cross-lingual personalized speech generation seeks to synthesize a target speaker's voice from only a few training samples that are in a different language. One popular technique is to condition a speech synthesizer on a speaker embedding that characterizes the target speaker. Unfortunately, such a speaker embedding is usually affected by the langu...
Article
Full-text available
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. We release this database to the public to support research activities that include, but are not limited to, compar...
Preprint
Full-text available
Cross-Lingual Voice Conversion (XVC) aims to modify a source speaker's identity towards that of a target while preserving the source linguistic content. This paper introduces a cycle consistency loss on the linguistic representation to ensure the speech content remains unchanged after conversion. The proposed XVC model consists of two loss functions during optimization:...
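
One way to realize such a content-preserving constraint is to re-extract a linguistic representation from the converted speech and penalize its distance to that of the source. A hedged sketch in PyTorch, where ling_encoder is a hypothetical content encoder (e.g., a PPG extractor) and the L1 distance is an assumed choice, not necessarily the paper's exact loss:

import torch
import torch.nn.functional as F

def linguistic_cycle_loss(ling_encoder, source_feats, converted_feats):
    # Content of the source utterance, treated as a fixed target.
    with torch.no_grad():
        ling_src = ling_encoder(source_feats)
    # Content re-extracted from the converted speech.
    ling_cvt = ling_encoder(converted_feats)
    # Penalize any drift in linguistic content caused by conversion.
    return F.l1_loss(ling_cvt, ling_src)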
Preprint
Full-text available
The Multi-speaker Multi-style Voice Cloning Challenge (M2VoC) aims to provide a common sizable dataset as well as a fair testbed for the benchmarking of the popular voice cloning task. Specifically, we formulate the challenge to adapt an average TTS model to the stylistic target voice with limited data from the target speaker, evaluated by speaker iden...
Article
WaveNet was introduced for waveform generation. It produces high-quality text-to-speech synthesis, music generation, and voice conversion. However, it generally requires a large amount of training data, which limits its scope of applications, e.g., in voice conversion. In this paper, we propose a factorized WaveNet for limited-data tasks. Specifically...
Preprint
Full-text available
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. We release this database to the public to support research activities that include, but are not limited to, compara...
Preprint
Full-text available
We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at utter...
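
An utterance-level speaker identity loss of this flavor can be sketched with a pretrained speaker encoder; the encoder, the mel-spectrogram inputs, and the cosine distance below are illustrative assumptions, not necessarily the paper's exact formulation:

import torch
import torch.nn.functional as F

def speaker_identity_loss(spk_encoder, converted_mel, reference_mel):
    # Embed both utterances with a (hypothetical) pretrained speaker encoder.
    emb_cvt = spk_encoder(converted_mel)
    emb_ref = spk_encoder(reference_mel).detach()  # reference is a fixed target
    # Cosine distance pulls the converted speech towards the target identity.
    return 1.0 - F.cosine_similarity(emb_cvt, emb_ref, dim=-1).mean()

The total objective would then combine this term with the frame-level spectral loss, e.g. loss = spectral_loss + w * speaker_identity_loss(...), with w a tunable weight.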
Conference Paper
Full-text available
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semi-parallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the obje...
Preprint
Full-text available
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semi-parallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the obj...
Preprint
Full-text available
The voice conversion challenge is a biennial scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks: intra-lingual semi-parallel and cross-lingual VC. After a two-month ch...
Article
Full-text available
Spoken languages are phonetically similar because humans share a common vocal production system. However, each language has a unique phonetic repertoire and phonotactic rules. In cross-lingual voice conversion, the source and target speakers speak different languages. The challenge is how to project the speaker identity of the source speaker to th...
Conference Paper
Full-text available
In this paper, we formulate a personalized singing voice generation (SVG) framework using WaveRNN with non-parallel training data. We develop an average singing voice generation model using WaveRNN trained on multiple singers' vocals. To map singing Phonetic PosteriorGrams and prosody features from a singing template to time-domain singing samples, a speaker...
Preprint
Full-text available
The security of automatic speaker verification (ASV) systems is compromised by various spoofing attacks. While many types of non-proactive attacks (and their defenses) have been studied in the past, the attacker's perspective on ASV represents a far less explored direction. It can potentially help to identify the weakest parts of ASV systems and be used t...
Conference Paper
Full-text available
This paper presents a cross-lingual voice conversion framework that adopts a modularized neural network. The modularized neural network has a common input structure that is shared for both languages, and two separate output modules, one for each language. The idea is motivated by the fact that phonetic systems of languages are similar because human...
Preprint
This paper presents a cross-lingual voice conversion framework that adopts a modularized neural network. The modularized neural network has a common input structure that is shared for both languages, and two separate output modules, one for each language. The idea is motivated by the fact that phonetic systems of languages are similar because human...
Preprint
Full-text available
Automatic speaker verification (ASV) systems in practice are greatly vulnerable to spoofing attacks. The latest voice conversion technologies are able to produce perceptually natural-sounding speech that mimics any target speaker. However, the perceptual closeness to a speaker's identity may not be enough to deceive an ASV system. In this work, we...
Conference Paper
Full-text available
Speech-to-Singing (STS) conversion aims at converting one's reading speech into his/her singing vocal. Prior work mainly focused on transforming the prosody of speech to singing; however, there are prominent differences between the spectra of speech and singing, which need to be transformed as well. In this paper, we propose to make use o...
Conference Paper
Full-text available
Among various voice conversion (VC) techniques, the average modeling approach has achieved good performance as it benefits from the training data of multiple speakers, thereby reducing the reliance on training data from the target speaker. Many existing average modeling approaches rely on the use of i-vectors to represent the speaker identity for model a...
Conference Paper
Full-text available
This paper presents a cross-lingual voice conversion approach using bilingual Phonetic PosteriorGram (PPG) and average modeling. The proposed approach makes use of bilingual PPGs to represent speaker-independent features of speech signals from different languages in the same feature space. In particular, a bilingual PPG is formed by stacking two mo...
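
Forming a bilingual PPG by stacking two monolingual PPGs is straightforward to sketch; the shapes below are illustrative (frames by per-language senone posteriors), assuming both extractors run over the same frame sequence:

import numpy as np

def bilingual_ppg(ppg_lang1, ppg_lang2):
    # Each input: (num_frames, num_senones_of_that_language), frame-aligned.
    assert ppg_lang1.shape[0] == ppg_lang2.shape[0], "PPGs must be frame-aligned"
    # Stacking along the posterior axis puts both languages in one feature space.
    return np.concatenate([ppg_lang1, ppg_lang2], axis=1)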
Preprint
Full-text available
In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of dealing with the intermediate features,...
Article
Full-text available
The task of voice conversion is to modify a source speaker's voice to sound like that of a target speaker. A conversion method is considered successful when the produced speech sounds natural and similar to the target speaker. This paper presents a new voice conversion framework in which we combine frequency warping and an exemplar-based method for voi...
Conference Paper
In this paper, we present an age-friendly E-commerce system with novel assistive functional technologies, aiming at providing a comfortable online shopping environment for the elderly. Besides incorporating human factors for the elderly into the design of the user interface, we build an age-friendly system by improving functional usability. First,...
Conference Paper
Full-text available
Spoofing speech detection aims to differentiate spoofed speech from natural speech. Frame-based features are used in most previous works. Although multiple frames or dynamic features are used to form a super-vector to represent temporal information, the time span covered by these features is not sufficient. Most of the systems fail...
Conference Paper
Full-text available
Spoofing detection for automatic speaker verification (ASV), which is to discriminate between live and artificial speech, has received increasing attention recently. However, previous studies have been done on clean data without significant noise. It is still not clear whether spoofing detectors trained on clean speech can generalise w...
Conference Paper
This paper presents an age-friendly system for improving the elderly's online shopping experience. Different from most related studies focusing on website design and content organization, we propose to integrate three assistive techniques to facilitate the elderly's browsing of products in E-commerce platforms, including the crowd-improved speech r...
Article
Full-text available
Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on spectral features as they directly represent the timbre characteristics, while some conversion methods have focused on...
Conference Paper
Full-text available
Spoofing detection, which discriminates spoofed speech from natural speech, has gained much attention recently. Low-dimensional features that are used in speaker recognition/verification are also used in spoofing detection. Unfortunately, they do not capture sufficient information required for spoofing detection. In this work, we investigate...
Article
Full-text available
Spoofing detection for automatic speaker verification (ASV), which is to discriminate between live speech and attacks, has received increasing attention recently. However, all previous studies have been done on clean data without significant additive noise. To simulate real-life scenarios, we perform a preliminary investigation of spoo...
Conference Paper
Full-text available
State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech,...
Conference Paper
Full-text available
Recently, a number of voice conversion methods have been developed. These methods attempt to improve conversion performance by using diverse mapping techniques in various acoustic domains, e.g. high-resolution spectra and low-resolution Mel-cepstral coefficients. Each individual method has its own pros and cons. In this paper, we introduce a system...
Conference Paper
Full-text available
Recent improvements in text-to-speech (TTS) and voice conversion (VC) techniques pose a threat to automatic speaker verification (ASV) systems. An attacker can use TTS or VC systems to synthesize a target speaker's voice to cheat the ASV system. To address this challenge, we study the detection of such synthetic speech (called spoofing speech...
Article
Full-text available
This paper describes the current state of the work being carried out in the framework of the ZureTTS project to give a personalized voice to people who cannot speak with their own. Despite the availability of tools and algorithms to synthesize speech and adapt it to new speakers, this process is affordable only for experts. To overcome this p...
Conference Paper
Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. It poses a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice to cheat the SV system. To address this challenge, we study the detection of synthetic speech using long term mag...
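
Long-term spectral statistics of this kind can be computed by averaging frame-level log-magnitude spectra over a whole utterance; the window and hop sizes below are illustrative assumptions, not the paper's configuration:

import numpy as np

def long_term_log_magnitude(signal, n_fft=512, hop=160):
    # Frame the utterance with a Hann window (signal assumed >= n_fft samples).
    win = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * win
              for s in range(0, len(signal) - n_fft + 1, hop)]
    # Log-magnitude spectrum per frame, then the utterance-level average.
    spec = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return np.log(spec + 1e-10).mean(axis=0)  # shape: (n_fft // 2 + 1,)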
Conference Paper
Full-text available
Frequency warping (FW) based voice conversion aims to modify the frequency axis of source spectra towards that of the target. In previous works, the optimal warping function was calculated by minimizing the spectral distance between converted and target spectra without considering the spectral shape. Nevertheless, speaker timbre and identity greatly dep...
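
Applying a warping function to a magnitude spectrum reduces to resampling the frequency axis; the convention below (warp_fn maps each target bin back to a source bin, often piecewise linear in FW-based VC) is an assumption for illustration:

import numpy as np

def warp_spectrum(spectrum, warp_fn):
    # spectrum: magnitude spectrum over n_bins linear-frequency bins.
    n_bins = len(spectrum)
    # Source-bin positions to read for each target bin, clipped to valid range.
    src_bins = np.clip(warp_fn(np.arange(n_bins)), 0, n_bins - 1)
    # Linear interpolation realizes the warped spectrum.
    return np.interp(src_bins, np.arange(n_bins), spectrum)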
Conference Paper
Full-text available
Studies show that professional singing matches the associated melody well and typically exhibits spectra different from speech in resonance tuning and the singing formant. Therefore, one of the important topics in speech-to-singing conversion is to characterize the spectral transformation between speech and singing. This paper extends two types of spec...
Conference Paper
The joint density Gaussian mixture model (JD-GMM) based method has been widely used in the voice conversion task due to its flexible implementation. However, the statistical averaging effect when estimating the model parameters results in over-smoothed target spectral trajectories. Motivated by the local linear transformation method, which uses...
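
For reference, the standard JD-GMM mapping is a posterior-weighted sum of per-component linear regressions; the sketch below implements that textbook formula with full covariances and illustrative array shapes (it is the baseline being improved upon, not the local-linear variant this paper proposes):

import numpy as np

def jdgmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    # x: (D,) source feature; weights: (M,); mu_x: (M, D); mu_y: (M, Dy)
    # S_xx: (M, D, D) source covariances; S_yx: (M, Dy, D) cross-covariances
    M, D = mu_x.shape
    lik = np.empty(M)
    for m in range(M):  # Gaussian likelihood of x under each component
        diff = x - mu_x[m]
        inv = np.linalg.inv(S_xx[m])
        norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(S_xx[m]))
        lik[m] = np.exp(-0.5 * diff @ inv @ diff) / norm
    post = weights * lik
    post /= post.sum()  # posteriors P(m | x)
    # Posterior-weighted sum of component-wise linear transforms.
    y_hat = np.zeros(mu_y.shape[1])
    for m in range(M):
        y_hat += post[m] * (mu_y[m] + S_yx[m] @ np.linalg.inv(S_xx[m]) @ (x - mu_x[m]))
    return y_hat

The statistical averaging over components in the last step is exactly what tends to over-smooth the converted trajectories.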
Conference Paper
Full-text available
In this demonstration, we introduce our recent progress on speech and auditory technologies for potential ubiquitous, immersive and personalized applications. The first demo shows an intelligent spoken question answering system, which enables users to interact with a talking avatar via natural speech dialogues. The prototype system demonstrates our...
Conference Paper
Full-text available
Head related transfer functions (HRTFs) vary with the anatomical structures of the head, pinna and torso. Chinese subjects differ considerably from their Western counterparts in these anatomical characteristics. In this paper, we perform an experimental study on HRTF-based virtual auditory display, in which the HRTF data are collected from two dummy heads: KEMAR and...
