Yi-Chiao Wu
  • Doctor of Philosophy
  • Research Scientist at Meta

About

90 Publications
11,096 Reads
2,621 Citations
Introduction
I'm currently a research scientist at Meta FAIR, NYC. I have worked at Meta's Codec Avatars Lab, Nagoya University, Academia Sinica, ASUS, and Realtek for more than 8 years. I received my Ph.D. (2021) from the Graduate School of Informatics at Nagoya University, and my M.S. (2011) and B.S. (2009) degrees from the School of Communication Engineering at National Chiao Tung University.
Current institution
Meta
Current position
  • Research Scientist
Additional affiliations
January 2022 - October 2023
Meta
Position
  • Research Scientist
Description
  • AudioDec: https://github.com/facebookresearch/audiodec
  • ScoreDec: https://ieeexplore.ieee.org/document/10448371
  • EARS dataset: https://github.com/facebookresearch/ears_dataset
October 2013 - March 2015
ASUSTeK Computer Inc. (ASUS)
Position
  • Software Engineer
March 2012 - September 2013
Realtek Semiconductor Corp.
Position
  • System Designer
Education
October 2017 - October 2020
Nagoya University
Field of study
  • Speech Generation
September 2009 - August 2011
National Yang Ming Chiao Tung University
Field of study
  • Speech Processing
September 2005 - June 2009
National Yang Ming Chiao Tung University
Field of study
  • Communication Engineering

Publications

Preprint
We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s t...
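
The abstract mentions a stochastic postfilter trained with conditional flow matching. As a rough illustration only (not FlowDec's actual training code), the sketch below shows the basic conditional flow-matching objective with a straight-line probability path; the `velocity_net` interface and tensor shapes are assumptions.

```python
import torch

def flow_matching_loss(velocity_net, x0, x1, cond):
    """Straight-line conditional flow-matching objective (sketch).

    x0: prior samples (e.g., Gaussian noise), shape (batch, channels, time)
    x1: target waveforms, same shape
    cond: conditioning input, e.g. the codec's coarse reconstruction
    """
    t = torch.rand(x0.shape[0], 1, 1, device=x0.device)  # random time per example
    x_t = (1.0 - t) * x0 + t * x1                        # point on the linear path
    v_target = x1 - x0                                   # constant target velocity
    v_pred = velocity_net(x_t, t, cond)                  # model's velocity estimate
    return torch.mean((v_pred - v_target) ** 2)
```

At inference, the learned velocity field is integrated from noise to a sample with an ODE solver, which is what makes such postfilters cheaper than many-step score-based ones.
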
Preprint
Full-text available
Neural audio codecs have been widely adopted in audio-generative tasks because their compact and discrete representations are suitable for both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, resulting in error propagation to downstream generative tasks. In this...
Preprint
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conv...
Preprint
The trend of scaling up speech generation models poses a threat of leaking biometric information about the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to h...
Article
Full-text available
The trend of scaling up speech generation models poses the threat of leaking biometric information about the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate the training of multispeaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that te...
Preprint
Full-text available
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e., the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e., encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticea...
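
To make the compression and latency properties concrete, here is a small worked calculation for a frame-based codec with residual vector quantization; all numbers are hypothetical and not taken from the paper.

```python
import math

# Hypothetical settings for a frame-based codec with residual vector
# quantization (RVQ); none of these numbers come from the paper.
sample_rate = 48_000                   # Hz
hop_size = 320                         # samples advanced per frame
frame_rate = sample_rate / hop_size    # 150 frames per second
num_codebooks = 8                      # RVQ stages transmitted per frame
codebook_size = 1024                   # 2**10 entries -> 10 bits per code
bits_per_code = int(math.log2(codebook_size))

bitrate = frame_rate * num_codebooks * bits_per_code     # 12,000 bit/s
algorithmic_latency_ms = 1000 * hop_size / sample_rate   # one-frame lookahead
print(f"{bitrate / 1000:.1f} kbit/s, {algorithmic_latency_ms:.1f} ms frame latency")
# -> 12.0 kbit/s, 6.7 ms frame latency
```

Lowering the frame rate or the number of codebooks reduces bitrate, while larger analysis windows increase algorithmic latency: exactly the trade-off the abstract describes.
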
Article
Full-text available
We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features, which represents the source-filter architecture in a generator network. Unlike the previous neural-based source-filter models in which parametric signal processing modules are combined with neural networks, our app...
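
As a minimal structural reading of the source-filter factorization described above (module names are hypothetical, not uSFGAN's actual code): one sub-network turns F0 and noise into an excitation signal, and a second shapes that excitation with spectral features, so pitch is controlled through the source branch alone.

```python
import torch

class SourceFilterGenerator(torch.nn.Module):
    """Sketch of a source-filter factorized waveform generator."""

    def __init__(self, source_net, filter_net):
        super().__init__()
        self.source_net = source_net  # F0 + noise -> excitation waveform
        self.filter_net = filter_net  # excitation + spectral features -> speech

    def forward(self, f0, noise, spectral_feats):
        excitation = self.source_net(f0, noise)             # periodic/aperiodic source
        return self.filter_net(excitation, spectral_feats)  # resonance shaping
```
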
Preprint
Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, the high temporal resolution inputs result in high computation costs. Although the HiFi-GAN v...
Preprint
Full-text available
Neural-based text-to-speech (TTS) systems achieve very high-fidelity speech generation thanks to rapid developments in neural networks. However, the requirements of a huge labeled corpus and high computation costs limit the possibility of developing a high-fidelity TTS system by small companies or individuals. On the other hand, a neural vocoder, which...
Preprint
This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capabi...
Preprint
Full-text available
Beyond the conventional voice conversion (VC) where the speaker information is converted without altering the linguistic content, the background sounds are informative and need to be retained in some real-world scenarios, such as VC in movie/video and VC in music where the voice is entangled with background sounds. As a new VC framework, we have de...
Preprint
Full-text available
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and confirmed to provide promising performance. However, most DNN-based approaches are designed for nor...
Preprint
In a conventional voice conversion (VC) framework, a VC model is often trained with a clean dataset consisting of speech data carefully recorded and selected by minimizing background interference. However, collecting such a high-quality dataset is expensive and time-consuming. Leveraging crowd-sourced speech data in training is more economical. Mor...
Preprint
Full-text available
Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data of a specific target speaker for most real-world applications. To tackle the problem of lim...
Preprint
We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining flexibility of the source-filter model to control their voice characteristics. Our proposed network called unified source-filter generative ad...
Preprint
Full-text available
This paper describes the AS-NU systems for two tracks in MultiSpeaker Multi-Style Voice Cloning Challenge (M2VoC). The first track focuses on using a small number of 100 target utterances for voice cloning, while the second track focuses on using only 5 target utterances for voice cloning. Due to the serious lack of data in the second track, we sel...
Preprint
In this paper, we present open-source software for developing a nonparallel voice conversion (VC) system, named crank. Although we released sprocket, open-source VC software based on the Gaussian mixture model, in the last VC Challenge, it is not straightforward to apply it to any speech corpus because it is necessary to prepare parallel ut...
Article
Full-text available
In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achie...
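
The core mechanism here is a dilated convolution whose dilation adapts to the local pitch period, so the receptive field covers a fixed number of cycles regardless of F0. The sketch below computes per-frame dilation sizes under that reading; the `dense_factor` name and default values are illustrative, not the paper's exact formulation.

```python
import numpy as np

def pitch_dependent_dilations(f0, sample_rate=22050, dense_factor=4):
    """Per-frame dilation sizes for a pitch-dependent dilated convolution.

    The dilation tracks the local pitch period so that convolution taps
    fall a fixed fraction of a cycle apart; dense_factor sets how many
    taps cover one period.
    """
    f0 = np.maximum(np.asarray(f0, dtype=float), 1.0)  # guard unvoiced (F0 = 0) frames
    period_in_samples = sample_rate / f0               # pitch period T0 * fs
    return np.maximum(1, np.round(period_in_samples / dense_factor)).astype(int)

# Example: lower pitch -> longer period -> larger dilation
f0_contour = np.array([100.0, 150.0, 200.0])
print(pitch_dependent_dilations(f0_contour))           # [55 37 28]
```

Because the dilation is recomputed from the conditioning F0, shifting the input pitch directly stretches or compresses the network's receptive field, which is what gives these models their pitch controllability.
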
Article
Full-text available
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real t...
Article
Full-text available
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, and are thus far from practical. To tackle these shortcomings, we propose to transfer knowledge
Article
Full-text available
This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. WaveNet (WN) vocoder is employed as a high-qual...
Preprint
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, w...
Preprint
In this paper, we present the voice conversion (VC) systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC2020). We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders. Two respective systems were developed fo...
Preprint
In this paper, we present a description of the baseline system of Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN (PWG), i.e., CycleVAEPWG. CycleVAE is a nonparallel VAE-based voice conversion that utilizes converted acoustic features to consider cyclically reconstructed spectra during opt...
Preprint
Full-text available
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, and are thus far from practical. To tackle these shortcomings, we propose to transfer knowledge
Preprint
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real t...
Preprint
Full-text available
In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achie...
Preprint
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generation speed is much faster than real-time. While PWG is taken as a vocoder to generate speech on the basis of aco...
Preprint
Recently, the effectiveness of text-to-speech (TTS) systems combined with neural vocoders to generate high-fidelity speech has been shown. However, collecting the required training data and building these advanced systems from scratch is time- and resource-consuming. A more economical approach is to develop a neural vocoder to enhance the speech gen...
Article
Full-text available
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentation...
Article
Full-text available
In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN voco...
Preprint
In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN voco...
Preprint
Full-text available
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been suc...
Article
Full-text available
In this paper, we present a novel framework for a voice conversion (VC) system based on a cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even though WaveNet is capable of producing natural speech waveforms when fed with natural speech features, it still suffers from speech quality degradation when fed with oversmooth...
Preprint
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentati...
Preprint
Full-text available
We present a modification to the spectrum-differential-based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder while preserving rich spectral details, and is hence capable of generating high qu...
Preprint
In this paper, we present a novel technique for a non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder(VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features...
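
Reading the abstract, the key idea is to decode the latent with the target speaker code and then feed the converted features back through the model, so a cyclic reconstruction loss exposes the model to converted features during non-parallel training. A minimal sketch of that step follows; the `encoder`/`decoder` interfaces and the unweighted loss sum are assumptions.

```python
import torch

def cyclevae_step(encoder, decoder, x_src, code_src, code_tgt):
    """One CycleVAE-style training step (sketch; interfaces are assumptions)."""
    z, kl = encoder(x_src)            # latent + KL term vs. a Gaussian prior
    x_rec = decoder(z, code_src)      # ordinary VAE reconstruction branch
    x_conv = decoder(z, code_tgt)     # conversion branch (no parallel target data)
    z_cyc, kl_cyc = encoder(x_conv)   # re-encode the converted features
    x_cyc = decoder(z_cyc, code_src)  # cyclic reconstruction of the source
    recon = torch.mean((x_rec - x_src) ** 2) + torch.mean((x_cyc - x_src) ** 2)
    return recon + kl + kl_cyc        # loss weighting omitted for brevity
```
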
Preprint
Full-text available
In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over convent...
Preprint
Full-text available
In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder. The effectiveness of the WN vocoder to generate high-fidelity speech samples from given acoustic features has been proved recently...
Preprint
Full-text available
In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship between vocoder features extracted using the high quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. Su...
Preprint
Full-text available
This paper presents a refinement framework of WaveNet vocoders for variational autoencoder (VAE) based voice conversion (VC), which reduces the quality distortion caused by the mismatch between the training data and testing data. Conventional WaveNet vocoders are trained with natural acoustic features but condition on the converted features in the...
Preprint
Full-text available
In this paper, we propose a technique to alleviate quality degradation caused by collapsed speech segments sometimes generated by the WaveNet vocoder. The effectiveness of the WaveNet vocoder in generating natural speech from acoustic features has been proved in recent works. However, the WaveNet vocoder sometimes generates very noisy speech suffering from t...
Article
Full-text available
This paper presents a novel locally linear embedding (LLE)-based framework for exemplar-based spectral conversion (SC). The key feature of the proposed SC framework is that it integrates the LLE algorithm, a manifold learning method, with the conventional exemplar-based SC method. One important advantage of the LLE-based SC framework is that it can...
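
As a worked example of the exemplar-based LLE idea (a sketch under the standard LLE formulation, not the paper's exact system): each source frame is expressed as a sum-to-one weighted combination of its nearest source exemplars, and the same weights are applied to the paired target exemplars.

```python
import numpy as np

def lle_convert(x, src_dict, tgt_dict, k=8, reg=1e-5):
    """Exemplar-based spectral conversion with LLE weights (minimal sketch).

    src_dict, tgt_dict: (num_exemplars, dim) paired source/target frames.
    x: one source spectral frame of shape (dim,).
    """
    dists = np.linalg.norm(src_dict - x, axis=1)
    idx = np.argsort(dists)[:k]                   # k nearest source exemplars
    diffs = src_dict[idx] - x                     # (k, dim) local neighborhood
    gram = diffs @ diffs.T                        # local Gram matrix
    gram += reg * np.trace(gram) * np.eye(k)      # regularize for stability
    w = np.linalg.solve(gram, np.ones(k))         # solve G w = 1 (LLE weights)
    w /= w.sum()                                  # enforce the sum-to-one constraint
    return w @ tgt_dict[idx]                      # carry weights to the target space
```

Because the weights are estimated per frame on a local manifold patch, no global conversion function has to be trained, which is the appeal of exemplar-based SC.
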
Article
We present a novel speech enhancement method based on locally linear embedding (LLE). The proposed method works as a post-filter to further suppress the residual noises in the enhanced speech signals obtained by a speech enhancement system to attain improved speech quality and intelligibility. We design two types of LLE-based post-filters: the dire...
Article
Full-text available
Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios. In most situations, the source and the target speakers do not repeat the same texts or they may even speak different languages. In this case, one possible, although indirect, solution is to build a generative mo...
Conference Paper
Full-text available
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely...
Article
Full-text available
In this paper, we propose a dictionary update method for Nonnegative Matrix Factorization (NMF) with high dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized speech synthesis and speech enhancement. Exemplar-based NMF (ENMF) emerges as an effective and...
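
For context, the baseline that such a dictionary update refines is standard multiplicative-update NMF; a compact sketch is below (illustrative only, it does not reproduce the paper's proposed update).

```python
import numpy as np

def nmf_multiplicative(V, rank=32, iters=200, eps=1e-9):
    """Standard multiplicative-update NMF: V (n, m) ~ W (n, rank) @ H (rank, m)."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # activation (encoding) update
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # dictionary (basis) update
    return W, H
```

In the exemplar-based setting the dictionary W is built from stored frames, so how (or whether) W is updated is exactly the design choice the paper studies.
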
Article
Full-text available
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely...
Patent
The present invention discloses a human identification system by fusion of face recognition and speaker recognition, a method and a service robot thereof. The system fuses results of the face recognition and the speaker recognition, and further uses confidence index to estimate the confidence level of the two recognition results. If only one of the...
Conference Paper
An intelligent human-robot interface helps a mobile robot to extract external information and interact with a user. User identification information allows a robot to generate appropriate behaviors and make personalized human-robot interaction (PHRI) more natural and safe. Most service robots move around in various application settings, and biometri...
