Zhizheng Wu

Apple Inc. · Siri

PhD

About

67
Publications
17,691
Reads
4,478
Citations
Citations since 2017
2 Research Items
3545 Citations
[Chart: yearly counts for 2017 to 2023; vertical axis from 0 to 600]
Additional affiliations
May 2014 - April 2016
The University of Edinburgh
Position
  • Research Associate

Publications (67)
Article
Full-text available
The task of voice conversion is to modify a source speaker's voice to sound like that of a target speaker. A conversion method is considered successful when the produced speech sounds natural and similar to the target speaker. This paper presents a new voice conversion framework in which we combine frequency warping and exemplar-based method for voi...
Article
Full-text available
Concerns regarding the vulnerability of automatic speaker verification (ASV) technology against spoofing can undermine confidence in its reliability and form a barrier to exploitation. The absence of competitive evaluations and the lack of common datasets has hampered progress in developing effective spoofing countermeasures. This paper describes t...
Article
Accurate modeling and prediction of speech-sound durations is important for generating more natural synthetic speech. Deep neural networks (DNNs) offer powerful models, and large, found corpora of natural speech are easily acquired for training them. Unfortunately, poor quality control (e.g., transcription errors) and phenomena such as reductions a...
Conference Paper
Full-text available
Spoofing detection for automatic speaker verification (ASV), which aims to discriminate between live and artificial speech, has received increasing attention recently. However, previous studies have been carried out on clean data without significant noise. It is still not clear whether spoofing detectors trained on clean speech can generalise w...
Article
Full-text available
Automatic speaker verification (ASV) aims to automatically accept or reject a claimed identity based on a speech sample. Recently, individual studies have confirmed the vulnerability of state-of-the-art text-independent ASV systems under replay, speech synthesis and voice conversion attacks on various databases. However, the behaviours of text-depend...
Article
In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then int...
Conference Paper
Full-text available
Spoofing detection, which discriminates the spoofed speech from the natural speech, has gained much attention recently. Low-dimensional features that are used in speaker recognition/verification are also used in spoofing detection. Unfortunately, they don't capture sufficient information required for spoofing detection. In this work, we investigate...
Article
Full-text available
Spoofing detection for automatic speaker verification (ASV), which aims to discriminate between live speech and attacks, has received increasing attention recently. However, all previous studies have been carried out on clean data without significant additive noise. To simulate real-life scenarios, we perform a preliminary investigation of spoo...
Article
Full-text available
Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although...
Conference Paper
Full-text available
Recently, a number of voice conversion methods have been developed. These methods attempt to improve conversion performance by using diverse mapping techniques in various acoustic domains, e.g. high-resolution spectra and low-resolution Mel-cepstral coefficients. Each individual method has its own pros and cons. In this paper, we introduce a system...
Research
Full-text available
It has recently been shown that deep neural networks (DNN) can improve the quality of statistical parametric speech synthesis (SPSS) when using a source-filter vocoder. Our own previous work has furthermore shown that a dynamic sinusoidal model (DSM) is also highly suited to DNN-based SPSS, whereby sinusoids may either be used themselves as a “...
Article
Any biometric recognizer is vulnerable to spoofing attacks and hence voice biometric, also called automatic speaker verification (ASV), is no exception; replay, synthesis, and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks considered as one of the most challenging for mode...
Article
Full-text available
A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Recently, major progress in speaker verification has led to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology...
Article
While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from...
Article
Full-text available
We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be spa...
Conference Paper
Full-text available
Frequency warping (FW) based voice conversion aims to modify the frequency axis of source spectra towards that of the target. In previous works, the optimal warping function was calculated by minimizing the spectral distance of converted and target spectra without considering the spectral shape. Nevertheless, speaker timbre and identity greatly dep...
Conference Paper
Full-text available
Studies show that professional singing matches well the associated melody and typically exhibits spectra different from speech in resonance tuning and singing formant. Therefore, one of the important topics in speech-to-singing conversion is to characterize the spectral transformation between speech and singing. This paper extends two types of spec...
Article
Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-t...
Chapter
Full-text available
Progress in the development of spoofing countermeasures for automatic speaker recognition is less advanced than equivalent work related to other biometric modalities. This chapter outlines the potential for even state-of-the-art automatic speaker recognition systems to be spoofed. While the use of a multitude of different datasets, protocols and me...
Chapter
Full-text available
While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from...
Chapter
Full-text available
Voice conversion is a process which converts or transforms one speaker's voice towards that of another. The literature shows that voice conversion can be used to spoof or fool an automatic speaker verification system. State-of-the-art voice conversion algorithms can produce high-quality speech signals in real time and are capable of fooling both hu...
Chapter
Full-text available
As with any task involving statistical pattern recognition, the assessment of spoofing and anti-spoofing approaches for voice recognition calls for significant-scale databases of spoofed speech signals. Depending on the application, these signals should normally reflect spoofing attacks performed prior to acquisition at the sensor or microphone. Si...
Article
Any biometric recognizer is vulnerable to direct spoofing attacks and automatic speaker verification (ASV) is no exception; replay, synthesis and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks. Most existing countermeasures use full knowledge of a particular VC system to d...
Conference Paper
Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the que...
Conference Paper
The joint density Gaussian mixture model (JD-GMM) based method has been widely used in voice conversion tasks due to its flexible implementation. However, the statistical averaging effect during estimation of the model parameters results in over-smoothing of the target spectral trajectories. Motivated by the local linear transformation method, which uses...
Conference Paper
A speaker verification system automatically accepts or rejects the claimed identity of a speaker. Recently, major progress in speaker verification has led to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is...
Conference Paper
Conventional statistical transformation functions for voice conversion have been shown to suffer from over-smoothing and over-fitting problems. The over-smoothing problem arises because of the statistical averaging involved in estimating the model parameters for the transformation function. In addition, the large number of parameters in the statistic...
Article
Voice conversion, a technique to change one's voice to sound like that of another, poses a threat to even high-performance speaker verification systems. The vulnerability of text-independent speaker verification systems under spoofing attack, using a statistical voice conversion technique, was evaluated and confirmed in our previous work. In this paper, w...
Article
Although temporal information of speech has been shown to play an important role in perception, most of the voice conversion approaches assume the speech frames are independent of each other, thereby ignoring the temporal information. In this study, we improve conventional unit selection approach by using exemplars which span multiple frames as bas...
Article
A robust voice conversion function relies on a large amount of parallel training data, which is difficult to collect in practice. To tackle the sparse parallel training data problem in voice conversion, this paper describes a mixture of factor analyzers method which integrates prior knowledge from non-parallel speech into the training of conversion...
Article
Full-text available
Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech and synthetic/converted speech. Motivated by the research on phase spectrum in speech perception, in this study, we propose to use features derived from pha...
Conference Paper
Full-text available
Voice conversion technique, which modifies one speaker's (source) voice to sound like another speaker (target), presents a threat to automatic speaker verification. In this paper, we first present new results of evaluating the vulnerability of current state-of-the-art speaker verification systems: Gaussian mixture model with joint factor analysis (...
Article
Full-text available
The current state-of-the-art hidden Markov model (HMM)-based text-to-speech (TTS) can produce highly intelligible, synthesized speech with decent segmental quality. However, its prosody, especially at phrase or sentence level, still tends to be bland. This blandness is partially due to the fact that the state-based HMM is inadequate in capturing gl...
Article
Full-text available
While current TTS systems can deliver quite acceptable segmental quality of synthesized speech for voice user interface applications, their prosody is still perceived by users as “robotic” or not expressive. In this paper, we investigate how to improve TTS prosody prediction and detection. Conditional Random Field (CRF), a discriminative probabil...
Conference Paper
Full-text available
The current state-of-the-art HMM-based TTS can produce highly intelligible output speech and deliver a decent segmental quality. However, its prosody, especially at the phrase or sentence level, tends to be bland. The blandness of synthesized prosody is partially due to the fact that a state-based HMM is rather inadequate in modeling a global, hierarchi...
Conference Paper
Full-text available
This paper models F0 curves with discrete cosine transform (DCT) representations on both syllable-level tone and phrase-level intonation for Chinese Mandarin speech. Decision trees growing with maximum likelihood (ML) and stopping with minimum description length (MDL) are used to cluster very rich context-dependent DCT models into generalized ones...
Conference Paper
HMM-based TTS can produce a highly intelligible voice of decent quality. However, the HMM model degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a m...
Article
Voice conversion technique, which modifies one's (source speaker) voice to sound like another (target speaker), is a threat to automatic speaker verification. In this paper, we present new results evaluating the current state-of-the-art speaker verification system, Gaussian mixture model supervector with joint factor analysis (GMM-JFA) system, ag...
