Frank K. Soong
Microsoft · Speech Research Group
Bachelor of Engineering
About
370 Publications
55,192 Reads
9,742 Citations
Additional affiliations
January 2016 - February 2017
January 2015 - December 2016
May 2005 - present
Publications (370)
While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesi...
This paper aims to improve neural TTS with vector-quantized, compact speech representations. We propose a Vector-Quantized Variational AutoEncoder (VQ-VAE) based feature analyzer to encode acoustic features into sequences with different time resolutions, and quantize them with multiple VQ codebooks to form the Multi-Stage Multi-Codebook Representat...
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions...
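As a rough illustration of the vector-quantization step such a feature analyzer relies on, the sketch below quantizes encoder outputs with multiple codebooks in a residual fashion. The array shapes, codebook sizes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(frames, codebooks):
    """Quantize each frame with multiple codebooks (residual-VQ style).

    frames:    (T, D) array of encoder outputs, e.g. downsampled Mel features.
    codebooks: list of (K_i, D) arrays; each stage quantizes the residual
               left by the previous stage.
    Returns per-stage code indices and the quantized reconstruction.
    """
    recon = np.zeros_like(frames)
    indices = []
    for cb in codebooks:
        residual = frames - recon
        # nearest codeword for every frame under Euclidean distance
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = d2.argmin(axis=1)
        indices.append(idx)
        recon += cb[idx]
    return indices, recon

# toy usage: random 80-dim "Mel" frames, two 64-entry codebooks
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))
codebooks = [rng.normal(size=(64, 80)) for _ in range(2)]
idx, recon = quantize(frames, codebooks)
```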
Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-ba...
End-to-end neural TTS has shown improved performance in speech style transfer. However, the improvement is still limited by the available training data in both target styles and speakers. Additionally, degraded performance is observed when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrar...
Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in buildi...
End-to-end TTS suffers from high data requirements: it is difficult both for costly speech corpora to cover all necessary knowledge and for neural models to learn it, so additional knowledge needs to be injected manually. For example, to capture pronunciation knowledge for languages without a regular orthography, a complicated grapheme-to-...
End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary styl...
This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target...
Advances in end-to-end TTS have shown that synthesized speech prosody can be controlled by conditioning the decoder on speech prosody attribute labels. However, quantitatively annotating the prosody patterns of a large training set is both time-consuming and expensive. To use unannotated data, a variational autoencoder (VAE) has been propos...
In this paper, we propose a cycle-consistent-network-based end-to-end TTS for speaking style transfer, including intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and non-parallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. The model is usually trained in a pai...
Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention-based one, e.g. Transformer, to characterize a speech utterance, has achieved significant improvements in speech synthesis. However, it is still very challenging to deal with different sentence lengths, particularly for long sentences where sequ...
Sentence-level pronunciation assessment is important for Computer-Assisted Language Learning (CALL). Traditional speech pronunciation assessment, based on the Goodness of Pronunciation (GOP) algorithm, has some weaknesses in assessing a speech utterance: 1) phoneme GOP scores cannot be easily translated into a sentence score with a simple average for...
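For readers unfamiliar with GOP, it is commonly computed as the duration-normalized log posterior of the canonical phone over its force-aligned frames. The sketch below, with hypothetical inputs and helper names, shows that common formulation and the simple phone-score average that the abstract argues is inadequate as a sentence score.

```python
import numpy as np

def gop_scores(frame_log_posteriors, phone_segments):
    """Duration-normalized log-posterior GOP per phone (a common formulation).

    frame_log_posteriors: (T, P) log posteriors over phones from an acoustic model.
    phone_segments: list of (canonical_phone_id, start_frame, end_frame)
                    obtained from forced alignment.
    """
    scores = []
    for phone, start, end in phone_segments:
        seg = frame_log_posteriors[start:end, phone]
        scores.append(float(seg.mean()))   # average log posterior over the segment
    return scores

def naive_sentence_score(phone_gops):
    """The simple average the abstract criticizes as a sentence-level score."""
    return float(np.mean(phone_gops))
```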
Spoken language understanding (SLU) in human machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent respons...
End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to limitations of the corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. We first...
We propose an early-stop strategy for improving the performance of speaker diarization, based upon agglomerative hierarchical clustering (AHC). The strategy generates more clusters than the anticipated number of speakers. Based on these initial clusters, an exhaustive search is used to find the best possible combination of clusters. The resultant c...
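A minimal sketch of the early-stop idea, assuming per-segment speaker embeddings and SciPy's agglomerative clustering: stop AHC at more clusters than the expected number of speakers, then exhaustively regroup the clusters and keep the best-scoring combination. The cohesion score and parameter names here are illustrative, not the authors' criterion.

```python
from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def partitions_into(items, n_blocks):
    """Yield every way to split `items` into exactly `n_blocks` non-empty groups."""
    if not items:
        if n_blocks == 0:
            yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions_into(rest, n_blocks):      # put `first` into an existing group
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
    for part in partitions_into(rest, n_blocks - 1):  # or open a new group with `first`
        yield [[first]] + part

def early_stop_diarization(embeddings, n_speakers, extra=2):
    """AHC stopped early (more clusters than speakers), then exhaustive regrouping."""
    labels = fcluster(linkage(embeddings, method="average", metric="cosine"),
                      t=n_speakers + extra, criterion="maxclust")
    uniq = np.unique(labels)
    centroids = [embeddings[labels == c].mean(axis=0) for c in uniq]

    def cohesion(groups):
        # favor regroupings whose merged clusters have similar centroids
        s = 0.0
        for g in groups:
            for a, b in combinations(g, 2):
                ca, cb = centroids[a], centroids[b]
                s += float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
        return s

    best = max(partitions_into(list(range(len(uniq))), n_speakers), key=cohesion)
    speaker_of = {uniq[c]: spk for spk, group in enumerate(best) for c in group}
    return np.array([speaker_of[l] for l in labels])
```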
We propose a linear prediction (LP)-based waveform generation method via the WaveNet vocoding framework. A WaveNet-based neural vocoder has significantly improved the quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target database contains a massive amount of acoustical infor...
In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis systems by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. Howe...
Neural end-to-end TTS can generate very high-quality synthesized speech, even close to human recordings on in-domain text. However, it performs unsatisfactorily when scaled to challenging test sets. One concern is that the attention-based encoder-decoder network adopts an autoregressive generative sequence model with the limitation...
The end-to-end TTS, which can predict speech directly from a given sequence of graphemes or phonemes, has shown improved performance over the conventional TTS. However, its predicting capability is still limited by the acoustic/phonetic coverage of the training data, usually constrained by the training set size. To further improve the TTS quality i...
End-to-end, autoregressive model-based TTS has shown significant performance improvements over the conventional one. However, the autoregressive module training is affected by exposure bias, i.e., the mismatch between the distributions of real and predicted data. While real data is available in training, in testing only predicted da...
Despite significant advances of deep learning in separating speech sources mixed in a single channel, a same-gender speaker mix, i.e., male-male or female-female, is still more difficult to separate than an opposite-gender mix. In this study, we propose a pitch-aware speech separation approach to improve the speech separation performance...
In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework. The proposed method utilizes the multiple input encoder to take three levels of text information, i.e., phoneme sequence, pre-trained word embedding, and grammatical structure of sentences from parser as the input...
Neural TTS has shown it can generate high quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS for adapting the system to new speakers with only several minutes of speech or enhancing a premium voice by utilizing the data from other speakers for richer contextual coverage and better generali...
We propose a linear prediction (LP)-based waveform generation method via WaveNet speech synthesis. The WaveNet vocoder, which uses speech parameters as a conditional input of WaveNet, has significantly improved the quality of statistical parametric speech synthesis system. However, it is still challenging to effectively train the neural vocoder whe...
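The LP-based scheme rests on classical source-filter analysis: the waveform is split into an LP (vocal tract) filter and an excitation (residual) signal, and the neural vocoder models the excitation rather than the raw waveform. Below is a minimal analysis/synthesis round trip, assuming librosa and scipy are available; it is a sketch of the underlying signal processing, not the paper's training code.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_analysis(frame, order=16):
    """LP coefficients and excitation (residual) of one speech frame."""
    a = librosa.lpc(frame, order=order)      # a[0] == 1
    excitation = lfilter(a, [1.0], frame)    # e[n] = x[n] + sum_k a_k x[n-k]
    return a, excitation

def lp_synthesis(a, excitation):
    """Re-synthesize the frame by running the excitation through the LP filter."""
    return lfilter([1.0], a, excitation)

# toy round trip on a synthetic frame
x = np.sin(2 * np.pi * 220 * np.arange(400) / 16000).astype(np.float64)
a, e = lp_analysis(x)
x_hat = lp_synthesis(a, e)   # ~= x; a neural vocoder would instead predict e
```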
We propose a Speaker Independent Deep Neural Net (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between source and target speakers is equalized with SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acousti...
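To make the mapping concrete: each frame can be represented by its SI-DNN phonetic posterior vector, and source frames paired with the target frames closest under a symmetric KL divergence. The toy sketch below uses hypothetical helper names and shows only this core matching idea under that assumption.

```python
import numpy as np

def sym_kld(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def match_frames(src_post, tgt_post):
    """For each source frame, index of the target frame with minimum symmetric KLD.

    src_post: (Ts, C) SI-DNN posteriors of the source speaker's frames.
    tgt_post: (Tt, C) SI-DNN posteriors of the target speaker's frames.
    """
    return np.array([min(range(len(tgt_post)),
                         key=lambda j: sym_kld(s, tgt_post[j]))
                     for s in src_post])
```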
In this paper, given the speaker bottleneck feature vectors extracted with speaker discriminant neural networks, we focus on using the sequential speaker characteristics for text-dependent speaker verification. In each evaluation trial, speaker supervectors are used as the representations of the sequential speaker characteristics rendered in the co...
In this article, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve...
We propose to detect mispronunciations in a language learner's speech via a discriminatively trained DNN in the phonetic space. The posterior probabilities of “senones” populated in a decision tree are trained and predicted speaker-independently. Acoustic features of each input segment (with preceding and succeeding contexts of several frames) are m...
This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnoses (MD&D). The first pass of mispronunciation detection does not require explicit phonetic error pattern modeling. The framework instantiates a set of antiphones and a filler model to augment the original phone model for each can...
This paper proposes a deep bidirectional long short-term memory approach to modeling the long-context, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio streams are converted into acoustic features,...
Speaker verification performance degrades when input speech is tested in different sessions over a long period of time chronologically. Common ways to alleviate the long-term impact on performance degradation are enrollment data augmentation, speaker model adaptation, and adapted verification thresholds. From a point of view in features of a patter...
This paper presents a two-pass framework of mispronunciation detection and diagnosis (MD&D) — detection followed by diagnosis, without the need of explicit error pattern modeling, so that the main efforts can be devoted to improving acoustic modeling by discriminative training (or by applying alternative models like neural nets). The framework inst...
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents. In this study, we propose to use BLSTM-RNN for a unified tagging solution that can be applied to various tagging tasks including part-of-speech ta...
This paper investigates F0 modeling of speech in deep neural networks (DNN) for statistical parametric speech synthesis (SPSS). Recently, DNN has been applied to the acoustic modeling of SPSS and has shown good performance in characterizing complex dependencies between contextual features and acoustic observations. However, the additive nature and...
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents, while word embedding has been demonstrated as a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to...
Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that is designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/vis...
The popular i-vector approach to speaker recognition represents a speech segment as an i-vector in a low-dimensional space. It is well known that i-vectors involve both speaker and session variances, and therefore additional discriminative approaches are required to extract speaker information from the 'total variance' space. Among various methods,...
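As background, a common lightweight back-end for i-vector verification is length normalization followed by cosine scoring against a tuned threshold. The sketch below shows only that baseline (the threshold value is arbitrary), not the discriminative methods the abstract refers to.

```python
import numpy as np

def length_normalize(x):
    """Project an i-vector onto the unit sphere (a common pre-processing step)."""
    return x / np.linalg.norm(x)

def cosine_score(enroll_ivec, test_ivec):
    """Cosine-similarity verification score between two i-vectors."""
    return float(length_normalize(enroll_ivec) @ length_normalize(test_ivec))

def verify(enroll_ivec, test_ivec, threshold=0.5):
    """Accept the trial if the score clears a (system-dependent) threshold."""
    return cosine_score(enroll_ivec, test_ivec) >= threshold
```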
Mispronunciation detection is an important part of a Computer-Aided Language Learning (CALL) system. By automatically pointing out where mispronunciations occur in an utterance, a language learner can receive informative and to-the-point feedback. In this paper, we improve mispronunciation detection performance with a Deep Neural Network (DNN) tra...
In this paper, we propose a Neural Network (NN) based Logistic Regression (LR) classifier for improving the phone mispronunciation detection rate in a Computer-Aided Language Learning (CALL) system. A general neural network with multiple hidden layers for extracting useful speech features is first trained with pooled training data, and then phone-dep...
In the voice conversion task, prosody conversion, especially pitch conversion, is a very challenging research topic because of the discontinuous nature of pitch. Conventionally, pitch conversion is achieved by adjusting the mean and variance of the source pitch distribution to those of the target pitch distribution. This method removes most of the detaile...
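The conventional mean/variance adjustment referred to above is typically applied in the log-F0 domain, as in the sketch below (function and variable names are mine); the abstract's point is that such a global transform discards most of the finer pitch detail.

```python
import numpy as np

def convert_pitch(src_f0, tgt_mean, tgt_std):
    """Conventional pitch conversion: match log-F0 mean and variance to the target.

    src_f0:   source F0 contour in Hz, with unvoiced frames marked as 0.
    tgt_mean, tgt_std: log-F0 statistics gathered from target-speaker data.
    """
    f0 = np.asarray(src_f0, dtype=float).copy()
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    normed = (log_f0 - log_f0.mean()) / log_f0.std()   # zero-mean, unit-variance
    f0[voiced] = np.exp(normed * tgt_std + tgt_mean)    # rescale to target statistics
    return f0
```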
In this paper, we propose an HMM trajectory-guided, real image sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is recorded first for training a statistical Hidden Markov Model (HMM) of Lips movement. The HMM is then used to generate the dynamic trajectory of lips movement for given speec...
Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Poster...
An automated method of providing a pronunciation of a word to a remote device is disclosed. The method includes receiving an input indicative of the word to be pronounced. The method further includes searching a database having a plurality of records. Each of the records has an indication of a textual representation and an associated indication of...
We propose a novel bandwidth expansion algorithm for extending a narrowband speech signal to wideband by exploiting segment examples pre-stored in a speaker-independent database. Both narrowband and wideband representations of speech signals are pre-stored in the corpus, and they are dynamically chopped into variable-length segments. Narrowband segment...
In this paper we investigate a Deep Neural Network (DNN) based approach to acoustic modeling of tonal language and assess its speech recognition performance with different features and modeling techniques. Mandarin Chinese, the most widely spoken tonal language, is chosen for testing the tone related ASR performance. Furthermore, the DNN-trained, t...
Deep Neural Network (DNN), which can model a long-span, intricate transform compactly with a deep-layered structure, has recently been investigated for parametric TTS synthesis with a fairly large corpus (33,000 utterances) [6]. In this paper, we examine DNN TTS synthesis with a moderate size corpus of 5 hours, which is more commonly used for param...
Neural network (NN) based voice conversion, which employs a nonlinear function to map the features from a source to a target speaker, has been shown to outperform GMM-based voice conversion approach [4-7]. However, there are still limitations to be overcome in NN-based voice conversion, e.g. NN is trained on a Frame Error (FE) minimization criterio...
Feed-forward, deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree-clustered, context-dependent HMM TTS systems [1, 4]. However, the long-time-span contextual effect in a speech utterance is still not easy to accommodate, due to the intrinsic feed-forward nature of DNN-based modeling. Al...
In conventional HMM-based TTS, the micro-structure of the F0 contour is modeled at the state level via a (clustered) decision tree. However, it is difficult for decision-tree-based state-level modeling to capture the long-term structure of speech prosody, say at the intonation-phrase level, due to its greedy search nature and usually sparse training data...
Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable, and has the voice characteristics of a target speaker that provided the target speech corpus. A formant-based frequency warping is performed on the fundamental freque...
Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a...