Guanlong Zhao's research while affiliated with Texas A&M University and other places

Publications (19)

Article
Foreign accent conversion (FAC) aims to create a new voice that has the voice identity of a given second-language (L2) speaker but with a native (L1) accent. Previous FAC approaches usually require training a separate model for each L2 speaker and, more importantly, generally require considerable speech data from each L2 speaker for training. To ad...
Article
Foreign accent conversion (FAC) is the problem of generating a synthetic voice that has the voice identity of a second-language (L2) learner and the pronunciation patterns of a native (L1) speaker. This synthetic voice has been referred to as a “golden-speaker” in the pronunciation-training literature. FAC is generally achieved by building a voice-...
Conference Paper
Full-text available
This paper presents a methodology to study the role of non-native accents in talker recognition by humans. The methodology combines a state-of-the-art accent-conversion system to resynthesize the voice of a speaker with an accent different from her/his own, and a protocol for perceptual listening tests to measure the relative contribution of accent an...
Preprint
Full-text available
Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme-based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Bu...
Article
The accurate identification of likely segmental pronunciation errors produced by nonnative speakers of English is a longstanding goal in pronunciation teaching. Most lists of pronunciation errors for speakers of a particular first language (L1) are based on the experience of expert linguists or teachers of English as a second language (ESL) and Eng...
Article
Sparse-coding techniques for voice conversion assume that an utterance can be decomposed into a sparse code that only carries linguistic contents, and a dictionary of atoms that captures the speakers' characteristics. However, conventional dictionary-construction and sparse-coding algorithms rarely meet this assumption. The result is that the spars...
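To make the stated assumption concrete, here is a minimal sketch (Python with scikit-learn) of the idealized decomposition that this article says conventional algorithms rarely achieve: spectral frames are factored into a speaker-specific dictionary of atoms and a sparse code assumed to carry only linguistic content, and conversion swaps dictionaries while keeping the code. The array names, dimensions, and random stand-in data are hypothetical, not the paper's setup.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning, sparse_encode

    rng = np.random.default_rng(0)
    source_frames = rng.standard_normal((500, 24))  # stand-in spectral frames, source speaker
    target_frames = rng.standard_normal((500, 24))  # stand-in spectral frames, target speaker

    # One dictionary of atoms per speaker; the atoms are meant to capture speaker
    # characteristics, while the sparse code is meant to capture linguistic content.
    src_dict = DictionaryLearning(n_components=32, transform_algorithm='lasso_lars',
                                  random_state=0).fit(source_frames).components_
    tgt_dict = DictionaryLearning(n_components=32, transform_algorithm='lasso_lars',
                                  random_state=0).fit(target_frames).components_

    # Encode a source utterance as a sparse code over the source dictionary ...
    code = sparse_encode(source_frames, src_dict, algorithm='lasso_lars')

    # ... and reconstruct with the target dictionary, under the assumption that the
    # code is speaker-independent and the two dictionaries have corresponding atoms.
    converted_frames = code @ tgt_dict

For the dictionary swap to be meaningful, the two sets of atoms must correspond (e.g., be built from phonetically aligned data); independently learned dictionaries generally do not, which illustrates why the abstract notes that conventional algorithms rarely satisfy the assumption.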
Article
The type of voice model used in Computer Assisted Pronunciation Instruction is a crucial factor in the quality of practice and the amount of uptake by language learners. As an example, prior research indicates that second-language learners are more likely to succeed when they imitate a speaker with a voice similar to their own, a so-called “golden...
Article
Accent conversion (AC) aims to transform non-native utterances to sound as if the speaker had a native accent. This can be achieved by mapping source speech spectra from a native speaker into the acoustic space of the target non-native speaker. In prior work, we proposed an AC approach that matches frames between the two speakers based on their aco...
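A rough illustration of the frame-matching step mentioned above (pairing frames from the two speakers by acoustic similarity and using the pairs as pseudo-parallel data for a spectral mapping) is sketched below in Python; the feature choice, variable names, and nearest-neighbour matcher are assumptions for illustration, not the exact procedure in the article.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    native_feats = rng.standard_normal((400, 13))     # stand-in per-frame features, native (L1) speaker
    nonnative_feats = rng.standard_normal((350, 13))  # stand-in per-frame features, non-native (L2) speaker

    # For each native frame, find the acoustically closest non-native frame.
    matcher = NearestNeighbors(n_neighbors=1).fit(nonnative_feats)
    _, idx = matcher.kneighbors(native_feats)

    # The resulting frame pairs can serve as pseudo-parallel training data for a
    # mapping from native spectra into the L2 speaker's acoustic space.
    frame_pairs = list(zip(range(len(native_feats)), idx[:, 0]))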
Article
Full-text available
In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, each L1 cont...
Conference Paper
In previous work we presented a Sparse, Anchor-Based Representation of speech (SABR) that uses phonemic “anchors” to represent an utterance with a set of sparse non-negative weights. SABR is speaker-independent: combining weights from a source speaker with anchors from a target speaker can be used for voice conversion. Here, we present an extension...
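As a minimal sketch of the anchor-based coding described above, the snippet below (Python with NumPy/SciPy) represents each frame as non-negative weights over a set of source-speaker anchors and reuses those weights with target-speaker anchors for conversion; the anchors here are random stand-ins, and the explicit sparsity penalty used in SABR is omitted.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    n_anchors, n_dims = 40, 24
    src_anchors = rng.standard_normal((n_anchors, n_dims))  # stand-in source anchors (one per phoneme class)
    tgt_anchors = rng.standard_normal((n_anchors, n_dims))  # stand-in target anchors, same phoneme order
    utterance = rng.standard_normal((100, n_dims))           # stand-in source spectral frames

    # Per-frame non-negative weights over the source anchors (plain NNLS here,
    # without the additional sparsity term used in SABR).
    weights = np.array([nnls(src_anchors.T, frame)[0] for frame in utterance])

    # Voice conversion: apply the (assumed speaker-independent) weights to the
    # target speaker's anchors.
    converted = weights @ tgt_anchors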

Citations

... Specifically, speech accent reduction aims to endow accented speech uttered by non-native speakers with the normal pronunciation patterns of native speakers, while keeping the speaker identity unchanged. Most previous work on accent reduction follows the voice-conversion paradigm [6][7][8][9][10][11] and generates accent-reduced speech at the utterance level. Correct-Speech provides a new paradigm for the accent-reduction problem. ...
... Siamese neural networks were adopted to contrast hypothetically disordered consonant segments with typical ones [7,8]. Posterior features were derived from automatic speech recognition systems to facilitate mispronunciation detection in disordered child speech [9,10]. Considering that a consonant error can alter the acoustic characteristics of its neighbouring vowel, our recent work proposed detecting consonant errors from consonant-vowel speech segments [11]. ...
... Recognition-synthesis-based VC is another representative approach to non-parallel VC and has been shown to achieve noticeable performance improvements [4][5][6][7]. Such methods provide more robust linguistic representations based on knowledge from an automatic speech recognition (ASR) model trained on large amounts of data. ...
... Work on non-English second-language datasets includes Italian, Chinese, Finnish, Spanish, Arabic, and German [45]. Some interesting research studies on NLI involving non-native English speech are listed in Table 1 [4,11,15,38]. The phonologies of the different languages spoken in India determine the similarities and differences in Indian English (IE) [41]. ...
... Teachers can use multimodality: its synergistic effect enables students to understand the characteristics of English pronunciation through hearing, vision, and touch, and thereby improve their English pronunciation. Figure 1 shows the multimodal model [1][2][3][4][5][6][7][8][9][10]. ...
... Voice conversion (VC) aims to convert utterances from a source speaker so that they sound as if a target speaker had produced them. Conventional VC approaches [1,2,3,4,5] usually require training a model for each speaker pair using parallel corpora. Alternative approaches have emerged in recent years that do not require parallel corpora and can build a universal model for all pairs of speakers [6,7,8,9,10,11,12,13]. ...
... Pronunciation diagnosis, training, and evaluation systems were developed using the attention mechanism and various types of neural networks (e.g., convolutional, long short-term memory) [23]-[27]. For instance, a multimodal system illustrating speech features [28] and an interactive tool generating personalized voice models [29] have recently been developed. ...
... This work has been mainly inspired by the system presented in (Zhao et al., 2019). The main modifications are the use of the French language and conversion between two different speakers or different styles within the same language, instead of foreign accent conversion. ...
... Other researchers employ an accent classifier to explicitly annotate utterances with certain accent-indicating features that are in turn added to the input of ASR models [13]. Modeling approaches include accent conversion [14], where a transformation is applied to a non-native utterance to make it sound as if the speaker had a native accent. For a thorough overview of research on improving speech recognition for accents, see [15]. ...