Yi Xu

University College London | UCL · Department of Speech, Hearing and Phonetic Sciences

PhD


284
Publications
75,030
Reads
8,435
Citations
Introduction
My research is concerned with what I believe is the central question about human speech, namely, how exactly can it be so effective in transmitting a vast amount of meaning through a highly restrictive articulation process? Directly related to this question are a number of sub-questions that are just as fundamental:
  • What exactly are the kinds of meanings transmitted by speech?
  • How are the meanings encoded by the articulatory system?
  • And how are the meanings decoded in perception?
Additional affiliations
October 2004 - present
University College London
Position
  • Professor
September 2003 - September 2004
Haskins Laboratories
Position
  • Senior Researcher
September 2002 - September 2003
University of Chicago
Position
  • Research Assistant Professor
Education
September 1987 - December 1993
University of Connecticut
Field of study
  • Linguistics


Publications (284)
Chapter
Full-text available
The phonetics of emotion is about the acoustic-phonetic properties of the emotional facets of human vocalization. Conventionally, these properties are studied as correlates of a person’s internal states arising from reactions to the environment, where the internal states are defined by influential psychological theories of emotion. A more recent pe...
Article
This study is a preliminary investigation of intonation in Emirati Arabic (EA) (an under-researched Arabic dialect), using systematic acoustic analysis and computational modelling. First, we investigated the prosodic realisation of information focus and contrastive focus at sentence-initial, -penultimate and -final positions. The analysis of 1980 E...
Article
Full-text available
Computational approaches have an important role to play in understanding the complex process of speech acquisition, in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and...
Article
Full-text available
In English, a sentence like "He made out our intentions." could be misperceived as "He may doubt our intentions." because the coda /d/ sounds like it has become the onset of the next syllable. The nature and the occurrence condition of this resyllabification phenomenon are unclear, however. Previous empirical studies mainly relied on listener judgm...
Article
Full-text available
It has long been held that languages of the world are divided into rhythm classes so that they are either stress-timed, syllable-timed or mora-timed. It has also long been known that duration serves various informational functions in speech. But it is unclear whether these two kinds of uses of duration are complementary to each other, or they...
Article
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based...
Article
Full-text available
This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a 4AFC identification task, we compared native Japanese listeners’ focus identification accuracy in different lexical accent × focus location conditions using resynthesised speech stimuli, which varied only in fundamental frequency....
Conference Paper
Full-text available
Evoc-Learn is a system for simulating early vocal learning of spoken language in ways that can overcome some of the major bottlenecks in vocal learning. The system consists of VocalTractLab, a geometrical three-dimensional vocal tract model for simulating aeroacoustics and articulatory dynamics, a coarticulation model for controlling the temporal d...
Conference Paper
Full-text available
While numerous studies on automatic speech recognition have been published in recent years describing data augmentation strategies based on time or frequency domain signal processing, few works exist on the artificial extensions of training data sets using purely synthetic speech data. In this work, the German KIEL corpus was augmented with synthet...
Conference Paper
Full-text available
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
Preprint
Full-text available
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the main measure of learning success. Thereby, a novel approach for artificial early vocal learning is presented that utilizes deep neural network-...
Chapter
This chapter reviews studies on contextual tonal variation in Chinese languages, often referred to as ‘tonal coarticulation’ in the literature. We start by explaining why the term ‘contextual variation’ is preferred to ‘coarticulation’ for tones, before introducing different types of contextual variation observed in Chinese languages. The following...
Article
Previous research has shown that post-focus compression (PFC), the reduction of pitch range and intensity after a focused word in an utterance, is a robust means of marking focus, but it is present only in some languages. The presence of PFC appears to follow language family lines. The present study is a further exploration of the distribution of...
Preprint
Full-text available
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we used c...
Article
Full-text available
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we u...
Preprint
Full-text available
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
Article
Full-text available
Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a...
Article
Full-text available
It has been widely assumed that in speech perception it is imperative to first detect a set of distinctive properties or features and then use them to recognize phonetic units like consonants, vowels, and tones. Those features can be auditory cues or articulatory gestures, or a combination of both. There have been no clear demonstrations...
Chapter
Full-text available
An introduction to the range of current theoretical approaches to the prosody of spoken utterances, with practical applications of those theories. Prosody is an extremely dynamic field, with a rapid pace of theoretical development and a steady expansion of its influence beyond linguistics into such areas as cognitive psychology, neuroscience, c...
Article
Full-text available
This study tested the hypothesis that consonant and vowel are synchronised at the syllable onset, and that such synchronised co-onset is the essence of coarticulation. Articulatory data were collected for Mandarin Chinese, using Electromagnetic Articulography (EMA), and acoustic data were collected simultaneously. As a departure from conventional a...
Conference Paper
Full-text available
While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed...
Article
When pitch is explicitly modelled for parametric speech synthesis, microprosodic variations of the fundamental frequency f0 are usually disregarded by current intonation models. While there are numerous studies dealing with the nature and the origin of microprosody, little research has been done on its audibility and its effect on the naturalness o...
Article
Full-text available
F0 variation is a crucial feature in speech prosody, which can convey linguistic information such as focus and paralinguistic meanings such as surprise. How can multiple layers of information be represented with F0 in speech: are they divided into discrete layers of pitch or overlapped without clear divisions? We investigated this question by asses...
Article
Full-text available
In this study, we revisit consonantal perturbation of F0 in English, taking into particular consideration the effect of alignment of F0 contours to segments and the F0 extraction method in the acoustic analysis. We recorded words differing in consonant voicing, manner of articulation, and position in syllable, spoken by native speakers of American...
Chapter
This study examines the effects of segments, intonation and rhythm on the perception of second language (L2) accentedness and comprehensibility by focusing on a tone language, Mandarin Chinese. Fifteen Chinese sentences were manipulated by transferring the segments, intonation and rhythm between native and L2 speakers. 64 Chinese judges listened to...
Article
Full-text available
Although pre-low raising (PLR) has been extensively studied as a type of contextual tonal variation, its underlying mechanism is barely understood. This paper explored the effects of phonetic vs phonological duration on PLR in Cantonese and Thai and examined how speech rate and vowel quantity interact with its realization in these languages, resp...
Conference Paper
Full-text available
The complex f0 variations in continuous speech make it rather difficult to perform automatic recognition of tones in a language like Mandarin Chinese. In this study, we tested the use of the target approximation model (TAM) for continuous tone recognition on two datasets. TAM simulates f0 production from the articulatory point of view and so allows to d...
Article
The current study investigated the contribution of different acoustic dimensions to tonal contrasts in Pahari, an understudied language in the Pakistan-administered part of Kashmir. While previous research on the tonal languages of the region focused only on overall pitch patterns, the present study analyzed fundamental frequency (F0), duration, i...
Conference Paper
Full-text available
In this study we tested the hypothesis that consonant and vowel articulations start at the same time at syllable onset [1]. Articulatory data was collected for Mandarin Chinese using Electromagnetic Articulography (EMA), which tracks flesh-point movements in time and space. Unlike the traditional velocity threshold method [2], we used a triplet me...
Conference Paper
Full-text available
In this study, a state-of-the-art articulatory speech synthesiser was used as the basis for simulating the exploration of CV sounds imitating speech stimuli. By adopting a relevant kinematic model and systematically reducing the search space of consonant articulatory targets, intelligible CV sounds can be found. Derivative-free optimisation strateg...
Conference Paper
Full-text available
Typologically, some languages mark narrow focus with ‘post-focus compression’ (PFC) while others do not. For those which do, PFC is easily lost through bilingualism, at both societal and individual levels. At the societal level, when in contact with a –PFC language (e.g. Southern Min), a likely +PFC language can lose this prosodic feature (e.g. Tai...
Conference Paper
Full-text available
Many theories assume that speech perception is done by first extracting features like the distinctive features, tonal features or articulatory gestures before recognizing phonetic units such as segments and tones. But it is unclear how exactly extracted features can lead to effective phonetic recognition. In this study we explore this issue by usin...
Conference Paper
Full-text available
This study investigates Chongqing Dialect, a language largely used in Southwest China which is mutually intelligible to Beijing Mandarin speakers. Phonetic variations triggered by focus in Chongqing Dialect, especially in the form of post-focus compression (PFC), are investigated in terms of max F0, mean F0, duration and intensity. A follow-up perc...
Preprint
Full-text available
The way infants use auditory cues to learn to speak despite the acoustic mismatch of their vocal apparatus is a hot topic of scientific debate. The simulation of early vocal learning using articulatory speech synthesis offers a way towards gaining a deeper understanding of this process. One of the crucial parameters in these simulations is the choi...
Article
Full-text available
Speech is a highly skilled motor activity that shares a core problem with other motor skills: how to reduce the massive degrees of freedom (DOF) to the extent that the central nervous control and learning of complex motor movements become possible. It is hypothesized in this paper that a key solution to the DOF problem is to eliminate most of the t...
Article
Many studies across languages have recognised that focus substantially alters the prosodic structure of a sentence not only by increasing F0, intensity, and duration of the focused words but also by compressing the range of pitch and intensity of the post-focus words. Studies, however, are still not fully clear regarding the main effects of focus o...
Conference Paper
Full-text available
Article
Full-text available
Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical da...
Article
Full-text available
Ground-breaking studies on how Bangkok Thai tones have changed over the past 100 years (Pittayaporn 2007, 2018; Zhu et al. 2015) reveal a pattern that Zhu et al. (2015) term the “clockwise tone shift cycle”: low > falling > high level or rising-falling > rising > falling-rising or low. The present study addresses three follow-up questions: (1)...
Conference Paper
Full-text available
In this paper, we report findings of a major difference between Mandarin and English in terms of means of marking major prosodic boundaries. We performed detailed duration analysis on two large corpora, one in each language, using pre-labelled break indices as a reference indicator of boundary strength. Results showed that pre-boundary syllable dur...
Conference Paper
Full-text available
The present study tested the idea that coarticulation, despite involving overlap of articulatory gestures, is achieved by sequential target approximation at the level of individual articulator dimensions. For example, CV co-onset in a velar stop can be achieved by having the tongue body vertically move upward for a closure contact, while at the sam...
Conference Paper
Full-text available
This paper reports preliminary results of our effort to address the acoustic-to-articulatory inversion problem. We tested an approach that simulates speech production acquisition as a distal learning task, with acoustic signals of natural utterances in the form of MFCC as input, VocalTractLab, a 3D articulatory synthesizer controlled by target appro...
Article
Full-text available
Korean and English are both known to show on-focus pitch range expansion and post-focus pitch range compression (PFC). But it is not clear if this prosodic similarity would make it easy for Korean speakers to learn English focus prosody. In the present study, we conducted a production experiment using phone number strings to examine whether Korean...
Conference Paper
Full-text available
This paper presents a novel method to estimate the pitch target parameters of the target approximation model (TAM). The TAM allows the compact representation of natural pitch contours on a solid theoretical basis and can be used as an intonation model for text-to-speech synthesis. In contrast to previous approaches, the method proposed here estimat...
Article
This paper presents findings of the first systematic acoustic analysis of focus prosody in Hijazi Arabic (HA), an under-researched Arabic dialect. A question-answer paradigm was used to elicit information and contrastive focus at different sentence locations in comparison with their neutral focus counterparts. Systematic acoustic analyses were perf...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
Article
Full-text available
This paper introduces FormantPro, a Praat-based tool for large-scale, systematic analysis of formant movements, especially for experimental data. The program generates a rich set of output metrics, including continuous contours like time-normalized formant trajectories and formant velocity profiles suitable for direct graphical comparisons, and dis...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Preprint
Full-text available
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
Preprint
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Conference Paper
Full-text available
Previous research has shown that post-focus compression (PFC), the reduction of F0 and intensity after a focused word, is present in some languages but absent in many others. It has been hypothesized that the cross-linguistic distribution of PFC parallels that of the Nostratic macro-family. The present study is a test of this Nostratic-origin hypo...
Conference Paper
Full-text available
The typological feature ‘post-focus fo range compression’ (PFC) is often considered an all-or-nothing phenomenon, being completely absent in -PFC languages but applied across-the-board in +PFC languages. This paper presents production data from Japanese and shows that, within a language, the realisation of PFC can be conditional upon lexical prosod...
Conference Paper
Full-text available
This study is a reexamination of the rhythm class hypothesis through an investigation of isochrony tendency in English, an alleged stress-timed language, and Chinese, an alleged syllable-timed language. We compared the relationship between segment and syllable duration in a corpus from each language. The results show that the correlation of segment...
Article
Full-text available
Music and speech both communicate emotional meanings in addition to their domain-specific contents. But it is not clear whether and how the two kinds of emotional meanings are linked. The present study is focused on exploring the emotional connotations of musical timbre of isolated instrument sounds through the perspective of emotional speech proso...
Article
It is often assumed, explicitly or implicitly, that speakers generate special cues in whispered tone and intonation to make up for the absence of fundamental frequency. The present study examined this assumption with one production and three perception experiments. The production experiment compared duration, intensity, formants and spectral tilt o...
Conference Paper
Full-text available
A posh accent in British English is associated with the upper class. Previous research on poshness has been centred on vocabulary, grammar and phonology, but little is known about the phonetic properties. This study, as part of a larger project, is an attempt to connect posh accent with attractiveness of voice through a common set of dimensions originati...
Article
Full-text available
Poshness refers to how much a British English speaker sounds upper class when they talk. Popular descriptions of posh English mostly focus on vocabulary, accent and phonology. This study tests the hypothesis that, as a social index, poshness is also manifested via phonetic properties known to encode vocal attractiveness. Specifically, posh English,...
Conference Paper
Full-text available
Part of speech (POS hereafter) is known to affect both duration and F0, such that nouns are longer and higher in F0 than verbs. In this study we tested the hypothesis that the POS effects are actually a word frequency effect, and that this effect is predictable from information theory. We tested this hypothesis by comparing 44 phonologically match...
Conference Paper
Full-text available
The present study investigated three northern Wu dialects: Wuxi, Suzhou, and Ningbo. It is found that, in all three dialects, focus is encoded by increasing the maximum F0 and duration of focused words, and lowering and compressing the F0 and pitch range of post-focus words. These results are consistent with previous findings about Wu dialect in Sh...