Yi Xu

University College London | UCL · Department of Speech, Hearing and Phonetic Sciences

PhD

About

256
Publications
64,038
Reads
7,253
Citations
Introduction
My research is concerned with what I believe is the central question about human speech, namely, how exactly can it be so effective in transmitting a vast amount of meaning through a highly restrictive articulation process? Directly related to this question are a number of sub-questions that are just as fundamental:
  • What exactly are the kinds of meanings transmitted by speech?
  • How are the meanings encoded by the articulatory system?
  • And how are the meanings decoded in perception?
Additional affiliations
October 2004 - present
University College London
Position
  • Professor
September 2003 - September 2004
Haskins Laboratories
Position
  • Senior Researcher
September 2002 - September 2003
University of Chicago
Position
  • Research Assistant Professor
Education
September 1987 - December 1993
University of Connecticut
Field of study
  • Linguistics

Publications (256)
Article
Previous research has shown that post-focus compression (PFC), the reduction of pitch range and intensity after a focused word in an utterance, is a robust means of marking focus, but it is present only in some languages. The presence of PFC appears to follow language family lines. The present study is a further exploration of the distribution of...
Preprint
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we used c...
Preprint
Full-text available
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
Article
Full-text available
Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a...
Article
Full-text available
It has been widely assumed that in speech perception it is imperative to first detect a set of distinctive properties or features and then use them to recognize phonetic units like consonants, vowels, and tones. Those features can be auditory cues or articulatory gestures, or a combination of both. There have been no clear demonstrations...
Conference Paper
Full-text available
While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed...
Article
When pitch is explicitly modelled for parametric speech synthesis, microprosodic variations of the fundamental frequency f0 are usually disregarded by current intonation models. While there are numerous studies dealing with the nature and the origin of microprosody, little research has been done on its audibility and its effect on the naturalness o...
Article
Full-text available
F0 variation is a crucial feature in speech prosody, which can convey linguistic information such as focus and paralinguistic meanings such as surprise. How can multiple layers of information be represented with F0 in speech: are they divided into discrete layers of pitch or overlapped without clear divisions? We investigated this question by asses...
Article
In this study, we revisit consonantal perturbation of F0 in English, taking into particular consideration the effect of alignment of F0 contours to segments and the F0 extraction method in the acoustic analysis. We recorded words differing in consonant voicing, manner of articulation, and position in syllable, spoken by native speakers of American...
Chapter
This study examines the effects of segments, intonation and rhythm on the perception of second language (L2) accentedness and comprehensibility by focusing on a tone language, Mandarin Chinese. Fifteen Chinese sentences were manipulated by transferring the segments, intonation and rhythm between native and L2 speakers. 64 Chinese judges listened to...
Article
Full-text available
Although pre-low raising (PLR) has been extensively studied as a type of contextual tonal variation, its underlying mechanism is barely understood. This paper explored the effects of phonetic vs phonological duration on PLR in Cantonese and Thai and examined how speech rate and vowel quantity interact with its realization in these languages, resp...
Conference Paper
Full-text available
The complex f0 variations in continuous speech make it rather difficult to perform automatic recognition of tones in a language like Mandarin Chinese. In this study, we tested the use of the target approximation model (TAM) for continuous tone recognition on two datasets. TAM simulates f0 production from the articulatory point of view and so allow to d...
Article
The current study investigated the contribution of different acoustic dimensions to tonal contrasts in Pahari, an understudied language in the Pakistan-administered part of Kashmir. While previous research on the tonal languages of the region focused only on overall pitch patterns, the present study analyzed fundamental frequency (F0), duration, i...
Conference Paper
Full-text available
In this study, a state-of-the-art articulatory speech synthesiser was used as the basis for simulating the exploration of CV sounds imitating speech stimuli. By adopting a relevant kinematic model and systematically reducing the search space of consonant articulatory targets, intelligible CV sounds can be found. Derivative-free optimisation strateg...
Conference Paper
Full-text available
Typologically, some languages mark narrow focus with ‘post-focus compression’ (PFC) while others do not. For those which do, PFC is easily lost through bilingualism, at both societal and individual levels. At the societal level, when in contact with a –PFC language (e.g. Southern Min), a likely +PFC language can lose this prosodic feature (e.g. Tai...
Preprint
Full-text available
The way infants use auditory cues to learn to speak despite the acoustic mismatch of their vocal apparatus is a hot topic of scientific debate. The simulation of early vocal learning using articulatory speech synthesis offers a way towards gaining a deeper understanding of this process. One of the crucial parameters in these simulations is the choi...
Article
Full-text available
Speech is a highly skilled motor activity that shares a core problem with other motor skills: how to reduce the massive degrees of freedom (DOF) to the extent that the central nervous control and learning of complex motor movements become possible. It is hypothesized in this paper that a key solution to the DOF problem is to eliminate most of the t...
Article
Many studies across languages have recognised that focus substantially alters the prosodic structure of a sentence not only by increasing F0, intensity, and duration of the focused words but also by compressing the range of pitch and intensity of the post-focus words. Studies, however, are still not fully clear regarding the main effects of focus o...
Conference Paper
Full-text available
Article
Full-text available
Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical da...
Article
Full-text available
Ground-breaking studies on how Bangkok Thai tones have changed over the past 100 years (Pittayaporn 2007, 2018; Zhu et al. 2015) reveal a pattern that Zhu et al. (2015) term the “clockwise tone shift cycle:” low > falling > high level or rising-falling > rising > falling-rising or low. The present study addresses three follow-up questions: (1)...
Conference Paper
Full-text available
The present study tested the idea that coarticulation, despite involving overlap of articulatory gestures, is achieved by sequential target approximation at the level of individual articulator dimensions. For example, CV co-onset in a velar stop can be achieved by having the tongue body vertically move upward for a closure contact, while at the sam...
Conference Paper
Full-text available
This paper reports preliminary results of our effort to address the acoustic-to-articulatory inversion problem. We tested an approach that simulates speech production acquisition as a distal learning task, with acoustic signals of natural utterances in the form of MFCC as input, VocalTractLab, a 3D articulatory synthesizer controlled by target appro...
Article
Full-text available
Korean and English are both known to show on-focus pitch range expansion and post-focus pitch range compression (PFC). But it is not clear if this prosodic similarity would make it easy for Korean speakers to learn English focus prosody. In the present study, we conducted a production experiment using phone number strings to examine whether Korean...
Conference Paper
Full-text available
This paper presents a novel method to estimate the pitch target parameters of the target approximation model (TAM). The TAM allows the compact representation of natural pitch contours on a solid theoretical basis and can be used as an intonation model for text-to-speech synthesis. In contrast to previous approaches, the method proposed here estimat...
Article
This paper presents findings of the first systematic acoustic analysis of focus prosody in Hijazi Arabic (HA), an under-researched Arabic dialect. A question-answer paradigm was used to elicit information and contrastive focus at different sentence locations in comparison with their neutral focus counterparts. Systematic acoustic analyses were perf...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
Article
Full-text available
This paper introduces FormantPro, a Praat-based tool for large-scale, systematic analysis of formant movements, especially for experimental data. The program generates a rich set of output metrics, including continuous contours like time-normalized formant trajectories and formant velocity profiles suitable for direct graphical comparisons, and dis...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Preprint
Full-text available
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
Preprint
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Conference Paper
Full-text available
Previous research has shown that post-focus compression (PFC), the reduction of F0 and intensity after a focused word, is present in some languages but absent in many others. It has been hypothesized that the cross-linguistic distribution of PFC parallels that of the Nostratic macro-family. The present study is a test of this Nostratic-origin hypo...
Conference Paper
Full-text available
The typological feature ‘post-focus f0 range compression’ (PFC) is often considered an all-or-nothing phenomenon, being completely absent in -PFC languages but applied across-the-board in +PFC languages. This paper presents production data from Japanese and shows that, within a language, the realisation of PFC can be conditional upon lexical prosod...
Conference Paper
Full-text available
This study is a reexamination of the rhythm class hypothesis through an investigation of isochrony tendency in English, an alleged stress-timed language, and Chinese, an alleged syllable-timed language. We compared the relationship between segment and syllable duration in a corpus from each language. The results show that the correlation of segment...
Article
Full-text available
Music and speech both communicate emotional meanings in addition to their domain-specific contents. But it is not clear whether and how the two kinds of emotional meanings are linked. The present study is focused on exploring the emotional connotations of musical timbre of isolated instrument sounds through the perspective of emotional speech proso...
Article
It is often assumed, explicitly or implicitly, that speakers generate special cues in whispered tone and intonation to make up for the absence of fundamental frequency. The present study examined this assumption with one production and three perception experiments. The production experiment compared duration, intensity, formants and spectral tilt o...
Conference Paper
Full-text available
A posh accent in British English is associated with the upper class. Previous research on poshness has been centred on vocabulary, grammar and phonology, but little is known about the phonetic properties. This study, as part of a larger project, is an attempt to connect posh accent with attractiveness of voice through a common set of dimensions originati...
Article
Full-text available
Poshness refers to how much a British English speaker sounds upper class when they talk. Popular descriptions of posh English mostly focus on vocabulary, accent and phonology. This study tests the hypothesis that, as a social index, poshness is also manifested via phonetic properties known to encode vocal attractiveness. Specifically, posh English,...
Conference Paper
Full-text available
Part of speech (POS hereafter) is known to affect both duration and F0, such that nouns are longer and higher in F0 than verbs. In this study we tested the hypothesis that the POS effects are actually a word frequency effect, and that this effect is predictable from information theory. We tested this hypothesis by comparing 44 phonologically match...
Conference Paper
Full-text available
The present study investigated three northern Wu dialects: Wuxi, Suzhou, and Ningbo. It is found that, in all three dialects, focus is encoded by increasing the maximum F0 and duration of focused words, and lowering and compressing the F0 and pitch range of post-focus words. These results are consistent with previous findings about Wu dialect in Sh...
Article
Full-text available
The current study investigates whether and how focus, phrase boundary and newness can be simultaneously marked in speech prosody in Mandarin Chinese. Homophones were used to construct three syntactic structures that differed only in boundary condition, focus was elicited by preceding questions, while newness of postboundary words was manipulated as...
Article
Full-text available
Japanese has been observed to have two versions of the H tone, the higher of which is associated with an accented mora. However, the distinction of these two versions only surfaces in context but not in isolation, leading to a long-standing debate over whether there is one H tone or two. This article reports evidence that the higher version may res...
Article
Full-text available
There remains a gap in our knowledge base about neural representation of pitch attributes that occur between onset and offset of dynamic, curvilinear pitch contours. The aim is to evaluate how language experience shapes processing of pitch contours as reflected in the amplitude of cortical pitch-specific response components. Responses were elicited...
Conference Paper
Full-text available
This study explores the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Theories of intonational phonology, syntax, and phonetics make different predictions as to which focus location would be the most challenging to the native listener. Lexical pitch accent further complicates this picture. In a sentence wit...
Poster
Full-text available
This study explores the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Theories of intonational phonology, syntax, and phonetics make different predictions as to which focus location would be the most challenging to the native listener. Lexical pitch accent further complicates this picture. In a sentence wit...
Conference Paper
Full-text available
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at HMM state level to obtain statistical acoustic models combined with decision trees, which is therefore a purely statistical data-driven approach without explicit integration of any articulatory mechanisms...
Data
Full-text available
Poster
Full-text available
Previous studies have found that speaker sex can be identified in whispered English and Swedish. It is unknown whether listeners can also identify speaker gender from whispered Mandarin. We asked forty Mandarin listeners to judge the sex of six Mandarin speakers from phonated and whispered monosyllabic words. Results revealed a main effect of phona...
Article
Vocal emotions, as well as different speaking styles and speaker traits, are characterized by a complex interplay of multiple prosodic features. Natural sounding speech synthesis with the ability to control such paralinguistic aspects requires the manipulation of the corresponding prosodic features. With traditional concatenative speech synthesis i...
Conference Paper
Full-text available
Post-focus compression (PFC), the lowering of pitch range and intensity of the post prosodic focus components, is a phenomenon that has been found in various languages worldwide. The interesting findings of the presence and absence of PFC in two closely-related Mandarin Chinese languages, Beijing Mandarin and Taiwan Mandarin respectively, have brou...
Conference Paper
Full-text available
A previous study has found that whispered Mandarin, though still allowing listeners to perceive tones to a certain degree, does not carry acoustic cues that are special to whispered tones. That conclusion, however, was based on data from only one speaker. The present study attempted to verify the earlier finding with data from more speakers, with a...
Conference Paper
Full-text available
This paper introduces the Common Prosody Platform (CPP), a computational platform that implements major theories and models of prosody. CPP aims at a) adapting theory-specific assumptions into computational algorithms that can generate surface prosodic forms, and b) making all the models trainable through global optimization based on automatic anal...
Conference Paper
Full-text available
This study addressed the question of how multiple layers of meanings can be simultaneously encoded with F0 in speech by assessing pitch perception thresholds of focus and surprise in Mandarin Chinese. We synthetically increased the pitch height of one syllable in a sentence up to 12 semitones from its neutral baseline in one-semitone steps, and ask...
Conference Paper
Full-text available
This paper investigates the effect of speech rate on pre-low raising in Cantonese. Pre-low raising is an anticipatory tonal process where a high tone is raised when followed by a low tone (i.e. the trigger). Six native speakers of Cantonese were recorded saying a disyllable in 36 tone combinations (6 tones×6 tones) at two speech rates (normal and s...
Article
Full-text available
This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of spee...
Article
Full-text available
In computerized technology, artificial speech is becoming increasingly important, and is already used in ATMs, online gaming and healthcare contexts. However, today's artificial speech typically sounds monotonous, a main reason for this being the lack of meaningful prosody. One particularly important function of prosody is to convey different emoti...
Conference Paper
Full-text available
This paper reports our initial findings on whether Mandarin Chinese has developed effective strategies to convey tonal information in whispered speech. We recorded phonated and whispered tones in monosyllabic words, and analyzed the acoustic properties of the tonal contrasts. We then generated amplitude-modulated noise based on both the phonated an...
Conference Paper
Full-text available
This paper presents results from Japanese intonation modelling using PENTAtrainer2, an articulatory synthesiser. Our first aim is to show that PENTA, on which PENTAtrainer2 is based, can achieve high accuracy in predictive synthesis of varying intonation contours. We trained the synthesiser on a 6251-sentence functionally annotated corpus and gener...