About
307 Publications
88,506 Reads
9,836 Citations
Introduction
My research is concerned with what I believe is the central question about human speech: how exactly can it be so effective in transmitting a vast amount of meaning through a highly restrictive articulation process?
Directly related to this question are a number of sub-questions that are just as fundamental:
• What exactly are the kinds of meanings transmitted by speech?
• How are the meanings encoded by the articulatory system?
• And how are the meanings decoded in perception?
Current institution
Additional affiliations
December 1993 - June 1995
September 2003 - September 2004
September 2002 - September 2003
Education
September 1987 - December 1993
Publications (307)
Speech is produced continuously over time. So, the information it conveys, including intonational functions, also unfolds over time. But many intonational functions are encoded across whole utterances rather than only within certain words. How can perception process speech signals continuously over time, even for communicative functions that are gl...
Recent research has shown evidence based on a minimal contrast paradigm that consonants and vowels are articulatorily synchronized at the onset of the syllable. What remains less clear is the laryngeal dimension of the syllable, for which evidence of tone synchrony with the consonant-vowel syllable has been circumstantial. The present study assesse...
It has long been a mystery how children learn to speak without formal instructions. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between child...
The phonetics of emotion is about the acoustic-phonetic properties of the emotional facets of human vocalization. Conventionally, these properties are studied as correlates of a person’s internal states arising from reactions to the environment, where the internal states are defined by influential psychological theories of emotion. A more recent pe...
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we u...
This study is a preliminary investigation of intonation in Emirati Arabic (EA) (an under-researched Arabic dialect), using systematic acoustic analysis and computational modelling. First, we investigated the prosodic realisation of information focus and contrastive focus at sentence-initial, -penultimate and -final positions. The analysis of 1980 E...
Computational approaches have an important role to play in understanding the complex process of speech acquisition, in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and...
In English, a sentence like "He made out our intentions." could be misperceived as "He may doubt our intentions." because the coda /d/ sounds like it has become the onset of the next syllable. The nature and the occurrence condition of this resyllabification phenomenon are unclear, however. Previous empirical studies mainly relied on listener judgm...
It has long been held that languages of the world are divided into rhythm classes so that they are either stress-timed, syllable-timed or mora-timed. It has also long been known that duration serves various informational functions in speech. But it is unclear whether these two kinds of uses of duration are complementary to each other, or they...
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based...
This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a 4AFC identification task, we compared native Japanese listeners’ focus identification accuracy in different lexical accent × focus location conditions using resynthesised speech stimuli, which varied only in fundamental frequency....
Evoc-Learn is a system for simulating early vocal learning of spoken language in ways that can overcome some of the major bottlenecks in vocal learning. The system consists of VocalTractLab, a geometrical three-dimensional vocal tract model for simulating aeroacoustics and articulatory dynamics, a coarticulation model for controlling the temporal d...
While numerous studies on automatic speech recognition have been published in recent years describing data augmentation strategies based on time or frequency domain signal processing, few works exist on the artificial extensions of training data sets using purely synthetic speech data. In this work, the German KIEL corpus was augmented with synthet...
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the main measure of learning success. Thereby, a novel approach for artificial early vocal learning is presented that utilizes deep neural network-...
This chapter reviews studies on contextual tonal variation in Chinese languages, often referred to as ‘tonal coarticulation’ in the literature. We start by explaining why the term ‘contextual variation’ is preferred to ‘coarticulation’ for tones, before introducing different types of contextual variation observed in Chinese languages. The following...
Previous research has shown that post-focus compression (PFC), the reduction of pitch range and intensity after a focused word in an utterance, is a robust means of marking focus, but it is present only in some languages. The presence of PFC appears to follow language family lines. The present study is a further exploration of the distribution of...
Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a...
It has been widely assumed that in speech perception it is imperative to first detect a set of distinctive properties or features and then use them to recognize phonetic units like consonants, vowels, and tones. Those features can be auditory cues or articulatory gestures, or a combination of both. There have been no clear demonstrations of how exa...
An introduction to the range of current theoretical approaches to the prosody of spoken utterances, with practical applications of those theories.
Prosody is an extremely dynamic field, with a rapid pace of theoretical development and a steady expansion of its influence beyond linguistics into such areas as cognitive psychology, neuroscience, c...
This study tested the hypothesis that consonant and vowel are synchronised at the syllable onset, and that such synchronised co-onset is the essence of coarticulation. Articulatory data were collected for Mandarin Chinese, using Electromagnetic Articulography (EMA), and acoustic data were collected simultaneously. As a departure from conventional a...
While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed...
When pitch is explicitly modelled for parametric speech synthesis, microprosodic variations of the fundamental frequency f0 are usually disregarded by current intonation models. While there are numerous studies dealing with the nature and the origin of microprosody, little research has been done on its audibility and its effect on the naturalness o...
F0 variation is a crucial feature in speech prosody, which can convey linguistic information such as focus and paralinguistic meanings such as surprise. How can multiple layers of information be represented with F0 in speech: are they divided into discrete layers of pitch or overlapped without clear divisions? We investigated this question by asses...
In this study, we revisit consonantal perturbation of F0 in English, taking into particular consideration the effect of alignment of F0 contours to segments and the F0 extraction method in the acoustic analysis. We recorded words differing in consonant voicing, manner of articulation, and position in syllable, spoken by native speakers of American...
This study examines the effects of segments, intonation and rhythm on the perception of second language (L2) accentedness and comprehensibility by focusing on a tone language, Mandarin Chinese. Fifteen Chinese sentences were manipulated by transferring the segments, intonation and rhythm between native and L2 speakers. 64 Chinese judges listened to...
Although pre-low raising (PLR) has been extensively studied as a type of contextual tonal variation, its underlying mechanism is barely understood. This paper explored the effects of phonetic vs phonological duration on PLR in Cantonese and Thai and examined how speech rate and vowel quantity interact with its realization in these languages, resp...
The complex f0 variations in continuous speech make it rather difficult to perform automatic recognition of tones in a language like Mandarin Chinese. In this study, we tested the use of the target approximation model (TAM) for continuous tone recognition on two datasets. TAM simulates f0 production from the articulatory point of view and so allows to d...
The current study investigated the contribution of different acoustic dimensions to tonal contrasts in Pahari, an understudied language in the Pakistan-administrated part of Kashmir. While previous research on the tonal languages of the region focused only on overall pitch patterns, the present study analyzed fundamental frequency (F0), duration, i...
In this study we tested the hypothesis that consonant and vowel articulations start at the same time at syllable onset [1]. Articulatory data was collected for Mandarin Chinese using Electromagnetic Articulography (EMA), which tracks flesh-point movements in time and space. Unlike the traditional velocity threshold method [2], we used a triplet me...
In this study, a state-of-the-art articulatory speech synthesiser was used as the basis for simulating the exploration of CV sounds imitating speech stimuli. By adopting a relevant kinematic model and systematically reducing the search space of consonant articulatory targets, intelligible CV sounds can be found. Derivative-free optimisation strateg...
[This corrects the article DOI: 10.3389/fpsyg.2019.02469.]
Typologically, some languages mark narrow focus with ‘post-focus compression’ (PFC) while others do not. For those which do, PFC is easily lost through bilingualism, at both societal and individual levels. At the societal level, when in contact with a –PFC language (e.g. Southern Min), a likely +PFC language can lose this prosodic feature (e.g. Tai...
Many theories assume that speech perception is done by first extracting features like the distinctive features, tonal features or articulatory gestures before recognizing phonetic units such as segments and tones. But it is unclear how exactly extracted features can lead to effective phonetic recognition. In this study we explore this issue by usin...
This study investigates Chongqing Dialect, a language largely used in Southwest China which is mutually intelligible to Beijing Mandarin speakers. Phonetic variations triggered by focus in Chongqing Dialect, especially in the form of post-focus compression (PFC), are investigated in terms of max F0, mean F0, duration and intensity. A follow-up perc...
The way infants use auditory cues to learn to speak despite the acoustic mismatch of their vocal apparatus is a hot topic of scientific debate. The simulation of early vocal learning using articulatory speech synthesis offers a way towards gaining a deeper understanding of this process. One of the crucial parameters in these simulations is the choi...
Speech is a highly skilled motor activity that shares a core problem with other motor skills: how to reduce the massive degrees of freedom (DOF) to the extent that the central nervous control and learning of complex motor movements become possible. It is hypothesized in this paper that a key solution to the DOF problem is to eliminate most of the t...
Many studies across languages have recognised that focus substantially alters the prosodic structure of a sentence not only by increasing F0, intensity, and duration of the focused words but also by compressing the range of pitch and intensity of the post-focus words. Studies, however, are still not fully clear regarding the main effects of focus o...
Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical da...
Ground-breaking studies on how Bangkok Thai tones have changed over the past 100 years (Pittayaporn 2007, 2018; Zhu et al. 2015) reveal a pattern that Zhu et al. (2015) term the “clockwise tone shift cycle:” low > falling > high level or rising-falling > rising > falling-rising or low. The present study addresses three follow-up questions: (1)...
In this paper, we report findings of a major difference between Mandarin and English in terms of means of marking major prosodic boundaries. We performed detailed duration analysis on two large corpora, one in each language, using pre-labelled break indices as a reference indicator of boundary strength. Results showed that pre-boundary syllable dur...
The present study tested the idea that coarticulation, despite involving overlap of articulatory gestures, is achieved by sequential target approximation at the level of individual articulator dimensions. For example, CV co-onset in a velar stop can be achieved by having the tongue body vertically move upward for a closure contact, while at the sam...
This paper reports preliminary results of our effort to address the acoustic-to-articulatory inversion problem. We tested an approach that simulates speech production acquisition as a distal learning task, with acoustic signals of natural utterances in the form of MFCC as input, VocalTractLab, a 3D articulatory synthesizer controlled by target appro...
Korean and English are both known to show on-focus pitch range expansion and post-focus pitch range compression (PFC). But it is not clear if this prosodic similarity would make it easy for Korean speakers to learn English focus prosody. In the present study, we conducted a production experiment using phone number strings to examine whether Korean...
This paper presents a novel method to estimate the pitch target parameters of the target approximation model (TAM). The TAM allows the compact representation of natural pitch contours on a solid theoretical basis and can be used as an intonation model for text-to-speech synthesis. In contrast to previous approaches, the method proposed here estimat...
This paper presents findings of the first systematic acoustic analysis of focus prosody in Hijazi Arabic (HA), an under-researched Arabic dialect. A question-answer paradigm was used to elicit information and contrastive focus at different sentence locations in comparison with their neutral focus counterparts. Systematic acoustic analyses were perf...
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
This paper introduces FormantPro, a Praat-based tool for large-scale, systematic analysis of formant movements, especially for experimental data. The program generates a rich set of output metrics, including continuous contours like time-normalized formant trajectories and formant velocity profiles suitable for direct graphical comparisons, and dis...
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Previous research has shown that post-focus compression (PFC), the reduction of F0 and intensity after a focused word, is present in some languages but absent in many others. It has been hypothesized that the cross-linguistic distribution of PFC parallels that of the Nostratic macro-family. The present study is a test of this Nostratic-origin hypo...
The typological feature ‘post-focus fo range compression’ (PFC) is often considered an all-or-nothing phenomenon, being completely absent in -PFC languages but applied across-the-board in +PFC languages. This paper presents production data from Japanese and shows that, within a language, the realisation of PFC can be conditional upon lexical prosod...
This study is a reexamination of the rhythm class hypothesis through an investigation of isochrony tendency in English, an alleged stress-timed language, and Chinese, an alleged syllable-timed language. We compared the relationship between segment and syllable duration in a corpus from each language. The results show that the correlation of segment...
Music and speech both communicate emotional meanings in addition to their domain-specific contents. But it is not clear whether and how the two kinds of emotional meanings are linked. The present study is focused on exploring the emotional connotations of musical timbre of isolated instrument sounds through the perspective of emotional speech proso...
It is often assumed, explicitly or implicitly, that speakers generate special cues in whispered tone and intonation to make up for the absence of fundamental frequency. The present study examined this assumption with one production and three perception experiments. The production experiment compared duration, intensity, formants and spectral tilt o...
Posh accent in British English is associated with upper class. Previous research on poshness has been centred on vocabulary, grammar and phonology, but little is known about the phonetic properties. This study, as part of a larger project, is an attempt to connect posh accent with attractiveness of voice through a common set of dimensions originati...
Li Jiao Chengxia Wang C Hsu- [...]
Yi Xu
Poshness refers to how much a British English speaker sounds upper class when they talk. Popular descriptions of posh English mostly focus on vocabulary, accent and phonology. This study tests the hypothesis that, as a social index, poshness is also manifested via phonetic properties known to encode vocal attractiveness. Specifically, posh English,...
Part of speech (POS hereafter) is known to affect both duration and F0, such that nouns are longer and higher in F0 than verbs. In this study we tested the hypothesis that the POS effects are actually a word frequency effect, and that this effect is predictable from information theory. We tested this hypothesis by comparing 44 phonologically match...
The present study investigated three northern Wu dialects: Wuxi, Suzhou, and Ningbo. It is found that, in all three dialects, focus is encoded by increasing the maximum F0 and duration of focused words, and lowering and compressing the F0 and pitch range of post-focus words. These results are consistent with previous findings about Wu dialect in Sh...
The current study investigates whether and how focus, phrase boundary and newness can be simultaneously marked in speech prosody in Mandarin Chinese. Homophones were used to construct three syntactic structures that differed only in boundary condition, focus was elicited by preceding questions, while newness of postboundary words was manipulated as...
Japanese has been observed to have two versions of the H tone, the higher of which is associated with an accented mora. However, the distinction of these two versions only surfaces in context but not in isolation, leading to a long-standing debate over whether there is one H tone or two. This article reports evidence that the higher version may res...
There remains a gap in our knowledge base about neural representation of pitch attributes that occur between onset and offset of dynamic, curvilinear pitch contours. The aim is to evaluate how language experience shapes processing of pitch contours as reflected in the amplitude of cortical pitch-specific response components. Responses were elicited...
This study explores the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Theories of intonational phonology, syntax, and phonetics make different predictions as to which focus location would be the most challenging to the native listener. Lexical pitch accent further complicates this picture. In a sentence wit...
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at HMM state level to obtain statistical acoustic models combined with decision trees, which is therefore a purely statistical data-driven approach without explicit integration of any articulatory mechanisms...
Previous studies have found that speaker sex can be identified in whispered English and Swedish. It is unknown whether listeners can also identify speaker gender from whispered Mandarin. We asked forty Mandarin listeners to judge the sex of six Mandarin speakers from phonated and whispered monosyllabic words. Results revealed a main effect of phona...
Vocal emotions, as well as different speaking styles and speaker traits, are characterized by a complex interplay of multiple prosodic features. Natural sounding speech synthesis with the ability to control such paralinguistic aspects requires the manipulation of the corresponding prosodic features. With traditional concatenative speech synthesis i...
Post-focus compression (PFC), the lowering of pitch range and intensity of the post prosodic focus components, is a phenomenon that has been found in various languages worldwide. The interesting findings of the presence and absence of PFC in two closely-related Mandarin Chinese languages, Beijing Mandarin and Taiwan Mandarin respectively, have brou...
A previous study has found that whispered Mandarin, though still allowing listeners to perceive tones to a certain degree, does not carry acoustic cues that are special to whispered tones. That conclusion, however, was based on data from only one speaker. The present study attempted to verify the earlier finding with data from more speakers, with a...
This paper introduces the Common Prosody Platform (CPP), a computational platform that implements major theories and models of prosody. CPP aims at a) adapting theory-specific assumptions into computational algorithms that can generate surface prosodic forms, and b) making all the models trainable through global optimization based on automatic anal...
This study addressed the question of how multiple layers of meanings can be simultaneously encoded with F0 in speech by assessing pitch perception thresholds of focus and surprise in Mandarin Chinese. We synthetically increased the pitch height of one syllable in a sentence up to 12 semitones from its neutral baseline in one-semitone steps, and ask...
This paper investigates the effect of speech rate on pre-low raising in Cantonese. Pre-low raising is an anticipatory tonal process where a high tone is raised when followed by a low tone (i.e. the trigger). Six native speakers of Cantonese were recorded saying a disyllable in 36 tone combinations (6 tones×6 tones) at two speech rates (normal and s...