Article

Partial compensation for coarticulatory vowel nasalization across concatenative and neural text-to-speech


Abstract

This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, where listeners discriminated between a word pair containing either both oral or nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred either in identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behavior, respectively. For identical contexts, listeners were better at discriminating between oral and nasalized vowels in neural than in concatenative TTS for nasalized same-vowel trials, but better discrimination for concatenative TTS was observed for oral same-vowel trials. Meanwhile, listeners displayed less compensation for coarticulation in neural than in concatenative TTS. To determine whether apparent roboticity of the TTS voice shapes vowel discrimination and compensation patterns, a "roboticized" version of neural TTS was generated (monotonized f0 and addition of an echo), holding phonetic nasality constant; a ratings study (experiment 2) confirmed that the manipulation resulted in different apparent roboticity. Experiment 3 compared the discrimination of unmodified neural TTS and roboticized neural TTS: listeners displayed lower accuracy in identical contexts for roboticized relative to unmodified neural TTS, yet performance in alternating contexts was similar.
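For concreteness, the roboticization manipulation described above (monotonized f0 plus an added echo) can be approximated along the following lines. This is a minimal sketch using the parselmouth interface to Praat; the file names, f0 analysis range, echo delay, and echo gain are illustrative assumptions, not the settings used in the study.

```python
# Hedged sketch: "roboticize" a TTS stimulus by flattening f0 to its mean and
# mixing in a single attenuated echo. Parameter values are illustrative
# assumptions; the study's actual settings are not reproduced here.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("neural_tts_word.wav")   # hypothetical input file
dur = snd.get_total_duration()

# 1) Monotonize f0: replace the pitch tier with a single point at the mean f0.
pitch = snd.to_pitch()
mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")
manipulation = call(snd, "To Manipulation", 0.01, 75, 600)   # assumed f0 range
pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Remove points between", 0, dur)
call(pitch_tier, "Add point", dur / 2, mean_f0)
call([pitch_tier, manipulation], "Replace pitch tier")
flat = call(manipulation, "Get resynthesis (overlap-add)")

# 2) Add an echo: a delayed, attenuated copy of the signal (40 ms, gain 0.4).
sr = flat.sampling_frequency
y = flat.values[0]
delay = int(0.040 * sr)
y_echo = y.copy()
y_echo[delay:] += 0.4 * y[:-delay]
y_echo /= max(1.0, np.max(np.abs(y_echo)))        # avoid clipping

parselmouth.Sound(y_echo, sampling_frequency=sr).save(
    "neural_tts_word_roboticized.wav", parselmouth.SoundFileFormat.WAV)
```

Because the manipulation operates only on f0 and on the waveform after resynthesis, the spectral cues to vowel nasality are left untouched, which is the point of holding phonetic nasality constant across the unmodified and roboticized conditions.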


... Understanding how acoustic differences between the TTS voices and naturally produced speech influence speech perception processes such as vowel discrimination and compensation for coarticulatory vowel nasalization can aid in improving device voices in years to come (cf. comparing perceptual compensation across TTS types in [36]). ...
Conference Paper
Full-text available
The current study explores whether perception of coarticulatory vowel nasalization differs by speaker age (adult vs. child) and type of voice (naturally produced vs. synthetic speech). Listeners completed a 4IAX discrimination task between pairs containing acoustically identical (both nasal or oral) vowels and acoustically distinct (one oral, one nasal) vowels. Vowels occurred in either the same consonant contexts or different contexts across pairs. Listeners completed the experiment with either naturally produced speech or text-to-speech (TTS). For same-context trials, listeners were better at discriminating between oral and nasal vowels for child speech in the synthetic voices but adult speech in the natural voices. Meanwhile, in different-context trials, listeners were less able to discriminate, indicating more perceptual compensation for synthetic voices. There was no difference in different-context discrimination across talker ages, indicating that listeners did not compensate differently if the speaker was a child or adult. Findings are relevant for models of compensation, computer personification theories, and speaker-indexical perception accounts.
Article
Full-text available
The present study investigates whether native speakers of German phonetically accommodate to natural and synthetic voices in a shadowing experiment. We aim to determine whether this phenomenon, which is frequently found in HHI, also occurs in HCI involving synthetic speech. The examined features pertain to different phonetic domains: allophonic variation, schwa epenthesis, realization of pitch accents, word-based temporal structure and distribution of spectral energy. On the individual level, we found that the participants converged to varying subsets of the examined features, while they maintained their baseline behavior in other cases or, in rare instances, even diverged from the model voices. This shows that accommodation with respect to one particular feature may not predict the behavior with respect to another feature. On the group level, the participants of the natural condition converged to all features under examination, however very subtly so for schwa epenthesis. The synthetic voices, while partly reducing the strength of effects found for the natural voices, triggered accommodating behavior as well. The predominant pattern for all voice types was convergence during the interaction followed by divergence after the interaction.
Conference Paper
Full-text available
The present study compares how individuals perceive gradient acoustic realizations of emotion produced by a human voice versus an Amazon Alexa text-to-speech (TTS) voice. We manipulated semantically neutral sentences spoken by both talkers with identical emotional synthesis methods, using three levels of increasing 'happiness' (0 %, 33 %, 66 % 'happier'). On each trial, listeners (native speakers of American English, n=99) rated a given sentence on two scales to assess dimensions of emotion: valence (negative-positive) and arousal (calm-excited). Participants also rated the Alexa voice on several parameters to assess anthropomorphism (e.g., naturalness, human-likeness, etc.). Results showed that the emotion manipulations led to increases in perceived positive valence and excitement. Yet, the effect differed by interlocutor: increasing 'happiness' manipulations led to larger changes for the human voice than the Alexa voice. Additionally, we observed individual differences in perceived valence/arousal based on participants' anthropomorphism scores. Overall, this line of research can speak to theories of computer personification and elucidate our changing relationship with voice-AI technology.
Conference Paper
Full-text available
This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (using AWS Polly TTS), generated using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences in concatenative and neural TTS at two noise levels (-3 dB, -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, in the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.
Conference Paper
Full-text available
The current study explores whether the top-down influence of speaker age guise influences patterns of compensation for coarticulation. /u/-fronting variation in California is linked to both phonetic and social factors: /u/ in alveolar contexts is fronter than in bilabial contexts and /u/-fronting is more advanced in younger speakers. We investigate whether the apparent age of the speaker, via a guise depicting a 21-year-old woman or a 55-year-old woman, influences whether listeners compensate for coarticulation on /u/. Listeners performed a paired discrimination task of /u/ with a raised F2 (fronted) in an alveolar consonant context (/sut/), compared to non-fronted /u/ in a non-coronal context. Overall, discrimination was more veridical for the younger guise than for the older guise, leading to the perception of more inherently fronted variants for the younger talker. Results indicate that apparent talker age may influence perception of /u/-fronting, but not only in coarticulatory contexts.
Conference Paper
Full-text available
Humans are now regularly speaking to voice-activated artificially intelligent (voice-AI) assistants. Yet, our understanding of the cognitive mechanisms at play during speech interactions with a voice-AI, relative to a real human, interlocutor is an understudied area of research. The present study tests whether top-down guise of "apparent humanness" affects vocal alignment patterns to human and text-to-speech (TTS) voices. In a between-subjects design, participants heard either 4 naturally-produced or 4 TTS voices. Apparent humanness guise varied within-subject. Speaker guise was manipulated via a top-down label with images, either of two pictures of voice-AI systems (Amazon Echos) or two human talkers. Vocal alignment in vowel duration revealed top-down effects of apparent humanness guise: participants showed greater alignment to TTS voices when presented with a device guise ("authentic guise"), but lower alignment in the two inauthentic guises. Results suggest a dynamic interplay of bottom-up and top-down factors in human and voice-AI interaction.
Conference Paper
Full-text available
The current study tests subjects' vocal alignment toward female and male text-to-speech (TTS) voices presented via three systems: Amazon Echo, Nao, and Furhat. These systems vary in their physical form, ranging from a cylindrical speaker (Echo), to a small robot (Nao), to a human-like robot bust (Furhat). We test whether this cline of personification (cylinder < mini robot < human-like robot bust) predicts patterns of gender-mediated vocal alignment. In addition to comparing multiple systems, this study addresses a confound in many prior vocal alignment studies by using identical voices across the systems. Results show evidence for a cline of personification toward female TTS voices by female shadowers (Echo < Nao < Furhat) and a more categorical effect of device personification for male TTS voices by male shadowers (Echo < Nao, Furhat). These findings are discussed in terms of their implications for models of device-human interaction and theories of computer personification.
Article
Full-text available
Listeners show better-than-chance discrimination of nasalized and oral vowels occurring in appropriate consonantal contexts. Yet, the methods for investigating partial perceptual compensation for nasal coarticulation often include nasal and oral vowels containing naturally different pitch contours. Listeners may therefore be discriminating between these vowels based on pitch differences and not nasalization. The current study investigates the effect of pitch variation on the discrimination of nasalized and oral vowels in C_N and C_C items. The f0 contour of vowels within paired discrimination trials was varied. The results indicate that pitch variation does not influence patterns of partial perceptual compensation for coarticulation.
Article
Full-text available
Voice has become a widespread and commercially viable interaction mechanism with the introduction of voice assistants (VAs), such as Amazon’s Alexa, Apple’s Siri, Google Assistant, and Microsoft’s Cortana. Despite their prevalence, we do not have a detailed understanding of how these technologies are used in domestic spaces. To understand how people use VAs, we conducted interviews with 19 users, and analyzed the log files of 82 Amazon Alexa devices, totaling 193,665 commands, and 88 Google Home Devices, totaling 65,499 commands. In our analysis, we identified music, search, and IoT usage as the command categories most used by VA users. We explored how VAs are used in the home, investigated the role of VAs as scaffolding for Internet of Things device control, and characterized emergent issues of privacy for VA users. We conclude with implications for the design of VAs and for future research studies of VAs.
Article
Full-text available
This study was designed to test whether listener-based sound change—listener misperception (Ohala, 1981, 1993) and perceptual cue re-weighting (Beddor, 2009, 2012)—can be observed synchronically in a laboratory setting. Co-registered articulatory data (degree of nasalization, tongue height, breathiness) and acoustic data (F1 frequency) related to the productions of phonemic oral and nasal vowels of Southern French were first collected from four native speakers, and the acoustic recordings were subsequently presented to nine Australian English naïve listeners, who were instructed to imitate the native productions. During these imitations, similar articulatory and acoustic data were collected in order to compare the articulatory strategies used by the two groups. The results suggest that the imitators successfully reproduced the acoustic distinctions made by the native speakers, but that they did so using different articulatory strategies. The articulatory strategies for the vowel pair /ɑ̃/-/a/ suggest that listeners (at least partially) misperceived F1-lowering due to nasalization and breathiness as being due to tongue height. Additional evidence supports perceptual cue re-weighting, in that the naïve imitators used nasalance less, and tongue height more, in order to obtain the same F1 nasal-oral distinctions that the native speakers had originally produced.
Article
Full-text available
Generally, concatenative speech synthesis systems provide considerable synthesis quality since the criteria for unit selection methods have been optimized. However, the level of synthesis quality still depends on the adequate concatenation of speech units. Adequate concatenation of speech units requires that concatenation mismatches, such as phase mismatch and discontinuity of the spectral envelope, do not appear in the synthesized speech signal. Avoiding phase mismatches therefore leads to high speech synthesis quality, and phase mismatches are avoided by an appropriate pitch marking algorithm. Accordingly, a pitch marking study was carried out by evaluating the available pitch marking algorithms: a speech database was pitch marked several times using the different pitch marking algorithms, and several sentences were synthesized applying the different pitch markings of the speech database. A Mean Opinion Score (MOS) listening test was carried out to evaluate the synthesized speech sentences with respect to human perception of mismatches. The best pitch marking algorithm was selected according to its observed effect on the quality of the speech synthesis.
Article
Full-text available
This study explores the relationship between prosodic strengthening and linguistic contrasts in English by examining temporal realization of nasals (N-duration) in CVN# and #NVC, and their coarticulatory influence on vowels (V-nasalization). Results show that different sources of prosodic strengthening bring about different types of linguistic contrasts. Prominence enhances the consonant's [nasality] as reflected in an elongation of N-duration, but it enhances the vowel's [orality] (rather than [nasality]) showing coarticulatory resistance to the nasal influence even when the nasal is phonologically focused (e.g., mob-bob; bomb-bob). Boundary strength induces different types of enhancement patterns as a function of prosodic position (initial vs. final). In the domain-initial position, boundary strength reduces the consonant's [nasality] as evident in a shortening of N-duration and a reduction of V-nasalization, thus enhancing CV contrast. The opposite is true with the domain-final nasal in which N-duration is lengthened accompanied by greater V-nasalization, showing coarticulatory vulnerability. The systematic coarticulatory variation as a function of prosodic factors indicates that V-nasalization as a coarticulatory process is indeed under speaker control, fine-tuned in a linguistically significant way. In dynamical terms, these results may be seen as coming from differential intergestural coupling relationships that may underlie the difference in V-nasalization in CVN# vs. #NVC. It is proposed that the timing initially determined by such coupling relationships must be fine-tuned by prosodic strengthening in a way that reflects the relationship between dynamical underpinnings of speech timing and linguistic contrasts.
Article
Full-text available
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
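As a minimal sketch of the mechanism described in the abstract (assuming PyTorch, toy layer sizes, and a simplified residual block rather than the published gated architecture), stacked dilated causal convolutions give each output step a receptive field over past samples only, and a final 1x1 convolution yields a categorical distribution over the next sample:

```python
# Hedged sketch of WaveNet's core mechanism (toy sizes, not the published model):
# stacked dilated *causal* 1-D convolutions predict a categorical distribution
# over the next audio sample, conditioned only on past samples.
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    def forward(self, x):
        # Left-pad so each output frame depends only on current and past inputs.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (pad, 0)))

class TinyWaveNet(nn.Module):
    def __init__(self, n_classes=256, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Embedding(n_classes, channels)          # mu-law sample ids
        self.layers = nn.ModuleList(
            CausalConv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations)
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, samples):                  # samples: (batch, time) int64
        x = self.embed(samples).transpose(1, 2)  # -> (batch, channels, time)
        for conv in self.layers:
            x = x + torch.tanh(conv(x))          # simplified residual block
        return self.out(x)                       # logits for the *next* sample

# Autoregressive use: feed generated samples back in, one step at a time.
model = TinyWaveNet()
logits = model(torch.randint(0, 256, (1, 1000)))  # -> (1, 256, 1000)
```

Doubling the dilation at each layer is what lets the receptive field grow exponentially with depth, which is how the model can condition on thousands of past samples without impractically deep networks.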
Technical Report
Full-text available
Description: Fit linear and generalized linear mixed-effects models. The models and their components are represented using S4 classes and methods. The core computational algorithms are implemented using the 'Eigen' C++ library for numerical linear algebra and 'RcppEigen' "glue".
Article
Full-text available
This paper deals with the traditional problem of the occurrence of audible discontinuities at concatenation points at diphone boundaries in concatenative speech synthesis. While most of the related studies put stress on the spectral component, we focused on the pitch contours and their role as predictors of the discontinuities. To measure the amount of information contained in the pitch contours, we trained SVM classifiers using perceptual data collected in listening tests. The results have shown that fine-grained pitch contours extracted from the vicinity of the concatenation points carry enough information for classifying continuous and discontinuous joins with high accuracy.
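A hedged sketch of this kind of classification setup, using scikit-learn on synthetic toy data; the feature layout (a fixed number of f0 values around each join), the labels, and the hyperparameters are assumptions for illustration, not those of the paper:

```python
# Hedged sketch: an SVM trained on short f0 contours extracted around diphone
# concatenation points, labelled as perceptually continuous (0) or
# discontinuous (1) from a listening test. All data here are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_joins, n_frames = 200, 20                  # 20 f0 values around each join (toy)
X = rng.normal(loc=120.0, scale=15.0, size=(n_joins, n_frames))  # f0 in Hz
y = rng.integers(0, 2, size=n_joins)         # perceptual labels (toy)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```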
Article
Full-text available
This study includes results of an articulatory (electromagnetic articulography, i.e. EMA) and acoustic study of the realizations of three oral–nasal vowel pairs /a/–/ɑ̃/, /ε/–/ε̃/, and /o/–/ɔ̃/ recorded from 12 Northern Metropolitan French (NMF) female speakers in laboratory settings. By studying the position of the tongue and the lips during the production of target oral and nasal vowels and simultaneously recording the acoustic signal, the predicted effects of velo-pharyngeal (VP) coupling on the acoustic output of the vocal tract can be separated from those due to oral articulatory configuration in a qualitative manner. Based on the previous research, all nasal vowels were expected to be produced with at least some change in lingual and labial articulatory configurations compared to their oral vowel counterparts. Evidence is observed which suggests that many of the oral articulatory configurations of NMF nasal vowels enhance the acoustic effect of VP coupling on F1 and F2 frequencies. Moreover, evidence is observed that the oral articulatory strategies used to produce the oral/nasal vowel distinction are idiosyncratic, but that, nevertheless, speakers produce a similar acoustic output. These results are discussed in the light of motor equivalence as well as the view that the goal of speech acts is acoustic, not articulatory.
Article
Full-text available
Speech produced in the context of real or imagined communicative difficulties is characterized by hyperarticulation. Phonological neighborhood density (ND) conditions similar patterns in production: Words with many neighbors are hyperarticulated relative to words with fewer; Hi ND words also show greater coarticulation than Lo ND words [e.g., Scarborough, R. (2012). "Lexical similarity and speech production: Neighborhoods for nonwords," Lingua 122(2), 164-176]. Coarticulatory properties of "clear speech" are more variable across studies. This study examined hyperarticulation and nasal coarticulation across five real and simulated clear speech contexts and two neighborhood conditions, and investigated consequences of these details for word perception. The data revealed a continuum of (attempted) clarity, though real listener-directed speech (Real) differed from all of the simulated styles. Like the clearest simulated-context speech (spoken "as if to someone hard-of-hearing"-HOH), Real had greater hyperarticulation than other conditions. However, Real had the greatest coarticulatory nasality while HOH had the least. Lexical decisions were faster for words from Real than from HOH, indicating that speech produced in real communicative contexts (with hyperarticulation and increased coarticulation) was perceptually better than simulated clear speech. Hi ND words patterned with Real in production, and Real Hi ND words were clear enough to overcome the dense neighborhood disadvantage.
Article
Full-text available
In acoustic studies of vowel nasalization, it is sometimes assumed that the primary articulatory difference between an oral vowel and a nasal vowel is the coupling of the nasal cavity to the rest of the vocal tract. Acoustic modulations observed in nasal vowels are customarily attributed to the presence of additional poles affiliated with the naso-pharyngeal tract and zeros affiliated with the nasal cavity. We test the hypothesis that oral configuration may also change during nasalized vowels, either enhancing or compensating for the acoustic modulations associated with nasality. We analyze tongue position, nasal airflow, and acoustic data to determine whether American English /i/ and /a/ manifest different oral configurations when they are nasalized, i.e. when they are followed by nasal consonants. We find that tongue position is higher during nasalized [ĩ] than it is during oral [i] but do not find any effect for nasalized [ã]. We argue that speakers of American English raise the tongue body during nasalized [ĩ] in order to counteract the perceived F1-raising (centralization) associated with high vowel nasalization.
Article
Full-text available
The perception of coarticulated speech as it unfolds over time was investigated by monitoring eye movements of participants as they listened to words with oral vowels or with late or early onset of anticipatory vowel nasalization. When listeners heard [CṼNC] and had visual choices of images of CVNC (e.g., send) and CVC (said) words, they fixated more quickly and more often on the CVNC image when onset of nasalization began early in the vowel compared to when the coarticulatory information occurred later. Moreover, when a standard eye movement programming delay is factored in, fixations on the CVNC image began to occur before listeners heard the nasal consonant. Listeners' attention to coarticulatory cues for velum lowering was selective in two respects: (a) listeners assigned greater perceptual weight to coarticulatory information in phonetic contexts in which [Ṽ] but not N is an especially robust property, and (b) individual listeners differed in their perceptual weights. Overall, the time course of perception of velum lowering in American English indicates that the dynamics of perception parallel the dynamics of the gestural information encoded in the acoustic signal. In real-time processing, listeners closely track unfolding coarticulatory information in ways that speed lexical activation.
Article
Full-text available
The experiments reported here used auditory–visual mismatches to compare three approaches to speaker normalization in speech perception: radical invariance, vocal tract normalization, and talker normalization. In contrast to the first two, the talker normalization theory assumes that listeners' subjective, abstract impressions of talkers play a role in speech perception. Experiment 1 found that the gender of a visually presented face affects the location of the phoneme boundary between [ʊ] and [ʌ] in the perceptual identification of a continuum of auditory–visual stimuli ranging from hood to hud. This effect was found for both “stereotypical” and “non-stereotypical” male and female voices. The experiment also found that voice stereotypicality had an effect on the phoneme boundary. The difference between male and female talkers was greater when the talkers were rated by listeners as “stereotypical”. Interestingly, for the two female talkers in this experiment, rated stereotypicality was correlated with voice breathiness rather than vowel fundamental frequency. Experiment 2 replicated and extended experiment 1 and tested whether the visual stimuli in experiment 1 were being perceptually integrated with the acoustic stimuli. In addition to the effects found in experiment 1, there was a boundary effect for the visually presented word: listeners responded hood more frequently when the acoustic stimulus was paired with a movie clip of a talker saying hood. Experiment 3 tested the abstractness of the talker information used in speech perception. Rather than seeing movie clips of male and female talkers, listeners were instructed to imagine a male or female talker while performing an audio-only identification task with a gender-ambiguous hood-hud continuum. The phoneme boundary differed as a function of the imagined gender of the talker. The results from these experiments suggest that listeners integrate abstract gender information with phonetic information in speech perception. This conclusion supports the talker normalization theory of perceptual speaker normalization.
Conference Paper
Full-text available
The lack of prosody variation in text-to-speech systems contributes to their perceived unnaturalness when synthesizing extended passages. In this paper, we present a method to improve prosody generation in this direction. A database of natural sample sentences is searched for sentences having similar word and syllable structure to the input. One sentence is selected randomly from the similar sentences found. The prosody of the randomly selected natural sentence is used as a target to generate the prosody of the synthetic one. An experiment was conducted to determine the potential of the proposed method. The rule-based pitch contour generation of a Hungarian concatenative synthesizer was replaced by a semi-automatic implementation of the proposed method. A listening test showed that subjects preferred sentences synthesized by the proposed method over a rule-based solution. Index Terms: speech synthesis, prosodic variability, F0 variation, F0 transplantation
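A minimal sketch of the selection step described above, assuming sentences are represented by per-word syllable counts and paired with stored f0 contours; both representations and the exact matching criterion are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch: pick, from a database of natural sentences, one whose
# word/syllable-count pattern matches the input, and reuse its f0 contour as
# the prosody target for the synthetic sentence.
import random

def syllable_pattern(per_word_counts):
    """A sentence is represented here simply as a tuple of per-word syllable counts."""
    return tuple(per_word_counts)

def select_prosody_target(input_counts, database):
    """database: list of dicts with 'syllables' (per-word counts) and 'f0' (contour)."""
    pattern = syllable_pattern(input_counts)
    matches = [s for s in database if syllable_pattern(s["syllables"]) == pattern]
    return random.choice(matches)["f0"] if matches else None

# Toy usage: a 3-word input whose words have 2, 1, and 3 syllables.
db = [
    {"syllables": [2, 1, 3], "f0": [110, 130, 125, 100]},
    {"syllables": [1, 2, 2], "f0": [120, 115, 105]},
]
print(select_prosody_target([2, 1, 3], db))
```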
Article
Full-text available
The research investigates how listeners segment the acoustic speech signal into phonetic segments and explores implications that the segmentation strategy may have for their perception of the (apparently) context-sensitive allophones of a phoneme. Two manners of segmentation are contrasted. In one, listeners segment the signal into temporally discrete, context-sensitive segments. In the other, which may be consistent with the talker’s production of the segments, they partition the signal into separate, but overlapping, segments freed of their contextual influences. Two complementary predictions of the second hypothesis are tested. First, listeners will use anticipatory coarticulatory information for a segment as information for the forthcoming segment. Second, subjects will not hear anticipatory coarticulatory information as part of the phonetic segment with which it co-occurs in time. The first hypothesis is supported by findings on a choice reaction time procedure; the second is supported by findings on a 4IAX discrimination test. Implications of the findings for theories of speech production, perception, and of the relation between the two are considered.
Article
While the fact that phonetic information is evaluated in a non-discrete, probabilistic fashion is well established, there is less consensus regarding how long such encoding is maintained. Here, we examined whether people maintain in memory the amount of vowel nasality present in a word when processing a subsequent word that holds a semantic dependency with the first one. Vowel nasality in English is an acoustic correlate of the oral vs. nasal status of an adjacent consonant, and sometimes it is the only distinguishing phonetic feature (e.g., bet vs. bent). In Experiment 1, we show that people can perceive differences in nasality between two vowels above and beyond differences in the categorization of those vowels. In Experiment 2, we tracked listeners’ eye-movements as they heard a sentence that mentioned one of four displayed images (e.g., ‘money’) following a prime word (e.g., ‘bet’) that held a semantic relationship with the target word. Recognition of the target was found to be modulated by the degree of nasality in the first word’s vowel: Slightly greater uncertainty regarding the oral status of the post-vocalic consonant in the first word translated into a weaker semantic cue for the identification of the second word. Thus, listeners appear to maintain in memory the degree of vowel nasality they perceived on the first word and bring this information to bear onto the interpretation of a subsequent, semantically-dependent word. Probabilistic cue integration across words that hold semantic coherence, we argue, contributes to achieving robust language comprehension despite the inherent ambiguity of the speech signal.
Book
For a machine to convert text into sounds that humans can understand as speech requires an enormous range of components, from abstract analysis of discourse structure to synthesis and modulation of the acoustic output. Work in the field is thus inherently interdisciplinary, involving linguistics, computer science, acoustics, and psychology. This collection of articles by leading researchers in each of the fields involved in text-to-speech synthesis provides a picture of recent work in laboratories throughout the world and of the problems and challenges that remain. By providing samples of synthesized speech as well as video demonstrations for several of the synthesizers discussed, the book will also allow the reader to judge what all the work adds up to -- that is, how good is the synthetic speech we can now produce? Topics covered include: Signal processing and source modeling Linguistic analysis Articulatory synthesis and visual speech Concatenative synthesis and automated segmentation Prosodic analysis of natural speech Synthesis of prosody Evaluation and perception Systems and applications.
Article
Although much is known about the linguistic function of vowel nasality, whether contrastive (as in French) or coarticulatory (as in English), and much effort has gone into identifying potential correlates for the phenomenon, this study examines these proposed features to find the optimal acoustic feature(s) for nasality measurement. To this end, a corpus of 4778 oral and nasal vowels in English and French was collected, and data for 22 features were extracted. A series of linear mixed-effects regressions highlighted three promising features with large oral-to-nasal feature differences and strong effects relative to normal oral vowel variability: A1-P0, F1's bandwidth, and spectral tilt. However, these three features, particularly A1-P0, showed considerable variation in baseline and range across speakers and vowels within each language. Moreover, although the features were consistent in direction across both languages, French speakers' productions showed markedly stronger effects, and showed evidence of spectral tilt beyond the nasal norm being used to enhance the oral-nasal contrast. These findings strongly suggest that the acoustic nature of vowel nasality is both language- and speaker-specific, and that, like vowel formants, nasality measurements require speaker normalization for across-speaker comparison, and that these acoustic properties should not be taken as constant across different languages.
Conference Paper
Intelligent Personal Assistants (IPAs) are widely available on devices such as smartphones. However, most people do not use them regularly. Previous research has studied the experiences of frequent IPA users. Using qualitative methods we explore the experience of infrequent users: people who have tried IPAs, but choose not to use them regularly. Unsurprisingly infrequent users share some of the experiences of frequent users, e.g. frustration at limitations on fully hands-free interaction. Significant points of contrast and previously unidentified concerns also emerge. Cultural norms and social embarrassment take on added significance for infrequent users. Humanness of IPAs sparked comparisons with human assistants, juxtaposing their limitations. Most importantly, significant concerns emerged around privacy, monetization, data permanency and transparency. Drawing on these findings we discuss key challenges, including: designing for interruptability; reconsideration of the human metaphor; issues of trust and data ownership. Addressing these challenges may lead to more widespread IPA use.
Article
In a variety of languages, changes in tongue height and breathiness have been observed to covary with nasalization in both phonetic and phonemic vowel nasality. It has been argued that this covariation stems from speakers using multiple articulations to enhance F1 modulation and/or from listeners misperceiving the articulatory basis for F1 modification. This study includes results from synchronous nasalance, ultrasound, EGG, and F1 data related to the realizations of the oral–nasal vowel pairs /ɛ/-/ɛ̃/, /a/-/ɑ̃/, and /o/-/ɔ̃/ of Southern French (SF) as produced by four male speakers in a laboratory setting. The aim of the study is to determine to what extent tongue height and breathiness covary with nasalization, as well as how these articulations affect the realization of F1. The following evidence is observed: (1) that nasalization, breathiness, and tongue height are used in idiosyncratic ways to distinguish F1 for each vowel pair; (2) that increased nasalization and breathiness significantly predict F1-lowering for all three nasal vowels; (3) that nasalization increases throughout the duration of the nasal vowels, supporting previous claims about the temporal nature of nasality in SF nasal vowels, but contradicting claims that SF nasal vowels comprise distinct oral and nasal elements; (4) that breathiness increases in a gradient manner as nasalization increases; and (5) that the acoustic and articulatory data provide limited support for claims of the existence of an excrescent nasal coda in SF nasal vowels. These results are discussed in the light of claims that the multiple articulatory components observed in the production of vowel nasalization may have arisen due to misperception-based sound change and/or to phonetic enhancement.
Article
The current study investigates correlations between individual differences in the production of nasal coarticulation and patterns of perceptual compensation in American English. A production study (Experiment 1) assessed participants' nasal coarticulation repertoires by eliciting productions of CVC, CVN and NVN words. Stimuli for two perception tasks were created by cross-splicing oral vowels (from C_C words), nasal vowels (from C_N words), and hypernasal vowels (from N_N words) into C_C, C_N, and N_N consonant contexts. Stimuli pairs were presented to listeners in a paired discrimination task (Experiment 2), where similarity of vowels was assessed, and a nasality ratings task (Experiment 3), where relative nasalization of vowels was judged. In the discrimination task, individual differences in produced nasal coarticulation predicted patterns of veridical acoustic perception. Individuals who produce less extensive anticipatory nasal coarticulation exhibit more veridical acoustic perception (indicating less compensation for coarticulation) than individuals who produce greater coarticulatory nasality. However, in the ratings task, listeners' produced nasal coarticulation did not predict perceptual patterns. Rather, more veridical perceptual response patterns were observed across participants in context-inappropriate coarticulatory conditions, i.e., for hypernasal vowels in C_N contexts (e.g., bẽn) and nasal vowels in N_N contexts (e.g., mẽn). The results of this study suggest a complex and multifaceted relationship between representations used to produce and perceive speech.
Article
In certain Eastern Algonquian languages a nasal vowel developed from the long low vowel/a:/, regardless of consonantal context. A series of experiments showed that longer vowels (regardless of height) were perceived as more nasalized than shorter ones, but only when some nasalization was actually present. Further experiments showed no evidence of an increase in nasalization for long vowels in oral contexts. If some nasalization was nonetheless introduced (either randomly or by a general increase in nasalization) into these languages, the vowels most likely to be perceived as nasalized were the long ones. This perceptual process may have been responsible for this unusual historical development.
Article
A program of research on the identifiable characteristics of speech sounds is described. The results of several previous investigations of this group are presented. With the use of synthetic speech equipment, the important characteristics of several classes of consonant sounds have been identified. The perception of a given consonant is a characteristic function of the vowel with which it is presented. Future directions of research and possible utilizations of results from synthetic speech experiments are discussed. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Although coarticulatory variation is largely systematic, and serves as useful information for listeners, such variation is nonetheless linked to sound change. This article explores the articulatory and perceptual interactions between a coarticulatory source and its effects, and how these interactions likely contribute to change. The focus is on the historical change VN (phonetically, ṼN) > Ṽ, but with more general attention to how a gesture associated with a source segment comes to be reinterpreted as distinctively, rather than coarticulatorily, associated with a nearby vowel or consonant. Two synchronic factors are hypothesized to contribute to reinterpretation: (i) articulatory covariation between the duration of the coarticulatory source (here, N) and the temporal extent of its effects (Ṽ), and (ii) perceived equivalence between source and effect. Experimental support for both hypotheses is provided. Additionally, the experimental data are linked to the historical situation by showing that the contextual conditions that trigger (i) and (ii) parallel the conditions that historically influence phonologization of vowel nasalization.
Article
Three experiments tested the hypothesis that V-to-V coarticulatory organization differs in Shona and English, and that Shona- and English-speaking listeners' sensitivity to V-to-V coarticulatory effects is correspondingly language-specific. An acoustic study of Shona and English CV′CVCV trisyllables (Experiment 1) showed that the two languages differ in the carryover vs. anticipatory influences of stressed and unstressed vowels on each other. In 4IAX discrimination tests in which both Shona and English coarticulatory effects were spliced into different coarticulatory contexts (Experiment 2), Shona and English listeners perceptually compensated more (i.e., attributed more of a vowel's acoustic properties to its coarticulatory context in targeted test trials) for effects that were consistent with their linguistic experience. Similarly, when these listeners identified synthetic target vowels embedded into different vowel contexts (Experiment 3), Shona listeners compensated more (i.e., showed larger category boundary shifts) for the vowel contexts that triggered larger acoustic influences in the production study. English listeners' boundary shifts were more complicated but, when these data were combined with those from a follow-up identification study, they showed the perceptual shifts expected on the basis of the English coarticulatory findings. Overall, the relation between the production and perception data suggests that listeners are attuned to native-language coarticulatory patterns.
Article
In New Zealand English there is a merger-in-progress of the near and square diphthongs. This paper investigates the consequences of this merger for speech perception. We report on an experiment involving the speech of four New Zealanders—two male, and two female. All four speakers make a distinction between near and square. Participants took part in a binary forced-choice identification task which included 20 near/square items produced by each of the four speakers. All participants were presented with identical auditory stimuli. However the visual presentation differed. Across four conditions, we paired each voice with a series of photos—an “older” looking photo, a “younger” looking photo, a “middle class” photo and a “working class” photo. The middle and working class photos were, in fact, photos of the same people, in different attire. In a fifth condition, participants completed the task with no associated photos. At the end of the identification task, each participant was recorded reading a near/square wordlist, containing the same items as appeared in the perception task. The results show that a wide range of factors influence accuracy in the perception task. These include participant-specific characteristics, word-specific characteristics, context-specific characteristics, and perceived speaker characteristics. We argue that, taken together, the results provide strong support for exemplar-based models of speech perception, in which exemplars are socially indexed.
Article
Forty-one Detroit-area residents were given perceptual tests in which they were asked to choose from a set of resynthesized vowels the tokens that they felt best matched the vowels they heard in the speech of a fellow Detroiter. Half of the respondents were told that the speaker was from Detroit, whereas half were told that she was from Canada. Respondents given the Canadian label chose raised-diphthong tokens as those present in the dialect of the speaker, whereas those given the Michigan label did not. Respondents given the Michigan label chose vowels that were quite different from the Northern Cities Chain-Shifted variety present in the speaker's dialect. Because the "speaker's" perceived nationality was the only aspect that varied between the two groups of respondents, this label alone must have caused the difference in the selection of tokens. This indicates that listeners use social information in speech perception.
Article
Many sound patterns in languages are cases of fossilized coarticulation, that is, synchronic or phonetic contextual variation became diachronic or phonological variation via sound change. An examination of languages' phonologies can therefore yield insights into the mechanisms of coarticulation. In this paper I discuss (a) the need to differentiate between phonological processes that are and are not due to coarticulation, (b) the need to differentiate between 'on-line' synchronic variation and comparable fossilized diachronic variation, (c) how to determine some of the constraints on coarticulation--especially the higher priority of maintaining acoustic-auditory, rather than articulatory, norms for the shape of speech elements, and (d) how coarticulation presents a "parsing" problem to the listener and, of course, to systems for automatic speech recognition.
Article
Acoustic analysis of nasalized vowels in the frequency domain indicates the presence of extra peaks: one between the first two formants with amplitude P1 and one at lower frequencies, often below the first formant, with amplitude P0. The first-formant amplitude A1 is also reduced relative to its amplitude for an oral vowel. These acoustic characteristics can be explained by speech production theory. The objective of this study was to determine the values for the acoustic correlates A1-P1 and A1-P0 (dB) for quantifying nasalization. They were tested as measures of nasalization by comparing vowels between nasal consonants and those between stop consonants for English speakers. Also, portions of nasal vowels following a stop consonant were compared for speakers of French, which makes a linguistic distinction between oral and nasal vowels. In the analysis of English, the mean difference of A1-P1 measured in oral vowels and nasalized vowels had a range of 10 dB-15 dB; the difference of A1-P0 had a range of 6 dB-8 dB. In the study of French, the difference of A1-P1 measured between the least-nasalized portion and the most-nasalized portion of the vowel had a range of 9 dB-12 dB; for A1-P0, the difference ranged between 3 dB and 9 dB. In order to obtain an absolute acoustic measure of nasalization that was independent of vowel type, normalized parameters were calculated by adjusting for the influence of the vowel formant frequencies.
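A hedged sketch of how A1-P0 and A1-P1 can be read off an FFT spectrum of a short vowel frame; the search bands, the window length, and the assumption that F1 and F2 are supplied by an external formant tracker are simplifications for illustration, not the measurement procedure used in the study:

```python
# Hedged sketch of the A1-P0 / A1-P1 nasality measures: A1 is the amplitude of
# the spectral peak near F1, P0 a low-frequency nasal peak below F1, and P1 a
# peak between F1 and F2. Bands and values below are illustrative assumptions.
import numpy as np

def band_peak_db(spectrum_db, freqs, lo, hi):
    """Return the highest spectral amplitude (dB) between lo and hi Hz."""
    band = (freqs >= lo) & (freqs <= hi)
    return spectrum_db[band].max()

def a1_minus_p0_p1(frame, sr, f1, f2):
    windowed = frame * np.hanning(len(frame))
    spec_db = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    a1 = band_peak_db(spec_db, freqs, f1 - 150, f1 + 150)   # harmonic nearest F1
    p0 = band_peak_db(spec_db, freqs, 150, 400)             # low nasal peak
    p1 = band_peak_db(spec_db, freqs, f1 + 150, f2 - 150)   # peak between F1 and F2
    return a1 - p0, a1 - p1

# Toy example: a 30 ms synthetic "vowel" frame at 16 kHz, assumed F1/F2 values.
sr = 16000
t = np.arange(int(0.030 * sr)) / sr
frame = np.sin(2 * np.pi * 600 * t) + 0.3 * np.sin(2 * np.pi * 250 * t)
print(a1_minus_p0_p1(frame, sr, f1=600, f2=1800))
```

Larger oral-to-nasal drops in A1-P0 and A1-P1 index greater nasal coupling, which is why the abstract reports these differences in dB ranges rather than absolute values, and why normalization for vowel quality is needed.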
Article
The conditions under which listeners do and do not compensate for coarticulatory vowel nasalization were examined through a series of experiments of listeners' perception of naturally produced American English oral and nasal vowels spliced into three contexts: oral (C_C), nasal (N_N), and isolation. Two perceptual paradigms, a rating task in which listeners judged the relative nasality of stimulus pairs and a 4IAX discrimination task in which listeners judged vowel similarity, were used with two listener groups, native English speakers and native Thai speakers. Thai and English speakers were chosen because their languages differ in the temporal extent of anticipatory vowel nasalization. Listeners' responses were highly context dependent. For both perceptual paradigms and both language groups, listeners were less accurate at judging vowels in nasal than in non-nasal (oral or isolation) contexts; nasal vowels in nasal contexts were the most difficult to judge. Response patterns were generally consistent with the hypothesis that, given an appropriate and detectable nasal consonant context, listeners compensate for contextual vowel nasalization and attribute the acoustic effects of the nasal context to their coarticulatory source. However, the results also indicated that listeners do not hear nasal vowels in nasal contexts as oral; listeners retained some sensitivity to vowel nasalization in all contexts, indicating partial compensation for coarticulatory vowel nasalization. Moreover, there were small but systematic differences between the native Thai- and native English-speaking groups. These differences are as expected if perceptual compensation is partial and the extent of compensation is linked to patterns of coarticulatory nasalization in the listeners' native language.
One voice is used, following the approach in prior studies of perceptual compensation (e.g., Beddor and Krakow, 1999; Zellou, 2017)
The perception of nasal vowels,” in Phonetics and Phonology: Nasals, Nasalization, and the Velum
  • P S Beddor
Beddor, P. S. (1993). "The perception of nasal vowels," in Phonetics and Phonology: Nasals, Nasalization, and the Velum, edited by M. K. Huffman and R. A. Krakow (Academic, San Diego), Vol. 5, pp. 171-196.
Praat vocal toolkit: A Praat plugin with automated scripts for voice processing
  • R Corretge
Corretge, R. (2012). "Praat vocal toolkit: A Praat plugin with automated scripts for voice processing [software package]," http://www.praatvocaltoolkit.com/index.html (Last viewed 12/1/2020).
Phonetic accommodation to natural and synthetic voices: Behavior of groups and individuals in speech shadowing
  • I Gessinger
  • E Raveh
  • I Steiner
  • B Möbius
Gessinger, I., Raveh, E., Steiner, I., and Möbius, B. (2021). "Phonetic accommodation to natural and synthetic voices: Behavior of groups and individuals in speech shadowing," Speech Commun. 127, 43-63.
Coarticulation and the perception of nasality
  • R Krakow
  • P S Beddor
Krakow, R., and Beddor, P. S. (1991). "Coarticulation and the perception of nasality," in Proceedings of the 12th International Congress of Phonetic Sciences, Vol. 5, pp. 38-41.