Contexts in source publication

Context 1
... modeled responses to nasal vowel syllables selected as CVCs with a mixed effects logistic regression model using the glmer() function in the lme4 package [21]. The model included two fixed effects predictors: Test Talker (2 levels: Human and Device) and Exposure Condition, corresponding to the prior exposure a given participant had with the test voice (5 levels: Shifted, Unshifted, Same Humanness, Different Humanness, No Exposure); see Figure 1 and Table 1. If listeners perceptually adapt to the shifted talker from exposure, they should be more likely to classify a CV syllable with a nasal vowel as a CVC, relative to their categorizations of that talker's nasal vowels in the control condition. ...
Context 2
... model also included by-subject random intercepts and by-subject random slopes for Test Talker. Table 1 presents the model output. ...
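A minimal sketch of how such a model might be specified in lme4, assuming hypothetical names for the data frame (responses) and its columns (cvc_response, test_talker, exposure, subject); the excerpt does not specify the contrast coding or whether an interaction term was included:

library(lme4)

# Hypothetical data frame 'responses' with one row per trial:
#   cvc_response: 1 if the nasal-vowel syllable was selected as a CVC, else 0
#   test_talker:  factor, Human vs. Device
#   exposure:     factor, Shifted / Unshifted / Same Humanness /
#                 Different Humanness / No Exposure
#   subject:      participant identifier
m <- glmer(
  cvc_response ~ test_talker + exposure +   # two fixed effects predictors
    (1 + test_talker | subject),            # by-subject random intercepts and
                                            # random slopes for Test Talker
  data = responses,
  family = binomial(link = "logit")         # logistic regression
)
summary(m)  # cf. the model output reported in Table 1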

Citations

... Yet, direct comparisons of TTS and naturally produced speech have revealed different responses to coarticulation. For instance, [16] found that nasal coarticulation (measured via acoustic nasality) present in TTS was more ambiguous than that in human speech. This led to distinct patterns of speech adaptation: listeners were more likely to shift their categorizations of nasalized vowels toward oral for TTS voices than for natural voices. ...
Conference Paper
The current study explores whether perception of coarticulatory vowel nasalization differs by speaker age (adult vs. child) and type of voice (naturally produced vs. synthetic speech). Listeners completed a 4IAX discrimination task between pairs containing acoustically identical (both nasal or oral) vowels and acoustically distinct (one oral, one nasal) vowels. Vowels occurred in either the same consonant contexts or different contexts across pairs. Listeners completed the experiment with either naturally produced speech or text-to-speech (TTS). For same-context trials, listeners were better at discriminating between oral and nasal vowels for child speech in the synthetic voices but for adult speech in the natural voices. Meanwhile, in different-context trials, listeners were less able to discriminate, indicating more perceptual compensation for synthetic voices. There was no difference in different-context discrimination across talker ages, indicating that listeners did not compensate differently if the speaker was a child or adult. Findings are relevant for models of compensation, computer personification theories, and speaker-indexical perception accounts.
... There is some work examining the perception of coarticulation in synthetic speech. For example, Hawkins and Slater (1994) found that listeners were better able to identify synthesized speech segments presented in background noise if they contained appropriate coarticulation (e.g., F2 lowering before /r/). More recently, Segedin et al. (2019) found that acoustic nasality in TTS voices generated for Amazon's Alexa devices was more phonetically ambiguous, with vowels before a nasal consonant containing much greater nasalization than in naturally produced human speech. Furthermore, they found that listeners adapted to a novel accent containing increased vowel nasalization to a larger extent for TTS voices relative to human voices, possibly as a result of the more ambiguous phonetic patterns. ...
... The current study tests how listeners perceive vowel nasality in synthesized speech developed for a voice-AI assistant [an Amazon Polly voice, one of the non-default voices], generated with concatenative and neural TTS. Holding the speaker dataset constant addresses a limitation of prior studies, where both the type of speech and the speaker vary: the productions might differ both in how ambiguous their phonetic patterns are and in how robotic they sound (e.g., Segedin et al., 2019). Across two studies using the same paradigm, experiment 1 directly compares these two TTS methods, whereas experiment 3 presents listeners with neural TTS (the more naturalistic-sounding synthesized speech) and a manipulated, roboticized neural TTS. ...
Article
This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, where listeners discriminated between a word pair containing either both oral or nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred either in identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behavior, respectively. For identical contexts, listeners were better at discriminating between oral and nasalized vowels in neural than in concatenative TTS for nasalized same-vowel trials, but better discrimination for concatenative TTS was observed for oral same-vowel trials. Meanwhile, listeners displayed less compensation for coarticulation in neural than in concatenative TTS. To determine whether apparent roboticity of the TTS voice shapes vowel discrimination and compensation patterns, a "roboticized" version of neural TTS was generated (monotonized f0 and addition of an echo), holding phonetic nasality constant; a ratings study (experiment 2) confirmed that the manipulation resulted in different apparent roboticity. Experiment 3 compared the discrimination of unmodified neural TTS and roboticized neural TTS: listeners displayed lower accuracy in identical contexts for roboticized relative to unmodified neural TTS, yet the performances in alternating contexts were similar.
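As an illustration of the echo half of that roboticizing manipulation only, below is a minimal R sketch using the tuneR package; the file names, delay, and decay values are hypothetical, and the f0 monotonization would typically be done separately via PSOLA resynthesis (e.g., in Praat) rather than as shown here:

library(tuneR)

# Add a single delayed, attenuated copy of the signal to itself (a simple echo).
# Assumes a mono WAV file; parameter values are illustrative only.
add_echo <- function(wave, delay_s = 0.05, decay = 0.4) {
  samples <- wave@left / 2^(wave@bit - 1)          # normalize to [-1, 1]
  shift   <- round(delay_s * wave@samp.rate)       # delay in samples
  delayed <- c(rep(0, shift), samples)[seq_along(samples)]
  echoed  <- samples + decay * delayed             # mix in the echo
  echoed  <- echoed / max(abs(echoed))             # rescale to avoid clipping
  Wave(left = round(echoed * (2^(wave@bit - 1) - 1)),
       samp.rate = wave@samp.rate, bit = wave@bit)
}

stim <- readWave("neural_tts_item.wav")            # hypothetical stimulus file
writeWave(add_echo(stim), "neural_tts_item_echo.wav")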