Figures
FIGURE 2 - uploaded by Georgia Zellou
Mean proportion and standard errors of AXB perceptual similarity ratings for the interaction between Model Talker Humanness (Device vs. Human) and Shadower Gender (F, M).

Source publication
Article
Full-text available
Speech alignment is where talkers subconsciously adopt the speech and language patterns of their interlocutor. Nowadays, people of all ages are speaking with voice-activated, artificially-intelligent (voice-AI) digital assistants through phones or smart speakers. This study examines participants’ age (older adults, 53–81 years old vs. younger adult...

Context in source publication

Context 1
... model also yielded several significant interactions. First, there was an interaction between Shadower Gender and Model Talker Humanness, which is plotted in Figure 2: male shadowers align more to human voices overall. Post-hoc pairwise comparison using lsmeans revealed that female shadowers show no difference in alignment toward device voices and human voices (z = 0.6, p = 0.5). ...
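The pairwise comparison quoted above contrasts two estimated marginal means on the model's link scale and reports a z statistic. A minimal sketch of that arithmetic in Python (the means and standard errors below are invented for illustration only; the study itself used the R package lsmeans on its fitted model):

```python
from scipy.stats import norm

# Hypothetical estimated marginal means (on the model's link scale) and
# standard errors for female shadowers' alignment toward each voice type.
# These values are invented for illustration, not the study's estimates.
emm_device, se_device = 0.52, 0.11
emm_human, se_human = 0.61, 0.10

# Pairwise contrast: difference of marginal means, normal approximation.
diff = emm_human - emm_device
se_diff = (se_device**2 + se_human**2) ** 0.5
z = diff / se_diff
p = 2 * (1 - norm.cdf(abs(z)))  # two-tailed p-value
print(f"z = {z:.2f}, p = {p:.2f}")
```

A z near 0.6 with p near 0.5, as in the excerpt, indicates no detectable difference between the two conditions for that group.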

Citations

... Still, this is an exciting result, which shows that gender effects can play an essential role in performance, which also happened to be the case on other occasions (Bräuer & Mazarakis, 2019). Age- and gender-related differences in the use of voice-activated artificially intelligent (voice-AI) devices have not gone unnoticed and are already being addressed in studies (Zellou, Cohn, & Ferenc Segedin, 2021). The need for future research in this area should already be noted at this point. ...
... A first starting point is provided by the additionally considered differences between the genders. In both fields of gamification and research on IVAs, initial studies address the effects of age and gender (Codish & Ravid, 2017; Jent & Janneck, 2018; Zellou et al., 2021). Our data also suggest that there might be differences in the motivational effect of audio-gamification due to these influencing factors. ...
Article
Full-text available
Intelligent virtual assistants (IVAs) like Amazon Alexa or Google Assistant have become increasingly popular in recent years, and research into the topic is growing accordingly. A major challenge in designing IVA applications is making them appealing. Gamification as a concept might help to boost motivation when using IVAs. Visual representation of progress and feedback is an essential component of gamification. When using IVAs, however, visual information is generally not available. To this end, this article reports the results of a lab experiment with 81 subjects describing how gamification, utilized entirely by audio, can assist subjects to work faster and improve motivation. Game design elements such as points and levels are integrated within an Alexa Skill via audio output to motivate subjects to complete household tasks. The results show a substantial effect on the subjects. Both their attitude and the processing time of the given tasks were positively influenced by the audio-gamification. The outcomes indicate that audio-gamification has a huge potential in the field of voice assistants. Differences in experimental conditions were also considered, but no statistical significance was found between the cooperative and competitive groups. Finally, we discuss how these insights affect IVA design principles and future research questions.
... Indeed, there is some support for technology equivalence accounts for linguistic behavior toward voice-AI. For instance, several recent studies have shown that people vocally align toward both voice-AI and human interlocutors (Snyder et al., 2019; Zellou, Cohn, & Ferenc Segedin, 2021; Zellou, Cohn, & Kline, 2021), and even display similar gender-based speech asymmetries (such as aligning more to male, than female, TTS and human voices in Cohn et al., 2019). Hence, an alternative prediction in the current study, based on technology equivalence accounts, is that speech patterns to voice-AI and human interlocutors will not differ. ...
... This parallels Raveh et al.'s (2019) interpretation of a difference in mean f0 for Alexa-DS, where they found higher mean f0 in Alexa-DS, as the Alexa had a female voice while the human was male. Indeed, recent work has examined and found that vocal alignment differs toward voice-AI and human interlocutors (Zellou, Cohn, & Ferenc Segedin, 2021). Here, acoustic analysis of the interlocutors confirmed that the human speaker had a higher mean f0 than the Siri voice, lending support for this interpretation. ...
... Prior work has shown variation in how people perceive and personify technological agents, such as robots (Hinz et al., 2019) and voice-AI (Cohn, Raveh, et al., 2020; Etzrodt & Engesser, 2021). Recently, some work has revealed differences in speech alignment toward voice-AI by speaker age (e.g., older vs. college-age adults in Zellou, Cohn, & Ferenc Segedin, 2021) and cognitive processing style (e.g., autistic-like traits in Snyder et al., 2019), suggesting these differences could shape voice-AI speech adaptation as well. ...
Article
Full-text available
Millions of people engage in spoken interactions with voice activated artificially intelligent (voice-AI) systems in their everyday lives. This study explores whether speakers have a voice-AI-specific register, relative to their speech toward an adult human. Furthermore, this study tests if speakers have targeted error correction strategies for voice-AI and human interlocutors. In a pseudo-interactive task with pre-recorded Siri and human voices, participants produced target words in sentences. In each turn, following an initial production and feedback from the interlocutor, participants repeated the sentence in one of three response types: after correct word identification, a coda error, or a vowel error made by the interlocutor. Across two studies, the rate of comprehension errors made by both interlocutors was varied (lower vs. higher error rate). Register differences are found: participants speak louder, with a lower mean f0, and with a smaller f0 range in Siri-DS. Many differences in Siri-DS emerged as dynamic adjustments over the course of the interaction. Additionally, error rate shapes how register differences are realized. One targeted error correction was observed: speakers produce more vowel hyperarticulation in coda repairs in Siri-DS. Taken together, these findings contribute to our understanding of speech register and the dynamic nature of talker-interlocutor interactions.
... Conversely, a recent study of the same phenomenon in spontaneous speech found that early Cantonese-English bilinguals were less likely to release final stops in English than non-Cantonese-English bilinguals [20]. These conflicting outcomes illustrate the need to examine variation in speech across styles and registers, as this variation has maximum utility for ASR systems and the development of NLP tools for speech and language, given how little is known about how talkers interact with such systems [21]. ...
... Modern voice-AI systems have highly human-like features, such as apparent gender [4] and personality traits [5]. People even assign an apparent speaker age to TTS voices: e.g., Siri voices are rated as being in their 40s or 50s [6] and Amazon Polly voices are rated as being in their 30s (female voice) or 50s (male voice) [7]. The present study investigates the extent to which voice age shapes how listeners perceive coarticulation in naturally produced and TTS voices. ...
Conference Paper
Full-text available
The current study explores whether perception of coarticulatory vowel nasalization differs by speaker age (adult vs. child) and type of voice (naturally produced vs. synthetic speech). Listeners completed a 4IAX discrimination task between pairs containing acoustically identical (both nasal or oral) vowels and acoustically distinct (one oral, one nasal) vowels. Vowels occurred in either the same consonant contexts or different contexts across pairs. Listeners completed the experiment with either naturally produced speech or text-to-speech (TTS). For same-context trials, listeners were better at discriminating between oral and nasal vowels for child speech in the synthetic voices but adult speech in the natural voices. Meanwhile, in different-context trials, listeners were less able to discriminate, indicating more perceptual compensation for synthetic voices. There was no difference in different-context discrimination across talker ages, indicating that listeners did not compensate differently if the speaker was a child or adult. Findings are relevant for models of compensation, computer personification theories, and speaker-indexical perception accounts.
... A growing body of research has begun to investigate the social, cognitive, and linguistic effects of humans interacting with voice-AI (Purington et al., 2017; Arnold et al., 2019; Cohn et al., 2019b; Burbach et al., 2019). For example, recent work has shown that listeners attribute human-like characteristics to the text-to-speech (TTS) output used for modern voice-AI, including personality traits (Lopatovska, 2020), apparent age (Cohn et al., 2020a; Zellou et al., 2021), and gender (Habler et al., 2019; Loideain and Adams, 2020). While the spread of voice-AI assistants is undeniable, particularly in the United States, there are many open scientific questions as to the nature of people's interactions with voice-AI. ...
... For example, people appear to apply politeness norms from human-human interaction to computers: giving more favorable ratings when a computer directly asks about its own performance, relative to when a different computer elicits this information (Nass et al., 1994; Hoffmann et al., 2009). In line with technology equivalence accounts, there is some evidence for applied social behaviors to voice-AI in the way people adjust their speech, such as gender-mediated vocal alignment (Cohn et al., 2019b; Zellou et al., 2021). In the present study, one prediction from technology equivalence accounts is that people will adjust their speech patterns when talking to voice-AI and humans in similar ways if the communicative context is controlled. ...
... First, the interlocutor introduced themselves and then went through voice-over instructions with the participant. Participants saw an image corresponding to the interlocutor category: stock images of "adult female" (used in prior work; Zellou et al., 2021) and "Amazon Alexa" (2nd Generation Black Echo). FIGURE 1 | Interaction trial schematic. ...
Article
Full-text available
The current study tests whether individuals (n = 53) produce distinct speech adaptations during pre-scripted spoken interactions with a voice-AI assistant (Amazon’s Alexa) relative to those with a human interlocutor. Interactions crossed intelligibility pressures (staged word misrecognitions) and emotionality (hyper-expressive interjections) as conversation-internal factors that might influence participants’ intelligibility adjustments in Alexa- and human-directed speech (DS). Overall, we find speech style differences: Alexa-DS has a decreased speech rate, higher mean f0, and greater f0 variation than human-DS. In speech produced toward both interlocutors, adjustments in response to misrecognition were similar: participants produced more distinct vowel backing (enhancing the contrast between the target word and misrecognition) in target words and louder, slower, higher mean f0, and higher f0 variation at the sentence-level. No differences were observed in human- and Alexa-DS following displays of emotional expressiveness by the interlocutors. Expressiveness, furthermore, did not mediate intelligibility adjustments in response to a misrecognition. Taken together, these findings support proposals that speakers presume voice-AI has a “communicative barrier” (relative to human interlocutors), but that speakers adapt to conversational-internal factors of intelligibility similarly in human- and Alexa-DS. This work contributes to our understanding of human-computer interaction, as well as theories of speech style adaptation.
... Millions of users now speak to voice-AI to complete daily tasks (e.g., play music, turn on lights, set timers) (Ammari et al., 2019). Given their presence in many individuals' everyday lives, some researchers have aimed to uncover the cognitive, social, and linguistic factors involved in voice-AI interactions by examining task-based interactions with voice-AI (e.g., setting an appointment on a calendar in Raveh et al., 2019), scripted interactions in laboratory settings (Cohn et al., 2019; Zellou et al., 2021), and interviews to probe how people perceive voice-AI (Lovato and Piper, 2015; Purington et al., 2017; Abdolrahmani et al., 2018). Yet, our scientific understanding of non-task based, or purely social, interactions with voice-AI is even less established. ...
... Voice-AI systems are already imbued with multiple human-like features: they have names and apparent genders (Habler et al., 2019), and they interact with users using spoken language. Indeed, there is some evidence that individuals engage with voice-AI in ways that parallel the ways they engage with humans (e.g., gender-asymmetries in phonetic alignment in Cohn et al., 2019; Zellou et al., 2021). In the case of voice-AI socialbots, the cues of humanity could be even more robust since the system is designed for social interaction. ...
... Entrainment has been previously observed both in human-human (Levitan and Hirschberg, 2011; Babel and Bulatov, 2012; Lubold and Pon-Barry, 2014; Levitan et al., 2015; Pardo et al., 2017) and human-computer interaction (Coulston et al., 2002; Bell et al., 2003; Branigan et al., 2011; Fandrianto and Eskenazi, 2012; Thomason et al., 2013; Cowan et al., 2015; Gessinger et al., 2017; Gessinger et al., 2021), suggesting it is a behavior transferred to interactions with technology. Recent work has shown that entrainment occurs in interactions with voice-AI assistants as well (Cohn et al., 2019; Raveh et al., 2019; Zellou et al., 2021). Like hyperarticulation, there are some accounts proposing that entrainment improves intelligibility (Pickering and Garrod, 2006), aligning representations between interlocutors. ...
Article
Full-text available
This paper investigates users’ speech rate adjustments during conversations with an Amazon Alexa socialbot in response to situational (in-lab vs. at-home) and communicative (ASR comprehension errors) factors. We collected user interaction studies and measured speech rate at each turn in the conversation and in baseline productions (collected prior to the interaction). Overall, we find that users slow their speech rate when talking to the bot, relative to their pre-interaction productions, consistent with hyperarticulation. Speakers use an even slower speech rate in the in-lab setting (relative to at-home). We also see evidence for turn-level entrainment: the user follows the directionality of Alexa’s changes in rate in the immediately preceding turn. Yet, we do not see differences in hyperarticulation or entrainment in response to ASR errors, or on the basis of user ratings of the interaction. Overall, this work has implications for human-computer interaction and theories of linguistic adaptation and entrainment.
Article
This article details the methodology behind the Manchester Voices Accent Van, and the accompanying online Virtual Van. In 2021, the project travelled around Greater Manchester in a van converted into a mobile recording booth, asking people to climb aboard and take part in an unsupervised interview about language and identity in the region. Participants could also take part from their own home through a bespoke website, called the Virtual Van, which asked the same interview questions as the physical Van and recorded speakers through their computer/phone microphone. With a view to informing others who might want to use similar methods in the future, we present a detailed description of the methodology here, as well as an overview and sample of the data collected. We conclude with a reflection on the elements of the data collection that went well, and a discussion of improvements and considerations for future research using this methodology.
Article
Two studies investigated the influence of conversational role on phonetic imitation toward human and voice-AI interlocutors. In a Word List Task, the giver instructed the receiver on which of two lists to place a word; this dialogue task is similar to simple spoken interactions users have with voice-AI systems. In a Map Task, participants completed a fill-in-the-blank worksheet with the interlocutors, a more complex interactive task. Participants completed the task twice with both interlocutors, once as giver-of-information and once as receiver-of-information. Phonetic alignment was assessed through similarity rating, analysed using mixed effects logistic regressions. In the Word List Task, participants aligned to a greater extent toward the human interlocutor only. In the Map Task, participants as giver only aligned more toward the human interlocutor. Results indicate that phonetic alignment is mediated by the type of interlocutor and that the influence of conversational role varies across tasks and interlocutors.
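The analyses described in this abstract model binary AXB similarity judgments as a function of interlocutor type and conversational role. A rough sketch of that setup in Python, using simulated data and a plain fixed-effects logistic regression (the studies themselves fit mixed effects logistic regressions, which would additionally include random effects for raters and items):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400

# Simulated AXB trials: which interlocutor the shadower spoke with,
# and the shadower's conversational role in that task.
df = pd.DataFrame({
    "interlocutor": rng.choice(["human", "voice_ai"], n),
    "role": rng.choice(["giver", "receiver"], n),
})

# Simulated binary outcome: 1 = rater judged the shadowed token more
# similar to the model talker. Effect sizes here are arbitrary.
eta = (-0.2
       + 0.5 * (df["interlocutor"] == "human")
       + 0.3 * ((df["interlocutor"] == "human") & (df["role"] == "giver")))
df["aligned"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Logistic regression with an interlocutor-by-role interaction.
model = smf.logit("aligned ~ interlocutor * role", data=df).fit(disp=False)
print(model.params.round(2))
```

A significant interaction coefficient in such a model is what motivates the post-hoc pairwise contrasts (e.g., via lsmeans in R) reported in the excerpts above.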