Article

The influence of conversational role on phonetic alignment toward voice-AI and human interlocutors


Abstract

Two studies investigated the influence of conversational role on phonetic imitation toward human and voice-AI interlocutors. In a Word List Task, the giver instructed the receiver on which of two lists to place a word; this dialogue task is similar to the simple spoken interactions users have with voice-AI systems. In a Map Task, participants completed a fill-in-the-blank worksheet with the interlocutors, a more complex interactive task. Participants completed the tasks twice with each interlocutor, once as giver-of-information and once as receiver-of-information. Phonetic alignment was assessed through similarity ratings, analysed using mixed effects logistic regressions. In the Word List Task, participants showed greater alignment toward the human interlocutor only. In the Map Task, participants aligned more toward the human interlocutor only in the giver role. Results indicate that phonetic alignment is mediated by interlocutor type and that the influence of conversational role varies across tasks and interlocutors.
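As a worked illustration of the analysis named in the abstract, a mixed effects logistic regression over binary similarity judgments might take the following form; the predictor set (interlocutor, role, and their interaction) and the random-effects structure are assumptions for exposition, not taken from the paper.

```latex
% Illustrative mixed-effects logistic regression for alignment judgments.
% aligned_{ijk}: rater i judges word j in condition k as more similar to
% the interlocutor after exposure (1) or not (0). Predictors are assumed.
\[
\operatorname{logit} P(\mathrm{aligned}_{ijk}=1)
  = \beta_0
  + \beta_1\,\mathrm{Interlocutor}_k
  + \beta_2\,\mathrm{Role}_k
  + \beta_3\,(\mathrm{Interlocutor}\times\mathrm{Role})_k
  + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0,\sigma_u^2),\;
v_j \sim \mathcal{N}(0,\sigma_v^2)
\]
```

Under this specification, a reliable Interlocutor term would correspond to the reported pattern of greater alignment toward the human interlocutor, and the interaction term to the role effect in the Map Task.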

... Indeed, there is some support for technology equivalence accounts for linguistic behavior toward voice-AI. For instance, several recent studies have shown that people vocally align toward both voice-AI and human interlocutors (Snyder et al., 2019; Zellou, Cohn, & Ferenc Segedin, 2021; Zellou, Cohn, & Kline, 2021), and even display similar gender-based speech asymmetries (such as aligning more to male, than female, TTS and human voices in Cohn et al., 2019). Hence, an alternative prediction in the current study, based on technology equivalence accounts, is that speech patterns to voice-AI and human interlocutors will not differ. ...
Article
Full-text available
Millions of people engage in spoken interactions with voice activated artificially intelligent (voice-AI) systems in their everyday lives. This study explores whether speakers have a voice-AI-specific register, relative to their speech toward an adult human. Furthermore, this study tests if speakers have targeted error correction strategies for voice-AI and human interlocutors. In a pseudo-interactive task with pre-recorded Siri and human voices, participants produced target words in sentences. In each turn, following an initial production and feedback from the interlocutor, participants repeated the sentence in one of three response types: after correct word identification, a coda error, or a vowel error made by the interlocutor. Across two studies, the rate of comprehension errors made by both interlocutors was varied (lower vs. higher error rate). Register differences are found: participants speak louder, with a lower mean f0, and with a smaller f0 range in Siri-DS. Many differences in Siri-DS emerged as dynamic adjustments over the course of the interaction. Additionally, error rate shapes how register differences are realized. One targeted error correction was observed: speakers produce more vowel hyperarticulation in coda repairs in Siri-DS. Taken together, these findings contribute to our understanding of speech register and the dynamic nature of talker-interlocutor interactions.
... In practice, context is rarely explicated and is frequently used as a catch-all explanation for findings. As empirical studies have found both similarities (Ho et al., 2018; S. K. Lee et al., 2021; Meng & Dai, 2021; Xu, 2019) and differences between HMC and interpersonal communication (Edwards & Edwards, 2022; Jia et al., 2022; Kim & Song, 2021; Liu & Wei, 2021; van Straten et al., 2021; van Straten et al., 2022; Zellou et al., 2021), we argue that explication of the HMC context in relation to the interpersonal corollary is necessary to provide meaningful and nuanced explanations for groups of findings, and, ultimately, to build theories of both HMC and interpersonal communication. ...
Article
Full-text available
The proliferation and integration of social technologies has occurred quickly, and the specific technologies with which we engage are ever-changing. The dynamic nature of the development and use of social technologies is often acknowledged by researchers as a limitation. In this manuscript, however, we present a discussion on the implications of our modern technological context by focusing on processes of socialization and communication that are fundamentally different from their interpersonal corollary. These are presented and discussed with the goal of providing theoretical building blocks toward a more robust understanding of phenomena of human-computer interaction, human-robot interaction, human-machine communication, and interpersonal communication.
Article
Full-text available
Speech alignment is where talkers subconsciously adopt the speech and language patterns of their interlocutor. Nowadays, people of all ages are speaking with voice-activated, artificially-intelligent (voice-AI) digital assistants through phones or smart speakers. This study examines participants’ age (older adults, 53–81 years old vs. younger adults, 18–39 years old) and gender (female and male) on degree of speech alignment during shadowing of (female and male) human and voice-AI (Apple’s Siri) productions. Degree of alignment was assessed holistically via a perceptual ratings AXB task by a separate group of listeners. Results reveal that older and younger adults display distinct patterns of alignment based on humanness and gender of the human model talkers: older adults displayed greater alignment toward the female human and device voices, while younger adults aligned to a greater extent toward the male human voice. Additionally, there were other gender-mediated differences observed, all of which interacted with model talker category (voice-AI vs. human) or shadower age category (OA vs. YA). Taken together, these results suggest a complex interplay of social dynamics in alignment, which can inform models of speech production both in human-human and human-device interaction.
Article
Full-text available
The present study investigates whether native speakers of German phonetically accommodate to natural and synthetic voices in a shadowing experiment. We aim to determine whether this phenomenon, which is frequently found in HHI, also occurs in HCI involving synthetic speech. The examined features pertain to different phonetic domains: allophonic variation, schwa epenthesis, realization of pitch accents, word-based temporal structure and distribution of spectral energy. On the individual level, we found that the participants converged to varying subsets of the examined features, while they maintained their baseline behavior in other cases or, in rare instances, even diverged from the model voices. This shows that accommodation with respect to one particular feature may not predict the behavior with respect to another feature. On the group level, the participants of the natural condition converged to all features under examination, however very subtly so for schwa epenthesis. The synthetic voices, while partly reducing the strength of effects found for the natural voices, triggered accommodating behavior as well. The predominant pattern for all voice types was convergence during the interaction followed by divergence after the interaction.
Conference Paper
Full-text available
Increasingly, people are having conversational interactions with voice-AI systems, such as Amazon's Alexa. Do the same social and functional pressures that mediate alignment toward human interlocutors also predict alignment patterns toward voice-AI? We designed an interactive dialogue task to investigate this question. Each trial consisted of scripted, interactive turns between a participant and a model talker (pre-recorded from either a natural production or voice-AI): First, participants produced target words in a carrier phrase. Then, a model talker responded with an utterance containing the target word. The interlocutor responses varied by 1) communicative affect (social) and 2) correctness (functional). Finally, participants repeated the carrier phrase. Degree of phonetic alignment was assessed acoustically between the target word in the model's response and participants' response. Results indicate that social and functional factors distinctly mediate alignment toward AI and humans. Findings are discussed with reference to theories of alignment and human-computer interaction.
Conference Paper
Full-text available
The current study tests subjects' vocal alignment toward female and male text-to-speech (TTS) voices presented via three systems: Amazon Echo, Nao, and Furhat. These systems vary in their physical form, ranging from a cylindrical speaker (Echo), to a small robot (Nao), to a human-like robot bust (Furhat). We test whether this cline of personification (cylinder < mini robot < human-like robot bust) predicts patterns of gender-mediated vocal alignment. In addition to comparing multiple systems, this study addresses a confound in many prior vocal alignment studies by using identical voices across the systems. Results show evidence for a cline of personification toward female TTS voices by female shadowers (Echo < Nao < Furhat) and a more categorical effect of device personification for male TTS voices by male shadowers (Echo < Nao, Furhat). These findings are discussed in terms of their implications for models of device-human interaction and theories of computer personification.
Article
Full-text available
The computers are social actors framework (CASA), derived from the media equation, explains how people communicate with media and machines demonstrating social potential. Many studies have challenged CASA, yet it has not been revised. We argue that CASA needs to be expanded because people have changed, technologies have changed, and the way people interact with technologies has changed. We discuss the implications of these changes and propose an extension of CASA. Whereas CASA suggests humans mindlessly apply human-human social scripts to interactions with media agents, we argue that humans may develop and apply human-media social scripts to these interactions. Our extension explains previous dissonant findings and expands scholarship regarding human-machine communication, human-computer interaction, human-robot interaction, human-agent interaction, artificial intelligence, and computer-mediated communication.
Conference Paper
Full-text available
In human-human interactions, the situational context plays a large role in the degree of speakers' accommodation. In this paper, we investigate whether the degree of accommodation in a human-robot computer game is affected by (a) the duration of the interaction and (b) the success of the players in the game. 30 teams of two players played two card games with a conversational robot in which they had to find the correct order of five cards. After Game 1, the players received the result of the game on a success scale from 1 (lowest success) to 5 (highest). Speakers' f0 accommodation was measured as the Euclidean distance between each human speaker and each of their interlocutors (the other human and the robot). Results revealed that (a) the duration of the game had no influence on the degree of f0 accommodation and (b) the result of Game 1 correlated with the degree of f0 accommodation in Game 2 (higher success corresponded to lower Euclidean distance). We argue that game success is most likely taken as a sign of the success of the players' cooperation during the discussion, which leads to greater accommodation in speech.
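A minimal Python sketch of the distance measure described above, assuming each speaker is summarized by a small vector of f0 statistics per game (the exact feature set is not stated in the abstract):

```python
import numpy as np

def f0_distance(speaker_f0: np.ndarray, partner_f0: np.ndarray) -> float:
    """Euclidean distance between two talkers' f0 feature vectors;
    smaller distance is read as greater accommodation. The features
    used here (mean f0 and f0 SD, in Hz) are illustrative assumptions."""
    return float(np.linalg.norm(speaker_f0 - partner_f0))

# Hypothetical per-game summaries: [mean f0, f0 SD]
human_player = np.array([210.0, 28.0])
robot = np.array([195.0, 22.0])
print(f0_distance(human_player, robot))  # ~16.2
```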
Conference Paper
Full-text available
The current study explores the extent to which humans vocally align to digital device voices (i.e., Apple's Siri) and human voices. First, participants shadowed word productions by 4 model talkers: a female and a male digital device voice, and a female and a male real human voice. Second, an independent group of raters completed an AXB task assessing the perceptual similarity of imitators' pre- and post-exposure items to the model talkers' productions. Results show that people do imitate device voices, but to a lesser degree than they imitate real human voices. Furthermore, similar social factors mediated vocal imitation toward both device and human voices: people imitated male device and human voices to a greater extent than female device and human voices.
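The AXB measure used here (and in several of the studies above) can be scored as the proportion of trials on which raters pick the post-exposure token as more similar to the model, tested against chance. A minimal sketch under that assumption:

```python
from scipy.stats import binomtest

def axb_alignment(post_chosen: int, n_trials: int):
    """Score an AXB block: X is the model token; A and B are a shadower's
    pre- and post-exposure productions. post_chosen counts trials on which
    raters judged the post-exposure token more similar to the model;
    above-chance rates indicate vocal alignment."""
    result = binomtest(post_chosen, n_trials, p=0.5, alternative="greater")
    return post_chosen / n_trials, result.pvalue

# Hypothetical data: 74 of 120 trials favored the post-exposure token.
proportion, p_value = axb_alignment(74, 120)
print(f"{proportion:.2f} post-preference, p = {p_value:.3f}")
```

The studies above typically model such judgments with mixed effects regressions rather than a simple binomial test; this sketch shows only the scoring logic.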
Conference Paper
Full-text available
This study examines how the presence of other speakers affects the interaction with a spoken dialogue system. We analyze participants’ speech regarding several phonetic features, viz., fundamental frequency, intensity, and articulation rate, in two conditions: with and without additional speech input from a human confederate as a third interlocutor. The comparison was made via tasks performed by participants using a commercial voice assistant under both conditions in alternation. We compare the distributions of the features across the two conditions to investigate whether speakers behave differently when a confederate is involved. Temporal analysis exposes continuous changes in the feature productions. In particular, we measured overall accommodation between the participants and the system throughout the interactions. Results show significant differences in a majority of cases for two of the three features, which are more pronounced in cases where the user first interacted with the device alone. We also analyze factors such as the task performed, participant gender, and task order, providing additional insight into the participants’ behavior.
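One plausible way to compare feature distributions across the two conditions described above is a per-feature nonparametric test; the feature names and data below are illustrative, not the study's:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical per-utterance measurements, one sample per condition.
features = {
    "f0_mean_hz": (rng.normal(200, 20, 80), rng.normal(208, 20, 80)),
    "intensity_db": (rng.normal(62, 4, 80), rng.normal(64, 4, 80)),
    "artic_rate_syl_s": (rng.normal(4.2, 0.5, 80), rng.normal(4.1, 0.5, 80)),
}
for name, (alone, with_confederate) in features.items():
    stat, p = mannwhitneyu(alone, with_confederate)
    print(f"{name}: U = {stat:.0f}, p = {p:.3f}")
```

A within-participant design like the one described would ordinarily also license paired tests or mixed models; the two-sample test here is just the simplest distributional comparison.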
Conference Paper
Full-text available
This paper presents a Wizard-of-Oz experiment designed to study phonetic accommodation in human-computer interaction. The experiment comprises a dynamic exchange of information between a human interlocutor and a supposedly intelligent system, while allowing for planned manipulation of the system's speech output on the level of phonetic detail. In the current configuration of the experiment, we are targeting convergence in allophonic contrasts and phenomena of local prosody. A study was conducted with 12 German native speakers. The results of a map task show highly speaker dependent behavior for the contrast [ɪç] vs. [ɪk] occurring in the German suffix <-ig>: during the baseline production of the target items, speakers either consistently choose one allophone or use both interchangeably. When conversing with the system, some speakers converge to its speech output, while others maintain their preferred variant or even diverge. This reflects individual variation observed in previous work on accommodation.
Article
Full-text available
Voice has become a widespread and commercially viable interaction mechanism with the introduction of voice assistants (VAs), such as Amazon’s Alexa, Apple’s Siri, Google Assistant, and Microsoft’s Cortana. Despite their prevalence, we do not have a detailed understanding of how these technologies are used in domestic spaces. To understand how people use VAs, we conducted interviews with 19 users, and analyzed the log files of 82 Amazon Alexa devices, totaling 193,665 commands, and 88 Google Home Devices, totaling 65,499 commands. In our analysis, we identified music, search, and IoT usage as the command categories most used by VA users. We explored how VAs are used in the home, investigated the role of VAs as scaffolding for Internet of Things device control, and characterized emergent issues of privacy for VA users. We conclude with implications for the design of VAs and for future research studies of VAs.
Article
Full-text available
Phonetic convergence is a form of variation in speech production in which a talker adopts aspects of another talker’s acoustic–phonetic repertoire. To date, this phenomenon has been investigated in non-interactive laboratory tasks extensively and in conversational interaction to a lesser degree. The present study directly compares phonetic convergence in conversational interaction and in a non-interactive speech shadowing task among a large set of talkers who completed both tasks, using a holistic AXB perceptual similarity measure. Phonetic convergence occurred in a new role-neutral conversational task, exhibiting a subtle effect with high variability across talkers that is typical of findings reported in previous research. Conversational phonetic convergence did not differ by talker sex on average, but relationships between speech shadowing and conversational convergence differed according to talker sex, with female talkers showing no consistency across settings in their relative levels of convergence and male talkers showing a modest relationship. These findings indicate that phonetic convergence is not directly comparable across settings, and that phonetic convergence of female talkers in particular is sensitive to differences across settings. Overall, patterns of acoustic–phonetic variation and convergence observed both within and between different settings of language use are inconsistent with accounts of automatic perception-production integration.
Article
Full-text available
This study consolidates findings on phonetic convergence in a large-scale examination of the impacts of talker sex, word frequency, and model talkers on multiple measures of convergence. A survey of nearly three dozen published reports revealed that most shadowing studies used very few model talkers and did not assess whether phonetic convergence varied across same- and mixed-sex pairings. Furthermore, some studies have reported effects of talker sex or word frequency on phonetic convergence, but others have failed to replicate these effects or have reported opposing patterns. In the present study, a set of 92 talkers (47 female) shadowed either same-sex or opposite-sex models (12 talkers, six female). Phonetic convergence was assessed in a holistic AXB perceptual-similarity task and in acoustic measures of duration, F0, F1, F2, and the F1 × F2 vowel space. Across these measures, convergence was subtle, variable, and inconsistent. There were no reliable main effects of talker sex or word frequency on any measures. However, female shadowers were more susceptible to lexical properties than were males, and model talkers elicited varying degrees of phonetic convergence. Mixed-effects regression models confirmed the complex relationships between acoustic and holistic perceptual measures of phonetic convergence. In order to draw broad conclusions about phonetic convergence, studies should employ multiple models and shadowers (both male and female), balanced multisyllabic items, and holistic measures. As a potential mechanism for sound change, phonetic convergence reflects complexities in speech perception and production that warrant elaboration of the underspecified components of current accounts.
Article
Full-text available
Maximum likelihood or restricted maximum likelihood (REML) estimates of the parameters in linear mixed-effects models can be determined using the lmer function in the lme4 package for R. As for most model-fitting functions in R, the model is described in an lmer call by a formula, in this case including both fixed- and random-effects terms. The formula and data together determine a numerical representation of the model from which the profiled deviance or the profiled REML criterion can be evaluated as a function of some of the model parameters. The appropriate criterion is optimized, using one of the constrained optimization functions in R, to provide the parameter estimates. We describe the structure of the model, the steps in evaluating the profiled deviance or REML criterion, and the structure of classes or types that represent such a model. Sufficient detail is included to allow specialization of these structures by users who wish to write functions to fit specialized linear mixed models, such as models incorporating pedigrees or smoothing splines, that are not easily expressible in the formula language used by lmer.
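The model lmer fits can be stated compactly; this restates the standard lme4 formulation sketched in the abstract, with \Lambda_\theta the relative covariance factor of the random effects:

```latex
% Linear mixed-effects model as fit by lmer: fixed effects X\beta,
% random effects Zb, residual variance \sigma^2.
\[
(\mathbf{y} \mid \mathbf{b}) \sim
  \mathcal{N}\!\left(X\boldsymbol\beta + Z\mathbf{b},\; \sigma^2 I\right),
\qquad
\mathbf{b} \sim
  \mathcal{N}\!\left(\mathbf{0},\; \sigma^2 \Lambda_\theta \Lambda_\theta^{\top}\right)
\]
```

Profiling the deviance over \boldsymbol\beta and \sigma^2 leaves an objective in \theta alone, which is what the constrained optimizer minimizes.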
Conference Paper
Full-text available
In conversation, speakers have been shown to entrain, or become more similar to each other, in various ways. We measure entrainment on eight acoustic features extracted from the speech of subjects playing a cooperative computer game and associate the degree of entrainment with a number of manually-labeled social variables acquired using Amazon Mechanical Turk, as well as objective measures of dialogue success. We find that male-female pairs entrain on all features, while male-male pairs entrain only on particular acoustic features (intensity mean, intensity maximum and syllables per second). We further determine that entrainment is more important to the perception of female-male social behavior than it is for same-gender pairs, and it is more important to the smoothness and flow of male-male dialogue than it is for female-female or mixed-gender pairs. Finally, we find that entrainment is more pronounced when intensity or speaking rate is especially high or low.
Article
Full-text available
Speech produced in the context of real or imagined communicative difficulties is characterized by hyperarticulation. Phonological neighborhood density (ND) conditions similar patterns in production: Words with many neighbors are hyperarticulated relative to words with fewer; Hi ND words also show greater coarticulation than Lo ND words [e.g., Scarborough, R. (2012). "Lexical similarity and speech production: Neighborhoods for nonwords," Lingua 122(2), 164-176]. Coarticulatory properties of "clear speech" are more variable across studies. This study examined hyperarticulation and nasal coarticulation across five real and simulated clear speech contexts and two neighborhood conditions, and investigated consequences of these details for word perception. The data revealed a continuum of (attempted) clarity, though real listener-directed speech (Real) differed from all of the simulated styles. Like the clearest simulated-context speech (spoken "as if to someone hard-of-hearing"-HOH), Real had greater hyperarticulation than other conditions. However, Real had the greatest coarticulatory nasality while HOH had the least. Lexical decisions were faster for words from Real than from HOH, indicating that speech produced in real communicative contexts (with hyperarticulation and increased coarticulation) was perceptually better than simulated clear speech. Hi ND words patterned with Real in production, and Real Hi ND words were clear enough to overcome the dense neighborhood disadvantage.
Article
Full-text available
Speech alignment, or the tendency of individuals to subtly imitate each other's speaking styles, is often assessed by comparing a subject's baseline and shadowed utterances to a model's utterances, often through perceptual ratings. These types of comparisons provide information about the occurrence of a change in subject's speech, but they do not indicate that this change is toward the specific shadowed model. In three experiments, we investigated whether alignment is specific to a shadowed model. Experiment 1 involved the classic baseline-to-shadowed comparison, to confirm that subjects did, in fact, sound more like their model when they shadowed, relative to any preexisting similarities between a subject and a model. Experiment 2 tested whether subjects' utterances sounded more similar to the model whom they had shadowed or to another, unshadowed model. In Experiment 3, we examined whether subjects' utterances sounded more similar to the model whom they had shadowed or to another subject who had shadowed a different model. The results of all experiments revealed that subjects sounded more similar to the model whom they had shadowed. This suggests that shadowing-based speech alignment is not just a change, but a change in the direction of the shadowed model, specifically.
Article
Full-text available
The current study examined phonetic convergence when talkers alternated roles during conversational interaction. The talkers completed a map navigation task in which they alternated instruction Giver and Receiver roles across multiple map pairs. Previous studies found robust effects of the role of a talker on phonetic convergence, and it was hypothesized that role-switching would either reduce the impact of role or elicit alternating patterns of role-induced conversational dominance and accommodation. In contrast to the hypothesis, the initial role assignments induced a pattern of conversational dominance that persisted throughout the interaction in terms of the amount of time spent talking—Original Givers dominated amount of time talking consistently, even when they acted as Receivers. These results indicate that conversational dominance does not necessarily follow nominal role when roles alternate, and that talkers are influenced by initial role assignment when making acoustic-phonetic adjustments in their speech.
Article
Full-text available
This study explores phonetic convergence during conversations between pairs of talkers with varying language distance. Specifically, we examined conversations between two native English talkers and between two native Korean talkers who had either the same or different regional dialects, and between native and nonnative talkers of English. To measure phonetic convergence, an independent group of listeners judged the similarity of utterance samples from each talker through an XAB perception test, in which X was a sample of one talker's speech and A and B were samples from the other talker at either early or late portions of the conversation. The results showed greater convergence for same-dialect pairs than for either the different-dialect pairs or the different-L1 pairs. These results generally support the hypothesis that there is a relationship between phonetic convergence and interlocutor language distance. We interpret this pattern as suggesting that phonetic convergence between talker pairs that vary in the degree of their initial language alignment may be dynamically mediated by two parallel mechanisms: the need for intelligibility and the extra demands of nonnative speech production and perception.
Conference Paper
Full-text available
This paper presents a new experimental paradigm for the study of human-computer interaction. Five experiments provide evidence that individuals' interactions with computers are fundamentally social. The studies show that social responses to computers are not the result of conscious beliefs that computers are human or human-like. Moreover, such behaviors do not result from users' ignorance or from psychological or social dysfunctions, nor from a belief that subjects are interacting with programmers. Rather, social responses to computers are commonplace and easy to generate. The results reported here present numerous and unprecedented hypotheses, unexpected implications for design, new approaches to usability testing, and direct methods for verification.
Article
Full-text available
Most theories of spoken word identification assume that variable speech signals are matched to canonical representations in memory. To achieve this, idiosyncratic voice details are first normalized, allowing direct comparison of the input to the lexicon. This investigation assessed both explicit and implicit memory for spoken words as a function of speakers' voices, delays between study and test, and levels of processing. In 2 experiments, voice attributes of spoken words were clearly retained in memory. Moreover, listeners were sensitive to fine-grained similarity between 1st and 2nd presentations of different-voice words, but only when words were initially encoded at relatively shallow levels of processing. The results suggest that episodic memory traces of spoken words retain the surface details typically considered as noise in perceptual systems.
Article
Full-text available
In this article the author proposes an episodic theory of spoken word representation, perception, and production. By most theories, idiosyncratic aspects of speech (voice details, ambient noise, etc.) are considered noise and are filtered in perception. However, episodic theories suggest that perceptual details are stored in memory and are integral to later perception. In this research the author tested an episodic model (MINERVA 2; D. L. Hintzman, 1986) against speech production data from a word-shadowing task. The model predicted the shadowing-response-time patterns, and it correctly predicted a tendency for shadowers to spontaneously imitate the acoustic patterns of words and nonwords. It also correctly predicted imitation strength as a function of "abstract" stimulus properties, such as word frequency. Taken together, the data and theory suggest that detailed episodes constitute the basic substrate of the mental lexicon.
Article
Full-text available
Traditional mechanistic accounts of language processing derive almost entirely from the study of monologue. Yet, the most natural and basic form of language use is dialogue. As a result, these accounts may only offer limited theories of the mechanisms that underlie language processing in general. We propose a mechanistic account of dialogue, the interactive alignment account, and use it to derive a number of predictions about basic language processes. The account assumes that, in dialogue, the linguistic representations employed by the interlocutors become aligned at many levels, as a result of a largely automatic process. This process greatly simplifies production and comprehension in dialogue. After considering the evidence for the interactive alignment model, we concentrate on three aspects of processing that follow from it. It makes use of a simple interactive inference mechanism, enables the development of local dialogue routines that greatly simplify language processing, and explains the origins of self-monitoring in production. We consider the need for a grammatical framework that is designed to deal with language in dialogue rather than monologue, and discuss a range of implications of the account.
Article
Full-text available
Following research that found imitation in single-word shadowing, this study examines the degree to which interacting talkers increase similarity in phonetic repertoire during conversational interaction. Between-talker repetitions of the same lexical items produced in a conversational task were examined for phonetic convergence by asking a separate set of listeners to detect similarity in pronunciation across items in a perceptual task. In general, a listener judged a repeated item spoken by one talker in the task to be more similar to a sample production spoken by the talker's partner than corresponding pre- and postinteraction utterances. Both the role of a participant in the task and the sex of the pair of talkers affected the degree of convergence. These results suggest that talkers in conversational settings are susceptible to phonetic convergence, which can mark nonlinguistic functions in social discourse and can form the basis for phenomena such as accent change and dialect formation.
Conference Paper
This paper examines speech rate and f0 data from the Switchboard corpus [11] to investigate how a speaker's chronological age affects the extent of their accommodation to their interlocutor. In terms of speech rate, I demonstrate that older speakers slow down less for a slow interlocutor than younger speakers do, but they also speed up more for a fast interlocutor. I argue that these effects are due to older speakers' initial speech rate being slower than younger speakers', and thus the accommodation pattern attested by older speakers is compensatory, rather than a sign of decreased willingness or ability to accommodate later in life. This explanation is corroborated by the fact that accommodation for f0 is not affected by the chronological age of the speaker.
Article
Speech interfaces are growing in popularity. Through a review of 99 research papers this work maps the trends, themes, findings and methods of empirical research on speech interfaces in the field of human–computer interaction (HCI). We find that studies are usability/theory-focused or explore wider system experiences, evaluating Wizard of Oz setups, prototypes or developed systems. Measuring task and interaction was common, as was using self-report questionnaires to measure concepts like usability and user attitudes. A thematic analysis of the research found that speech HCI work focuses on nine key topics: system speech production, design insight, modality comparison, experiences with interactive voice response systems, assistive technology and accessibility, user speech production, using speech technology for development, people's experiences with intelligent personal assistants and how user memory affects speech interface interaction. From these insights we identify gaps and challenges in speech research, notably taking into account technological advancements, the need to develop theories of speech interface interaction, growing critical mass in this domain, increasing design work and expanding research from single to multiple user interaction contexts so as to reflect current use contexts. We also highlight the need to improve measure reliability, validity and consistency, to deploy in the wild, and to reduce barriers to building fully functional speech interfaces for research. Research highlights: most papers focused on usability/theory-based or wider system experience research, with an emphasis on Wizard of Oz and developed systems; questionnaires on usability and user attitudes were often used, but few were reliable or validated; thematic analysis showed nine primary research topics; and challenges were identified in theoretical approaches and design guidelines, engaging with technological advances, multiple-user and in-the-wild contexts, critical research mass and barriers to building speech interfaces.
Article
Vowels are enhanced via vowel-space expansion in perceptually difficult contexts, including in words subject to greater lexical competition. Yet, vowel hyperarticulation often covaries with other acoustic adjustments, such as increased nasal coarticulation, suggesting that the goals of phonetic enhancement are not strictly to produce canonical phoneme realizations. This study explores phonetic enhancement by examining how speakers realize an allophonic vowel split in lexically challenging conditions. Specifically, in US English, /æ/ is raising before nasal codas, such that pre-nasal and pre-oral /æ/ are moving apart. Speakers produced monosyllabic words varying in phonological neighborhood density (ND), a measure of lexical difficulty, with CæN or CæC structure to a real listener interlocutor in an interactive task. Acoustic analyses reveal that speakers enhance pre-oral /æ/ by lowering it in Hi ND words; meanwhile, pre-nasal /æ/ in Hi ND words is produced with greater degrees of nasalization and increased diphthongization. These patterns indicate that ND-conditioned phonetic enhancement is realized in targeted ways for distinct allophones of /æ/. Results support views of hyperarticulation in which the goal is to make words, that is, segments in their contexts, as distinct as possible.
Chapter
This chapter considers common perceptions about the fundamental natures of animals, humans, and machines by exploring individuals’ perceptions of the similarities and differences among those groups of agents. Specifically, this chapter focuses on understanding how a cross-section of U.S. American adults used ontological classification to understand and construct the differences between humans, animals, and machines and to explore one of the ways in which such classification may matter: responses to the destruction of the social robot hitchBOT. A sizeable majority of participants classified humans with anthropomorphic animals, with fewer identifying humans with robots, or as fundamentally distinct from both. Thematic analysis relates participants’ reasoning to cultural constructions of the inherent nature of each entity and to the nature and level of their concern for hitchBOT.
Article
Over the past two years the Ubicomp vision of ambient voice assistants, in the form of smart speakers such as the Amazon Echo and Google Home, has been integrated into tens of millions of homes. However, the use of these systems over time in the home has not been studied in depth. We set out to understand exactly what users are doing with these devices over time through analyzing voice history logs of 65,499 interactions with existing Google Home devices from 88 diverse homes over an average of 110 days. We found that specific types of commands were made more often at particular times of day and that commands in some domains increased in length over time as participants tried out new ways to interact with their devices, yet exploration of new topics was low. Four distinct user groups also emerged based on using the device more or less during the day vs. in the evening or using particular categories. We conclude by comparing smart speaker use to a similar study of smartphone use and offer implications for the design of new smart speaker assistants and skills, highlighting specific areas where both manufacturers and skill providers can focus in this domain.
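The core of this kind of log analysis is aggregating time-stamped commands by category and time of day; a minimal pandas sketch with a hypothetical log schema (the column names are assumptions, not the study's format):

```python
import pandas as pd

# Hypothetical log: one row per voice command.
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-03-01 07:15", "2018-03-01 07:40",
        "2018-03-01 19:02", "2018-03-02 21:30",
    ]),
    "category": ["weather", "music", "music", "iot"],
})

log["hour"] = log["timestamp"].dt.hour
# Command counts per category and hour of day, as in time-of-day analyses.
counts = log.groupby(["category", "hour"]).size().unstack(fill_value=0)
print(counts)
```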
Article
This study assessed the impact of a conscious imitation goal on phonetic convergence during conversational interaction. Twelve pairs of unacquainted talkers participated in a conversational task designed to elicit between-talker repetitions of the same lexical items. To assess the degree to which the talkers exhibited phonetic convergence during the conversational task, these repetitions were used to elicit perceptual similarity judgments provided by separate sets of listeners. In addition, perceptual measures of phonetic convergence were compared with measures of articulation rates and vowel formants. The sex of the pair of talkers and a talker's role influenced the degree of phonetic convergence, and perceptual judgments of phonetic convergence were not consistently related to individual acoustic-phonetic attributes. Therefore, even with a conscious imitative goal, situational factors were shown to retain a strong influence on phonetic form in conversational interaction.
Article
This study investigates the spontaneous phonetic imitation of coarticulatory vowel nasalization. Speakers produced monosyllabic words with a vowel-nasal sequence either from dense or sparse phonological neighborhoods in shadowing and word-naming tasks. During shadowing, they were exposed to target words that were modified to have either an artificially increased or decreased degree of coarticulatory vowel nasality. Increased nasality, which is communicatively more facilitative in that it provides robust predictive information about the upcoming nasal segment, was imitated more strongly during shadowing than decreased nasality. An effect of neighborhood density was also observed only in the increased nasality condition, where high neighborhood density words were imitated more robustly in early shadowing repetition. An effect of exposure to decreased nasality was observed during post-shadowing word-naming only. The observed imitation of coarticulatory nasality provides evidence that speakers and listeners are sensitive to the details of coarticulatory realization, and that imitation need not be mediated by abstract phonological representations. Neither a communicative account nor a representational account could single-handedly predict these observed patterns of imitation. As such, it is argued that these findings support both communicative and representational accounts of phonetic imitation.
Article
This paper explores entrainment of two speaking styles, shouting and hyperarticulation, in an information-driven spoken dialog system. Both styles present difficulties for automatic speech recognition. We describe and evaluate the system's detection and reaction mechanisms for these speaking styles, which involve deploying appropriate dialog-level strategies. The three strategies tested do induce style change more effectively than the baseline of no strategy. This can translate into both better recognition and a higher chance of task success. Shouting is found to be more amenable to modification than hyperarticulation and the effect of the former on system performance is more profound.
Article
Interactive language use inherently involves a process of coordination, which often leads to matching behaviour between interlocutors in different semiotic channels. We study this process of interactive alignment from a multimodal perspective: using data from head-mounted eye-trackers in a corpus of face-to-face conversations, we measure which effect gaze fixations by speakers (on their own gestures, condition 1) and fixations by interlocutors (on the gestures by those speakers, condition 2) have on subsequent gesture production by those interlocutors. The results show there is a significant effect of interlocutor gaze (condition 2), but not of speaker gaze (condition 1) on the amount of gestural alignment, with an interaction between the conditions.
Article
Talkers automatically imitate aspects of perceived speech, a phenomenon known as phonetic convergence. Talkers have previously been found to converge to auditory and visual speech information. Furthermore, talkers converge more to the speech of a conversational partner who is seen and heard, relative to one who is just heard (Dias & Rosenblum Perception, 40, 1457-1466, 2011). A question raised by this finding is what visual information facilitates the enhancement effect. In the following experiments, we investigated the possible contributions of visible speech articulation to visual enhancement of phonetic convergence within the noninteractive context of a shadowing task. In Experiment 1, we examined the influence of the visibility of a talker on phonetic convergence when shadowing auditory speech either in the clear or in low-level auditory noise. The results suggest that visual speech can compensate for convergence that is reduced by auditory noise masking. Experiment 2 further established the visibility of articulatory mouth movements as being important to the visual enhancement of phonetic convergence. Furthermore, the word frequency and phonological neighborhood density characteristics of the words shadowed were found to significantly predict phonetic convergence in both experiments. Consistent with previous findings (e.g., Goldinger Psychological Review, 105, 251-279, 1998), phonetic convergence was greater when shadowing low-frequency words. Convergence was also found to be greater for low-density words, contrasting with previous predictions of the effect of phonological neighborhood density on auditory phonetic convergence (e.g., Pardo, Jordan, Mallari, Scanlon, & Lewandowski Journal of Memory and Language, 69, 183-195, 2013). Implications of the results for a gestural account of phonetic convergence are discussed.
Chapter
We argue that the speaker designs each utterance for specific listeners, and they, in turn, make essential use of this fact in understanding that utterance. We call this property of utterances audience design. Often listeners can come to a unique interpretation for an utterance only if they assume that the speaker designed it just so that they could come to that interpretation uniquely. We illustrate reasoning from audience design in the understanding of definite reference, anaphora, and word meaning, and we offer evidence that listeners actually reason this way. We conclude that audience design must play a central role in any adequate theory of understanding.
Article
Spontaneous phonetic imitation is the process by which a talker comes to be more similar-sounding to a model talker as the result of exposure. The current experiment investigates this phenomenon, examining whether vowel spectra are automatically imitated in a lexical shadowing task and how social liking affects imitation. Participants were assigned to either a Black talker or White talker; within this talker manipulation, participants were either put into a condition with a digital image of their assigned model talker or one without an image. Liking was measured through attractiveness rating. Participants accommodated toward vowels selectively; the low vowels /æ ɑ/ showed the strongest effects of imitation compared to the vowels /i o u/, but the degree of this trend varied across conditions. In addition to these findings of phonetic selectivity, the degree to which these vowels were imitated was subtly affected by attractiveness ratings and this also interacted with the experimental condition. The results demonstrate the labile nature of linguistic segments with respect to both their perceptual encoding and their variation in production.
Article
This study examined how perceptual sensitivity contributes to gender differences in vocal accommodation. Male and female shadowers repeated isolated words presented over headphones by male and female speakers, and male and female listeners evaluated whether accommodation occurred. Female shadowers accommodated more than males, and more to males than to female speakers, although some speakers elicited greater accommodation than others. Gender differences in accommodation emerged even when immediate social motives were minimized, suggesting that accommodation may be due, in part, to differences in perceptual sensitivity to vocal characteristics.
Article
The growth of speech interfaces and speech interaction with computer partners has made it increasingly important to understand the factors that determine users’ language choices in human-computer dialogue. We report two controlled experiments that used a picture-naming-matching task to investigate whether users in human-computer speech-based interactions tend to use the same grammatical structures as their conversational partners, and whether such syntactic alignment can impact strong default grammatical preferences. We additionally investigate whether beliefs about system capabilities that are based on partner identity (i.e. human or computer) and speech interface design cues (here, voice anthropomorphism) affect the magnitude of syntactic alignment in such interactions. We demonstrate syntactic alignment for both dative structures (e.g., give the waitress the apple vs. give the apple to the waitress), where there is no strong default preference for one or other structure (Experiment 1), and noun phrase structures (e.g., a purple circle vs. a circle that is purple), where there is a strong default preference for one structure (Experiment 2). The tendency to align syntactically was unaffected by partner identity (human vs. computer) or voice anthropomorphism. These findings have both practical and theoretical implications for HCI by demonstrating the potential for spoken dialogue system behaviour to influence users’ syntactic choices in interaction. As well as verifying natural corpora findings, this work also highlights that priming and cognitive mechanisms that are unmediated by beliefs about partner identity could be important in understanding why people align syntactically in human-computer dialogue.
Article
The interactive-alignment model of dialogue provides an account of dialogue at the level of explanation normally associated with cognitive psychology. We develop our claim that interlocutors align their mental models via priming at many levels of linguistic representation, explicate our notion of automaticity, defend the minimal role of “other modeling,” and discuss the relationship between monologue and dialogue. The account can be applied to social and developmental psychology, and would benefit from computational modeling.
Article
Two people talking, as at a crowded party, may try to conceal all or part of what they mean from overhearers. To do this, it is proposed, they need to build what they wish to conceal on a private key, a piece of information, such as an event mentioned in an earlier conversation, that is common ground for the two of them and yet not inferable by the overhearers. What they say must be designed so that it cannot be understood without knowledge of that key. As evidence for the proposal, pairs of friends were required, as part of an arrangement task, to refer to familiar landmarks while concealing these references from overhearers. As predicted, the two of them used private keys, which they often concealed even further by using certain collaborative techniques. Still, the two partners weren't always successful.
Article
This paper provides an introduction to mixed-effects models for the analysis of repeated measurement data with subjects and items as crossed random effects. A worked-out example of how to use recent software for mixed-effects modeling is provided. Simulation studies illustrate the advantages offered by mixed-effects analyses compared to traditional analyses based on quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression. Applications and possibilities across a range of domains of inquiry are discussed.
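A sketch of a model with subjects and items as crossed random effects, as described above. The surrounding literature fits these with lme4 in R; this Python translation via statsmodels variance components is an assumption for illustration, on simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_subj, n_item = 20, 12
df = pd.DataFrame([
    {"subject": s, "item": i, "condition": i % 2}
    for s in range(n_subj) for i in range(n_item)
])
# Simulated response with by-subject and by-item intercepts (crossed).
df["rt"] = (600 + 30 * df["condition"]
            + rng.normal(0, 40, n_subj)[df["subject"]]
            + rng.normal(0, 25, n_item)[df["item"]]
            + rng.normal(0, 50, len(df)))

# Crossed random effects: one dummy grouping level, with subject and
# item entering as variance components.
df["group"] = 1
model = sm.MixedLM.from_formula(
    "rt ~ condition",
    groups="group",
    re_formula="0",  # no random intercept for the dummy group
    vc_formula={"subject": "0 + C(subject)", "item": "0 + C(item)"},
    data=df,
)
print(model.fit().summary())
```

This corresponds to the lme4 formula rt ~ condition + (1 | subject) + (1 | item).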
Article
Current approaches to the development of natural language dialogue systems are discussed, and it is claimed that they do not sufficiently consider the unique qualities of man-machine interaction as distinct from general human discourse. It is concluded that empirical studies of this unique communication situation are required for the development of user-friendly interactive systems. One way of achieving this is through the use of so-called Wizard of Oz studies. The focus of the work described in the paper is on the practical execution of the studies and the methodological conclusions drawn on the basis of the authors' experience. While the focus is on natural language interfaces, the methods used and the conclusions drawn from the results obtained are of relevance also to other kinds of intelligent interfaces.
Article
Five experiments examined the extent to which speakers' alignment (i.e., convergence) on words in dialog is mediated by beliefs about their interlocutor. To do this, we told participants that they were interacting with another person or a computer in a task in which they alternated between selecting pictures that matched their 'partner's' descriptions and naming pictures themselves (though in reality all responses were scripted). In both text- and speech-based dialog, participants tended to repeat their partner's choice of referring expression. However, they showed a stronger tendency to align with 'computer' than with 'human' partners, and with computers that were presented as less capable than with computers that were presented as more capable. The tendency to align therefore appears to be mediated by beliefs, with the relevant beliefs relating to an interlocutor's perceived communicative capacity.