Individual variation in cognitive processing style predicts differences in phonetic
imitation of device and human voices
Cathryn Snyder, Michelle Cohn, and Georgia Zellou
Department of Linguistics, University of California, Davis
gzellou@ucdavis.edu
Abstract
Phonetic imitation, or implicitly matching the acoustic-
phonetic patterns of another speaker, has been empirically
associated with natural tendencies to promote successful social
communication, as well as individual differences in
personality and cognitive processing style. The present study
explores whether individual differences in cognitive
processing style, as indexed by self-reported scores from the
Autism-Spectrum Quotient (AQ) questionnaire, are linked to
the way people imitate the vocal productions of two digital device voices (i.e., Apple's Siri voices) and two human voices.
Subjects first performed a word shadowing task of human and
device voices and then completed the self-administered AQ.
We assessed imitation of two acoustic properties: f0 and
vowel duration. We find that the attention to detail and the
imagination subscale scores on the AQ mediated degree of
imitation of f0 and vowel duration, respectively. The findings yield new insight into speech production and perception mechanisms and how they interact with individual differences in cognitive processing style.
Index Terms: speech alignment, human-device interaction, individual differences
1. Introduction
Voice-AI systems, such as Apple’s Siri, are now a prevalent
type of interlocutor [1]. Humans engaging with the devices
vary along multiple dimensions: by their age, gender, and
background [2]. Yet, little work has explored variation across
individuals’ communication patterns with modern voice-AI
systems. Prior work has observed individual differences in
people's behavior patterns during human-computer and
human-robot interaction [3]. Some studies have begun
outlining factors that might predict such individual variation in
behavior during device interactions, such as the degree to
which users anthropomorphize non-human entities [4], level of experience with the device, or cognitive/personality traits (cf.
[5]). Recent work has demonstrated differences in speech
patterns towards voice-AI devices based on the gender of the
individual [6]. Understanding individual variation in human-
device interaction is important for several reasons. For one, it
can inform our scientific understanding of human-computer interaction and help develop more inclusive models of human behavior toward devices. Furthermore, it is relevant for those
interested in designing systems that can accommodate as
many users as possible (cf., [3], [7]).
As reviewed in [5], several aspects of users’ personalities
and cognitive styles have been associated with their behavior
during interactions with technology, such as spatial ability [8]
and locus of control [9]. A relevant factor that has been understudied in work on individual variation in human-device interaction
is cognitive processing style, broadly defined as the
constellation of psychological dimensions that reflect
consistencies in the processing of sensory information that
vary across individuals [10], [11]. Within a population of
individuals of the same gender, age, and background (e.g.,
education level, socioeconomic status, etc.), there is diversity
in cognitive processing style, and this has been linked to a variety of behaviors across domains, including identity formation [12], learning [13], and the processing of sensory
information [14]. Differences in cognitive processing style
have also been linked to variation in individuals’ speech
production and perception patterns (see [15] for review). In
this paper, we ask whether there is systematic variation in
individuals’ speech alignment toward voice-AI systems and
humans as a function of their cognitive processing style.
One way to measure individual differences in cognitive
processing style is with the Autism-Spectrum Quotient (AQ)
questionnaire: a self-administered non-clinical assessment
used to quantify degree of various “autistic” traits in
neurotypical adults of normal intelligence [16]. The AQ
assesses five dimensions of cognitive processing: social skills,
attention switching, attention to detail, communication, and
imagination. These AQ subscales provide insight into the multidimensional factors that influence individuals' cognitive processing styles and personality in general. Several studies
have found a link between individuals’ scores on various AQ
subscales and their performance on a range of language
processing tasks. For example, higher communication AQ
scores, i.e., more difficulty in social communication, relate to
difficulties in using prosodic information to disambiguate
distinct pragmatic meanings during sentence processing [17].
Since humans increasingly interact with devices using speech, how individual variation in cognitive processing style might be linked to people's phonetic patterns toward such devices remains an open question. Thus, of particular interest are studies
showing a link between AQ scores and patterns of speech
production and perception behavior. For example, women
with high AQ attention switching scores, i.e., inflexibility in
new situations, poor task switching and multitasking skills,
tend to perceptually compensate, i.e., show less veridical
acoustic perception, for /s/ following a rounded vowel relative
to women with low AQ attention switching scores [36].
Additional research also shows that people with overall higher
AQ scores (across subscales) are more likely to show
sensitivity to fine-grained acoustic differences and are less
likely to be influenced by higher-level lexical knowledge
during the perception of sibilants [19]. Greater sensitivity to
phonetic details in individuals with higher AQ scores, and
subscale scores, has also been linked to differences in speech
production. Yu et al. [23] showed that a higher score in the
AQ attention switching subscale positively correlates with
degree of phonetic imitation of voice onset time (VOT); this is
thought to be driven by heightened attention and phonetic sensitivity to each individual word [23].
Some prior work has linked subjects’ AQ scores to their
sensitivity to robot behavior: for example, individuals with
fewer autistic traits were more accurate in detecting whether
robot behavior was programmed or human-controlled [21] and
in interpreting a robot’s facial expression [22]. Thus, there is
evidence to suggest that individual variation in AQ scores will
predict patterns of human-computer communication in the
domain of speech.
1.1. Current study
Little prior work has examined variation across individuals’
cognitive processing styles (i.e., AQ scores) in interactions
with voice-AI. In the current study, we tested whether
individual variation in AQ subscores predicts patterns of
phonetic imitation on a single-word shadowing task of four
different interlocutors: two device voices (Apple’s Siri) and
two human voices. We focus on acoustic measures of
imitation that prior work has shown to be sensitive to individual differences, specifically duration [23] and pitch [24].
Our predictions about how individual variation in AQ sub-scores predicts patterns of imitation toward device and human voices can be framed in terms of different perspectives on the motivations of phonetic imitation in general. For one, Communication Accommodation Theory (CAT) [25], [26] proposes that imitation is a means by which people express social closeness. This has been supported empirically: for example, greater phonetic imitation is observed when the interlocutor is perceived to have more positive social attributes, e.g., a positive attitude, perceived attractiveness, similar ideologies, or social closeness [27]–[29]. We predict that if imitation is more
socially driven, as CAT suggests, differences in imitating
humans and digital devices will be borne out in differences
related to the AQ social skills sub-score, which measures
flexibility, comfort, and comprehension of social cues during
social encounters.
Another prediction is that individuals’ phonetic imitation
patterns may vary based on the imagination AQ subscale,
which relates to the ability to comprehend fictional events and
attribute human-like characteristics to non-human objects. One possibility is that imitation patterns toward the human versus device voices are mediated by the extent to which speakers anthropomorphize the voice-AI system, similar to past work demonstrating variation in personifying non-human entities [4]. This relates to theoretical frameworks of computer personification, such as the “Computers are Social Actors” (CASA) framework [30], [31], which posits that humans treat a computer
as they would a human as soon as any degree of humanity can
be detected. We ask whether AQ imagination score is related
to the “degree of humanity” speakers detect from Siri voices.
Another perspective that might be relevant to phonetic
imitation is that interactions with computer systems are driven
by functional pressures [32], [33]. This is supported by
findings that humans align with computers in ways that seem
motivated to improve mutual intelligibility and
communicative success, i.e., by choosing lexemes and
speaking at a rate that they believe the computer will
understand, e.g. [34], [35]. Thus, if imitation is more
functionally driven, differences in imitating humans and
digital devices will be borne out in differences related to the
communication subcategory of the AQ, which assesses
conversational competence and fluency skills.
Meanwhile, if imitation is more dependent on attentional
mechanisms, as argued by Gambi and Pickering [32], we
might expect that differences in imitating humans and digital
devices will be borne out in a link between degree of imitation
and scores in the AQ subscales that relate to attention (i.e.,
attention switching, attention to detail). This would align with
findings from Yu and colleagues' studies [23], [36] reporting that higher AQ attention switching scores correlate with greater phonetic imitation.
2. Methods
2.1. Stimuli
Stimuli consisted of 12 low frequency CVN target words:
bomb, sewn, vine, pun, yawn, shun, chime, shone, wane, tame,
wren, hem (mean log frequency: 1.6, range: 1.1-2.1, taken
from SUBTLEX [37]) produced by 2 real human talkers (1
female and 1 male, both native English speakers from
California) and 2 Siri voices (American female, American
male). The Siri voices were generated via the Terminal on a Mac computer, while the human voices were recorded in a sound-attenuated booth.
2.2. Participants and procedure
A total of 43 female subjects participated in the experiment.
We recruited only female subjects since [23] report that the
association between AQ subscores and imitation was greatest
for females. All subjects were native English speakers and
reported no hearing impairment. All but 3 participants reported using Siri at least once a week.
While in the lab, subjects completed the 50-question
Autism Quotient [16] that assesses individuals’ self-reported
autistic-like characteristics. The questionnaire is administered
as a pen-and-paper survey. The test consists of items that fall
into five subscales, consisting of 10 questions each: social
skill (e.g., “I find social situations easy.”), imagination (e.g.,
“If I try to imagine something, I find it very easy to create a
picture in my mind.”), attention to detail (e.g., “I often notice
small sounds when others do not.”), attention switching (e.g.,
“I find it easy to do more than one thing at once.”), and
communication (e.g. “Other people frequently tell me that
what I’ve said is impolite, even though I think it is polite.”).
Questions are worded so that half would elicit an “agree”
response and half would elicit a “disagree” response.
Participants respond to each question on a 4-point scale
(“strongly agree”, “agree”, “disagree”, “strongly disagree”).
Each response is then converted to a numeric score (1-4).
Scores can also be tabulated as 0-1 values (grouping the two agree responses and the two disagree responses together).
The total AQ score is calculated by summing the scores for all
50 questions. Each of the subscale scores is calculated by
summing the scores for the 10 questions that correspond to
each subscale trait. Table 1 presents the descriptive statistics
for the Total AQ and subscale scores, as well as basic
demographic characteristics of the participants. A high value denotes more ‘autistic’-like traits, i.e., lower imagination, lower social skills, greater difficulty in attention switching, higher attention to detail, and lower ability to communicate.
Table 1: Descriptive statistics of participant variables.

                              Mean    SD    Min   Max
Age                           19.6    1.7    18    24
AQ total                     108.2   13.4    83   133
Social skill (AQSS)           20.4    4.2    13    29
Imagination (AQIM)            19.6    3.9    13    29
Attention switching (AQAS)    24.4    4.2    17    32
Attention to detail (AQAD)    25.7    4.5    14    33
Communication (AQCM)          18.1    4.1    13    27
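The scoring described above can be sketched as follows. This is an illustrative R sketch only, assuming a data frame aq_responses with one row per participant and columns q1-q50 already converted to numeric 1-4 scores (with reverse-keyed items recoded so that higher values indicate more autistic-like traits); the item-to-subscale indices shown are hypothetical placeholders, as the actual mapping in Baron-Cohen et al. [16] is not reproduced here.

# Hypothetical item indices for each subscale (10 items each)
subscales <- list(
  AQSS = 1:10,   # social skill
  AQAS = 11:20,  # attention switching
  AQAD = 21:30,  # attention to detail
  AQCM = 31:40,  # communication
  AQIM = 41:50   # imagination
)

item_cols <- paste0("q", 1:50)

# Total AQ score: sum of all 50 item scores (each scored 1-4)
aq_responses$AQ_total <- rowSums(aq_responses[, item_cols])

# Subscale scores: sum of the 10 items assigned to each subscale
for (s in names(subscales)) {
  aq_responses[[s]] <- rowSums(aq_responses[, paste0("q", subscales[[s]])])
}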
The study began with a pre-exposure phase, where subjects
read each of the target words in isolation (presented randomly,
4 repetitions) to get their baseline productions. In the
shadowing phase, subjects were introduced to the four
interlocutors by name and picture: the two device voices (Siri,
device female; Alex, device male) and the two human voices
(Melissa, female; Carl, male). Images for the devices were two
iPhones showing different home screens (e.g., “How can I
help you today?”), while the images for the human
interlocutors were stock images. Next, subjects were told that
they would hear each of the four talkers say the word and that
their task was to simply repeat the word. Subjects were not
told explicitly to imitate. Word and model talker were
randomized. In total, subjects shadowed 96 tokens (12 words *
4 model talkers * 2 repetitions).
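As an illustration of the design (not the authors' actual presentation script), the 96-token shadowing list can be assembled in R as below; the talker labels are hypothetical placeholders.

words   <- c("bomb", "sewn", "vine", "pun", "yawn", "shun",
             "chime", "shone", "wane", "tame", "wren", "hem")
talkers <- c("siri_female", "alex_device_male",
             "melissa_human_female", "carl_human_male")

# Full crossing: 12 words x 4 model talkers x 2 repetitions = 96 tokens
trials <- expand.grid(word = words, talker = talkers, rep = 1:2,
                      stringsAsFactors = FALSE)

# Randomize presentation order of word and model talker
set.seed(1)  # arbitrary seed, for reproducibility of this sketch only
trials <- trials[sample(nrow(trials)), ]
nrow(trials)  # 96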
3. Analysis
3.1. Acoustic assessment of phonetic imitation (DID)
We measured several acoustic properties of interest for each
token produced by the subjects, as well as the productions by
the model talkers: vowel duration (logged) and mean vowel f0 (in semitones, ST). We then calculated a difference in
distance (DID) measure [38] to quantify degree of acoustic
convergence for each feature toward the model talker’s
production of that word. DID = |baseline - model| - |shadowed
- model|. A positive DID value indicates a change toward the model talker after exposure; a negative value indicates divergence from the model talker's speech.
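A minimal sketch of the DID computation in R, assuming a data frame tok with one row per shadowed token and hypothetical columns baseline, shadowed, and model holding a given acoustic feature (logged vowel duration, or mean f0 converted to semitones) for the participant's baseline production, the shadowed production, and the model talker's production of that word:

# Difference-in-distance (DID), following Pardo et al. [38]:
# DID = |baseline - model| - |shadowed - model|
# Positive values indicate convergence toward the model talker;
# negative values indicate divergence.
tok$DID <- abs(tok$baseline - tok$model) - abs(tok$shadowed - tok$model)

# Illustrative feature preparation (the semitone reference value is an assumption):
hz_to_st <- function(hz, ref = 100) 12 * log2(hz / ref)  # f0 in semitones re 100 Hz
# tok$baseline <- log(tok$baseline_dur)                   # logged vowel duration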
3.2. Statistical analyses
We modeled DID values for vowel duration (logged) and
mean f0 (ST) in two separate linear mixed effects models with
the lme4 R package [39]. Estimates for degrees of freedom, F-
statistics, and p-values were computed using the Satterthwaite approximation with the anova() function in the lmerTest package (Kuznetsova et al., 2015). Both models had identical fixed and random effects structures. Fixed effects included Model Humanness (2 levels: human vs. device) and Model Gender (2 levels: female vs. male). Exposure (2 levels: first vs. second
repetition) was also included as a fixed effect predictor. The
model also included participants’ scores for each of the five
AQ subscales (ranging from 5-40, logged) as fixed effects. An
analysis of multicollinearity using the ggpairs() function in the
GGally package [40] indicated high correlation between
individuals’ AQSS scores and their scores on the AQIM,
AQCM, and AQAS subscales. Therefore, prior to model
fitting, AQIM, AQCM, and AQAS were residualized for the
effect of AQSS. The model included all possible two- and
three-way interactions between Model Humanness x Model
Gender x each AQ subscale variable. Random effects structure
for each model included random intercepts for Participant and
Lexical Item. In addition, each model included by-Participant
random slopes for Model Gender, Model Humanness, and the
interaction between these factors. Each of the discrete predictors was sum-coded, and all continuous variables were centered and scaled.
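A sketch of this model specification, assuming a long-format data frame d with one row per shadowed token and illustrative column names (DID_dur; two-level factors humanness, gender, and exposure; participant and word identifiers; and the per-participant AQ subscale scores); logging of the AQ scores, as described above, is omitted for brevity:

library(lme4)
library(lmerTest)  # anova() with Satterthwaite approximation

# Residualize the AQ subscales that correlate with AQSS (one row per participant)
aq <- unique(d[, c("participant", "AQSS", "AQIM", "AQCM", "AQAS", "AQAD")])
aq$AQIM_r <- resid(lm(AQIM ~ AQSS, data = aq))
aq$AQCM_r <- resid(lm(AQCM ~ AQSS, data = aq))
aq$AQAS_r <- resid(lm(AQAS ~ AQSS, data = aq))
d <- merge(d, aq[, c("participant", "AQIM_r", "AQCM_r", "AQAS_r")],
           by = "participant")

# Sum-code the discrete predictors; center and scale the continuous ones
contrasts(d$humanness) <- contr.sum(2)  # human vs. device
contrasts(d$gender)    <- contr.sum(2)  # female vs. male
contrasts(d$exposure)  <- contr.sum(2)  # first vs. second repetition
num_cols <- c("AQSS", "AQIM_r", "AQCM_r", "AQAS_r", "AQAD")
d[num_cols] <- lapply(d[num_cols], function(x) as.numeric(scale(x)))

# DID model for vowel duration (the mean-f0 model is specified identically)
m_dur <- lmer(
  DID_dur ~ exposure +
    humanness * gender * (AQSS + AQIM_r + AQCM_r + AQAS_r + AQAD) +
    (1 + gender * humanness | participant) + (1 | word),
  data = d)
anova(m_dur)  # F-statistics, Satterthwaite degrees of freedom, p-values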
4. Results
4.1. Phonetic imitation of vowel duration
The model run on DID duration scores revealed significant main effects of Model Gender [F(1, 37)=23.9, p<.001]: participants converged in duration to male model talkers (mean DID = .03), while there was significantly less alignment to female model talkers (mean DID = -.002) (β=-0.01, t=-4.8, p<.001). We additionally observed a main effect of Model Humanness [F(1, 38)=22.8, p<.001]: participants showed greater convergence to human model talkers' durations (mean DID = .02), but very little alignment to device voices overall (mean DID = .002) (β=-0.009, t=-4.8, p<.001). There was also a significant interaction between Model Gender and Model Humanness [F(1, 3575.4)=61.7, p<.001]: participants converged in duration toward the human male voice most robustly (mean DID = .04), followed by the male device voice (mean DID = .006). The female human voice and the female device voice showed the smallest mean DID values (-.003 and -.002, respectively).
There was a trend towards significance for the main effect
of the residualized AQ Imagination subscale (AQIM) on
degree of duration imitation [F(1, 32.7)=3.6, p=.06], with a
negative coefficient value (β= -.008), indicating that listeners
with more autistic-like imagination traits, i.e., poorer
imagination, tended to display less convergence in vowel
duration.
The model also revealed a significant three-way
interaction between Model Gender, Model Humanness, and
the residualized AQIM subscale [F(1, 3574.7)=5.3, p<.05].
The three-way interaction is illustrated in Figure 1: at lower
AQIM scores, i.e., individuals with greater imagination skills,
there is imitation toward the human male model talker, yet
little to no convergence toward the other model talkers. As
AQIM increases, signaling poorer imagination skills,
participants display less convergence toward all the model
talkers except the female device voice, which does not change
across AQIM scores.
Figure 1: DID duration means and standard errors by AQIM
score, by Gender and Humanness of the Model Talker.
4.2. Phonetic imitation of mean vowel f0
The model run on DID mean f0 values revealed a significant
two-way interaction between Model Gender and the Attention to Detail AQ subscale (AQAD) [F(1, 36.9)=5.4, p<.05]. As
observed in the right panel of Figure 2, we see less imitation
of f0 with increasing AQAD scores for female model talkers.
Figure 2: DID mean f0 means and standard errors by AQAD
score, by Gender and Humanness of the Model Talker.
There was also a significant three-way interaction between
Model Gender, Model Humanness, and AQAD [F(1,
3541.2)=6.9, p<.01]: as AQAD increases, there is greater pitch
imitation for the male model talkers (relative to the female
model talkers), and we see even greater imitation for the
human male talker. In Figure 2, we additionally see that the
slopes are steeper for pitch imitation of human voices than for
device voices, on the basis of increasing AQAD score (though
in opposite directions). No other effects or interactions were
significant in the mean f0 DID model.
5. Discussion
In this study, we tested whether there is individual variation in
shadowed productions of digital device (e.g. Apple’s Siri) and
human voices. This study was designed to address a gap in
exploring individual variation in human-computer
interactions, cf. [3], specifically in considering subjects’
cognitive processing style [10], as measured by the self-
reported Autism Quotient (AQ) [16]. Given prior work
establishing links between cognitive processing style and
speech behavior, such as phonetic imitation [17], [23], we used
a shadowing paradigm [38].
Overall, our results reveal that individual variation in self-reported AQ subscale scores predicts differences in patterns of imitation toward human and device voices. Specifically, for
duration, we find that scores on the Imagination subscale of
the AQ predict differences in imitation of device and human
model talkers. Individuals with higher imagination ability, i.e.,
lower AQIM scores, showed greater imitation of duration, in
general. In addition, this was mediated by both gender and
humanity of the model talker: individuals with lower
imagination ability display less convergence in duration
toward all the model talkers, excluding the female device
voice. Meanwhile, individuals with greater imagination skills
display robust imitation toward the human male model talker,
and some imitation of the male device voice. At the same time, these individuals with greater imagination show little to no convergence toward the female human and female device model talkers.
Together, these findings suggest that imitation toward
human/device voices is socially mediated, in that we see
different patterns based on the apparent gender of the voices,
in line with CAT [25], [26]; but in ways that relate to an
individual's cognitive processing style. Furthermore, we see only partial overlap in subjects' imitation patterns on the basis of model talker humanness: imitation patterns for the human and device male model talkers run in parallel but differ in degree, suggesting that, across individual variation, people treat devices and humans differently, contra CASA [30], [31].
One aspect that supports this is the difference in how
individuals varying in imagination ability displayed imitation
toward the male device voice: individuals with higher
imagination displayed positive DID values toward the male
device voice, while individuals with lower imagination
showed divergence away from the male device voice. In other
words, individuals with higher imagination were most likely to
imitate a device voice. We do see similarities in imitation of the female device and human voices. Yet, prior work has shown greater convergence toward male talkers [41]; thus, differences in imitation toward device and human voices may be realized more strongly for male voices, where the ceiling for convergence is higher.
Our results additionally reveal differences in mean pitch
(f0) imitation on the basis of AQ subscales. In particular, we
see a relationship between scores on the Attention to Detail subscale (AQAD), where higher scores indicate greater attention to detail, and degree of imitation, mediated by model talker humanness and gender. Subjects show greater f0 imitation for male talkers as attention to detail increases, and this increase is even larger when shadowing the human male talker. That we see differences
based on an attentional measure is in line with proposals that
imitation is attentionally mediated [32]. Our results additionally support proposals that imitation is socially mediated: as with duration, we see different patterns of f0 imitation based on the gender of the model talker. In general, then, the finding that imitation is mediated by the gender of the model talker is in line with CAT accounts of socially mediated phonetic imitation.
At the same time, we see similar patterns of f0 imitation within each gender for the human and device voices: a general decline for female talkers and a general increase for male talkers (as AQAD score increases). These similarities
are supportive of theories of computer personification, e.g.
CASA, in that speech production patterns toward human and
device voices are similar. Yet, we observe differences in the
steepness of the function within these categories: subjects
produce a steeper decline/incline for human model talkers.
In sum, an individual’s cognitive processing style seems to
influence how they treat humans and digital devices.
Furthermore, these findings suggest that there is variation in
human-device interaction. The results of this study have
implications for models of human-device interaction. In
particular, our findings raise questions as to the “automaticity”
by which we treat computers as people, as proposed by
CASA. We see individuals with different cognitive profiles
demonstrating variation in the extent to which they
“personify” digital device voices, suggesting that human-
computer communication patterns are more complex than
originally theorized. We propose an extension to the CASA
theoretical framework, wherein the degree of computer
personification is mediated by social-cognitive mechanisms,
including speaker characteristics.
6. Acknowledgments
This work was partially funded by an Amazon Faculty
Research Award and an Artificial Intelligence and Healthcare
Seed Grant from UCD to GZ.
7. References
[1] M. B. Hoy, “Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants,” Med. Ref. Serv. Q., vol. 37, no. 1, pp. 81–88, 2018.
[2] D. C. Plummer et al., “Top strategic predictions for 2017 and beyond: Surviving the storm winds of digital disruption,” Gartner report G00315910, Gartner, Inc., 2016.
[3] D. E. Egan, “Chapter 24 - Individual Differences in Human-Computer Interaction,” in Handbook of Human-Computer Interaction, M. Helander, Ed. Amsterdam: North-Holland, 1988, pp. 543–568.
[4] A. Waytz, J. Cacioppo, and N. Epley, “Who Sees Human?: The Stability and Importance of Individual Differences in Anthropomorphism,” Perspect. Psychol. Sci., vol. 5, no. 3, pp. 219–232, May 2010.
[5] N. M. Aykin and T. Aykin, “Individual differences in human-computer interaction,” Comput. Ind. Eng., vol. 20, no. 3, pp. 373–379, Jan. 1991.
[6] Anonymous, “Imitating Siri: Socially-mediated alignment to device and human voices,” Proc. ICPhS, to appear.
[7] M. Schmettow and J. Havinga, “Are Users More Diverse Than Designs?: Testing and Extending a 25 Years Old Claim,” in Proceedings of the 27th International BCS Human Computer Interaction Conference, Swinton, UK, 2013, pp. 40:1–40:5.
[8] K. J. Vicente and R. C. Williges, “Accommodating individual differences in searching a hierarchical file system,” Int. J. Man-Mach. Stud., vol. 29, no. 6, pp. 647–668, Jan. 1988.
[9] M. D. Coovert and M. Goldstein, “Locus of Control as a Predictor of Users’ Attitude toward Computers,” Psychol. Rep., vol. 47, no. 3_suppl, pp. 1167–1173, Dec. 1980.
[10] L. J. Ausburn and F. B. Ausburn, “Cognitive styles: Some information and implications for instructional design,” ECTJ, vol. 26, no. 4, pp. 337–354, Dec. 1978.
[11] S. Messick et al., Individuality in Learning. Oxford, England: Jossey-Bass, 1976.
[12] M. D. Berzonsky, “Identity formation: The role of identity processing style and cognitive processes,” Personal. Individ. Differ., vol. 44, no. 3, pp. 645–655, Feb. 2008.
[13] L. Price, “Individual Differences in Learning: Cognitive control, cognitive style, and learning style,” Educ. Psychol., vol. 24, no. 5, pp. 681–698, Sep. 2004.
[14] R. J. Riding, “On the Nature of Cognitive Style,” Educ. Psychol., vol. 17, no. 1–2, pp. 29–49, Mar. 1997.
[15] A. C. L. Yu and G. Zellou, “Individual Differences in Language Processing: Phonology,” Annu. Rev. Linguist., vol. 5, no. 1, pp. 131–150, 2019.
[16] S. Baron-Cohen, S. Wheelwright, R. Skinner, J. Martin, and E. Clubley, “The Autism-Spectrum Quotient (AQ): Evidence from Asperger Syndrome/High-Functioning Autism, Males and Females, Scientists and Mathematicians,” J. Autism Dev. Disord., vol. 31, no. 1, pp. 5–17, Feb. 2001.
[17] S.-A. Jun and J. Bishop, “Priming Implicit Prosody: Prosodic Boundaries and Individual Differences,” Lang. Speech, vol. 58, no. 4, pp. 459–473, Dec. 2015.
[18] Z. Yu, “Attention and engagement aware multimodal conversational systems,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 593–597.
[19] A. Yu, J. Grove, M. Martinovic, and M. Sonderegger, “Effects of Working Memory Capacity and ‘Autistic’ Traits on Phonotactic Effects in Speech Perception,” Proc. Int. Congr. Phon. Sci. XVII, pp. 2236–2239, 2011.
[20] Z. Yu, D. Gerritsen, A. Ogan, A. Black, and J. Cassell, “Automatic prediction of friendship via multi-modal dyadic features,” in Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 51–60.
[21] A. Wykowska, J. Kajopoulos, K. Ramirez-Amaro, and G. Cheng, “Autistic traits and sensitivity to human-like features of robot behavior,” Interact. Stud., vol. 16, no. 2, pp. 219–248, Jan. 2015.
[22] P. E. McKenna, A. Ghosh, R. Aylett, F. Broz, I. Keller, and G. Rajendran, “Robot Expressive Behaviour and Autistic Traits,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Richland, SC, 2018, pp. 2239–2241.
[23] A. C. L. Yu, C. Abrego-Collier, and M. Sonderegger, “Phonetic Imitation from an Individual-Difference Perspective: Subjective Attitude, Personality and ‘Autistic’ Traits,” PLOS ONE, vol. 8, no. 9, p. e74746, Sep. 2013.
[24] A. Bonnel, L. Mottron, I. Peretz, M. Trudel, E. Gallun, and A.-M. Bonnel, “Enhanced Pitch Sensitivity in Individuals with Autism: A Signal Detection Analysis,” J. Cogn. Neurosci., vol. 15, no. 2, pp. 226–235, Feb. 2003.
[25] H. Giles, N. Coupland, and J. Coupland, “Accommodation theory: Communication, context, and consequence,” in Contexts of Accommodation: Developments in Applied Sociolinguistics, vol. 1, 1991.
[26] H. Giles and S. C. Baker, “Communication accommodation theory,” in The International Encyclopedia of Communication, 2008.
[27] M. Babel, “Evidence for phonetic and social selectivity in spontaneous phonetic imitation,” J. Phon., vol. 40, no. 1, pp. 177–189, 2012.
[28] J. S. Pardo, “On phonetic convergence during conversational interaction,” J. Acoust. Soc. Am., vol. 119, no. 4, pp. 2382–2393, 2006.
[29] M. Natale, “Convergence of mean vocal intensity in dyadic communication as a function of social desirability,” J. Pers. Soc. Psychol., vol. 32, no. 5, pp. 790–804, 1975.
[30] C. Nass, J. Steuer, and E. R. Tauber, “Computers are social actors,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1994, pp. 72–78.
[31] C. Nass and Y. Moon, “Machines and mindlessness: Social responses to computers,” J. Soc. Issues, vol. 56, no. 1, pp. 81–103, 2000.
[32] C. Gambi and M. J. Pickering, “Prediction and imitation in speech,” Front. Psychol., vol. 4, 2013.
[33] M. J. Pickering and S. Garrod, “Alignment as the basis for successful communication,” Res. Lang. Comput., vol. 4, no. 2–3, pp. 203–228, 2006.
[34] H. P. Branigan, M. J. Pickering, J. Pearson, J. F. McLean, and A. Brown, “The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers,” Cognition, vol. 121, no. 1, pp. 41–57, 2011.
[35] L. Bell, “Linguistic Adaptations in Spoken Human-Computer Dialogues: Empirical Studies of User Behavior,” Institutionen för talöverföring och musikakustik, 2003.
[36] A. C. L. Yu, “Perceptual Compensation Is Correlated with Individuals’ ‘Autistic’ Traits: Implications for Models of Sound Change,” PLOS ONE, vol. 5, no. 8, p. e11950, Aug. 2010.
[37] M. Brysbaert and B. New, “Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English,” Behav. Res. Methods, vol. 41, no. 4, pp. 977–990, 2009.
[38] J. S. Pardo, K. Jordan, R. Mallari, C. Scanlon, and E. Lewandowski, “Phonetic convergence in shadowed speech: The relation between acoustic and perceptual measures,” J. Mem. Lang., vol. 69, no. 3, pp. 183–195, 2013.
[39] D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting Linear Mixed-Effects Models Using lme4,” J. Stat. Softw., vol. 67, no. 1, pp. 1–48, Oct. 2015.
[40] B. Schloerke et al., GGally: Extension to ggplot2, R package version 0.5.0, 2014.
[41] J. S. Pardo, R. Gibbons, A. Suppes, and R. M. Krauss, “Phonetic convergence in college roommates,” J. Phon., vol. 40, no. 1, pp. 190–197, 2012.