An experimental setup for capturing multimodal accommodation using dual
electromagnetic articulography, audio, and video
Lena Pagel1, Simon Roessig2, Doris Mücke1
1University of Cologne, Germany
2University of York, United Kingdom
lena.pagel@uni-koeln.de, simon.roessig@york.ac.uk, doris.muecke@uni-koeln.de
Abstract
When engaging in a conversation, interlocutors frequently
accommodate to each other in their speech patterns and co-
speech movements. However, only a small number of studies
have investigated both domains in a multimodal approach. An
additional challenge for studies is accounting for information
structure, which not only influences the production of speech
and co-speech motion in a speaker but also affects the patterns
of accommodation between speakers. Due to the high
complexity of the required experimental design, it has not yet
been comprehensively studied whether speakers accommodate
to each other in their strategies of encoding information
structure. This paper presents a methodological approach for
capturing multimodal focus marking patterns in dyads, which
allows us to address this research question. We introduce DiCE, a
cooperative game to elicit lexically and prosodically controlled
data in German, and present details of the experimental setup
involving dual EMA, audio, and video.
Keywords: electromagnetic articulography, dual EMA,
accommodation, multimodality, focus structure.
1. Introduction
Previous research has demonstrated that speakers frequently
accommodate to their interlocutors’ speech patterns and speech-
accompanying movements. There is evidence for convergence
of, e.g., head motion (Hale et al. 2020), manual gestures (Mol
et al. 2012), and postural sway (Shockley, Santana, & Fowler
2003). Interlocutors may also accommodate in terms of
intonation (Babel & Bulatov 2012), speaking rate and phrasing
(Cummins 2002), as well as acoustic properties of vowels and
consonants (Pardo et al. 2012; Nielsen 2011). Leveraging recent
technological advancements, a limited number of studies have
used dual electromagnetic articulography (dual EMA) to
elucidate accommodation in supra-laryngeal speech kinematics,
reporting results for jaw, lip, and tongue movements (Lee et al.
2018; Mukherjee et al. 2018; Tiede & Mooshammer 2013).
However, to date only a few studies have integrated the multiple
modalities of accommodation within a single experimental
setting (but cf. Duran & Fusaroli 2017; Louwerse et al.
2012; Oben & Brône 2016). To our knowledge, only one study
has analysed multimodal accommodation using dual EMA, with
results presented for only one dyad (Tiede et al. 2010).
One factor that markedly shapes a speaker’s production of
speech and co-speech movements is information structure
(Ladd 2008; Wagner, Malisz & Kopp 2014). Depending on,
e.g., the focus structure of an utterance, a particular word may
be produced with a larger F0 rise, a more distinct tongue
articulation, and/or a more pronounced head nod. Typically, we
deal with a high amount of speaker-specific variability in
patterns of focus marking. This complicates the design of
studies on accommodation because comparing words that occur
under different structural circumstances may confound the
results. Furthermore, it has been shown that words are more
sensitive to interpersonal accommodation when they occur in
prosodically salient positions (Lee et al. 2018). This
underscores the potential of including controlled information
structure in experimental designs, particularly in those targeting
accommodation. What remains an open area for investigation is
whether these patterns of marking information structure (or
more specifically, focus) are themselves subject to interpersonal
accommodation. We aim to address this question in the future
through an analysis of the recorded data set described below.
In this paper, we present a comprehensive methodological
approach to capturing multimodal accommodation across
various acoustic, visual, and kinematic levels of speech
production. With this, we hope to contribute valuable insights
for future studies sharing similar research goals. We introduce
DiCE (Dialogic Collecting Expedition), a cooperative card
game in German that provides a natural context to elicit speech
material controlled for segmental context of target words and
focus structure of utterances. We provide practical information
on the experimental setup with dual EMA, audio, and video.
The method makes it possible to capture multimodal accommodation of
focus encoding patterns within dyads. We have successfully
applied it in recordings of 15 dyads of German native speakers.
2. Methods
The complete game material for DiCE is publicly available for
future use at https://osf.io/9fmqh/.
2.1. Technical set-up and procedure
EMA and audio recordings are conducted using two 3D
electromagnetic articulographs (Carstens AG501 and AG501
Twin) and two head-mounted condenser microphones
(MicroMic C544 L, connected to a MicroMic MPA V L
phantom adapter). The technical setup is schematically
illustrated in Figure 1. Each of the two articulographs is
connected to its own SyBox (Carstens SyBox2); the two SyBoxes are
interlinked and connected to an audio interface (Tascam US4x4).
Additionally, each articulograph is linked to its own recording
laptop, and these laptops are interconnected (via a router that
does not access the internet) to allow for data transfer. The two
microphones are plugged into the same interface as the two
articulographs, which enables a temporal synchronisation of the
signal streams. The interface is connected to the recording
laptop associated with EMA1, where each recording sweep is
initiated. The EMA signal is simultaneously recorded on both
articulographs at 1250 Hz and then downsampled to 250 Hz. The
audio is recorded at 48 kHz with a bit depth of 16.
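The downsampling is performed by the recording software; purely for illustration, a minimal Python sketch of an equivalent offline step is given below, assuming the position data are available as a NumPy array (array layout and function name are our own assumptions).

```python
import numpy as np
from scipy.signal import decimate

FS_RAW = 1250               # EMA acquisition rate in Hz
FS_OUT = 250                # target rate in Hz
FACTOR = FS_RAW // FS_OUT   # integer decimation factor (5)

def downsample_ema(positions: np.ndarray) -> np.ndarray:
    """Downsample EMA position data from 1250 Hz to 250 Hz.

    `positions` is assumed to have shape (n_samples, n_channels), e.g. the
    x/y/z coordinates of all sensors stacked column-wise. decimate() applies
    an anti-aliasing low-pass filter before keeping every 5th sample.
    """
    return decimate(positions, FACTOR, axis=0, zero_phase=True)

# Minimal usage example with synthetic data (2 s, 3 channels):
t = np.arange(2 * FS_RAW) / FS_RAW
fake = np.stack([np.sin(2 * np.pi * 4 * t), np.cos(2 * np.pi * 2 * t), t], axis=1)
print(fake.shape, "->", downsample_ema(fake).shape)   # (2500, 3) -> (500, 3)
```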
Figure 1: Recording setup for EMA and audio.
EMA sensors are attached to both speakers on the torso (both
shoulders, chest, and spine), the head (forehead, eyebrows, and
behind both ears), and the articulators (lower jaw, upper and
lower lip, both corners of the mouth, tongue tip, and tongue
dorsum), as illustrated in Figure 2.
Figure 2: Illustration of EMA sensor placements.
In addition to EMA and audio recordings, videos are captured
using three cameras strategically positioned in the room: Two
cameras are placed on the table between the two articulographs,
each directed at one participant (2x GoPro Hero9 Black), and
one camera is positioned to capture both participants together
from the side (Panasonic HC-V520). The videos are recorded at
a resolution of 1080p, a frame rate of 50 fps, and a constant
shutter speed of 1/100 s. Post-hoc synchronisation of the videos
to the EMA and audio recordings is achieved using the auditory
signal from a clapperboard at the beginning of each recording
sweep.
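In practice, this alignment is done in a video editor (cf. Section 2.4); the sketch below merely illustrates the underlying principle of estimating the offset between a camera’s audio track (extracted to WAV, e.g. with ffmpeg) and the reference microphone recording via cross-correlation around the clap. File names and the helper function are hypothetical.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def estimate_offset(ref_wav: str, cam_wav: str, search_s: float = 10.0) -> float:
    """Estimate how many seconds the camera audio lags the reference audio.

    Both files are assumed to contain the clap within the first `search_s`
    seconds and to share the same sampling rate.
    """
    fs_r, ref = wavfile.read(ref_wav)
    fs_c, cam = wavfile.read(cam_wav)
    assert fs_r == fs_c, "resample one of the signals first"
    if ref.ndim > 1:            # keep only the first channel of stereo files
        ref = ref[:, 0]
    if cam.ndim > 1:
        cam = cam[:, 0]
    n = int(search_s * fs_r)
    ref = ref[:n].astype(float)
    cam = cam[:n].astype(float)
    xc = correlate(cam, ref, mode="full")
    lag = int(np.argmax(np.abs(xc))) - (len(ref) - 1)
    return lag / fs_r           # positive: the camera audio starts later

# offset = estimate_offset("mic_sweep01.wav", "gopro_sweep01.wav")  # hypothetical files
# The video would then be trimmed/shifted by `offset` seconds.
```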
Each recording lasts between three and four hours, including
preparations and breaks. The recordings consist of two main
parts: one solo condition per participant and one dialogue
condition with both participants (cf. Figure 3). In the solo
condition, each speaker is recorded individually in a simplified
digital version of the experimental game. After both speakers
have completed their solo condition, they are introduced to each
other and cooperatively play the experimental game in the
dialogue condition.
Figure 3: Scheme of recording procedure with two
participants (A and B) in two rooms. Prep. I includes
informed consent, attachment of EMA sensors, and
instructions for the solo condition; prep. II involves a
break and instructions for the dialogue condition.
2.2. Speech material
The speech material produced in the experiment is highly
controlled both lexically and prosodically. To allow for an
analysis of supra-laryngeal articulation using EMA, eight
carefully chosen target words are included, comprising four
objects and four cities (cf. Table 1). Their lexically stressed
penultimate syllable, which occurs in a controlled segmental
context, can be used for analyses. As objects, we selected
existing words in German. As cities, we selected two words for
existing cities (Medina, Manila), one word for an existing city
borrowed from Italian (Milano), and one pseudoword (Benali).
Participants in our recordings encountered no difficulties in
producing these words, nor did they report any unease.
Table 1: Target words.
objects                          cities
Bohne ‘bean’      [ˈboːnə]       Medina   [meˈdiːna]
Mode ‘fashion’    [ˈmoːdə]       Manila   [maˈniːla]
Vase ‘vase’       [ˈvaːzə]       Milano   [miˈlaːno]
Made ‘maggot’     [ˈmaːdə]       Benali   [beˈnaːli]
These target words are embedded in consistent question-answer
sets in German, which the two participants produce. Each set
contains two questions and two answers (cf. Table 2). Speakers
are instructed to consistently adhere to the lexical structure of
the carrier sentence and only replace the object and/or city in the
utterance. This approach ensures that the corpus includes a
substantial amount of lexically consistent speech material.
Table 2: Exemplary question-answer set.
Q1   Habe ich die Bohne aus Medina auf der Hand?
     ‘Am I holding the bean from Medina?’
A1   Du hast die Mode aus Medina auf der Hand.
     ‘You are holding the fashion from Medina.’
Q2   Wo?
     ‘Where?’
A2   Da.
     ‘There.’
Utterance A1 is of particular interest, since its information
structure is controlled. Based on the preceding question Q1, it
is produced with one of three possible focus structures: either
(i) the object is in corrective focus and the city is in the
background, as in the example in Table 2, (ii) the object is in
the background and the city is in corrective focus, or (iii) both
the object and city are in corrective focus.
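For later annotation of the corpus, the focus condition realised in each A1 token can be labelled automatically from the object and city substitutions relative to Q1. The following minimal Python sketch illustrates this; the function name and label strings are our own choices, not part of the released game material.

```python
def focus_condition(q_object: str, q_city: str,
                    a_object: str, a_city: str) -> str:
    """Label the focus structure of A1 relative to the preceding Q1.

    (i)   object substituted, city kept  -> corrective focus on the object
    (ii)  object kept, city substituted  -> corrective focus on the city
    (iii) both substituted               -> corrective focus on both
    """
    object_new = a_object != q_object
    city_new = a_city != q_city
    if object_new and city_new:
        return "focus_object+city"
    if object_new:
        return "focus_object"
    if city_new:
        return "focus_city"
    return "no_substitution"   # disallowed by the game rules

# Example from Table 2: Q1 asks about the bean (Bohne) from Medina,
# A1 answers with the fashion (Mode) from Medina -> corrective focus on the object.
print(focus_condition("Bohne", "Medina", "Mode", "Medina"))   # focus_object
```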
2.3. Task
The speech material is obtained through a custom-designed card
game called DiCE (Dialogic Collecting Expedition). In the
dialogue condition, the game is played cooperatively by the two
participants, while in the solo condition, a simplified digital
version is played by each participant separately in front of a
screen. In the game narrative, the subjects assume the roles of
collectors who have discovered a basement filled with valuable
items. These items are the target words, i.e. the four objects
from the four cities described above (cf. Table 1). They are
represented by 16 playing cards (4 objects × 4 cities). The
participants’ objective in the game is to organise the cards based
on the objects’ values and origins. They succeed when they have
collectively arranged the cards into four piles in the middle of
the table: one pile for each city of origin, with the four objects
in ascending order from value one to four.
In the dialogue condition, each participant has three cards on a
stand in front of them, which are positioned in a way that allows
them to see only the other participant’s cards, not their own.
Through the question-answer sets (cf. Table 2), each participant
aims to identify their own cards. One participant asks if they are
holding a specific card, and their interlocutor replies. Then, they
ask where, and their interlocutor replies and points to the
intended card. When the participant finds a suitable card in their
hand, they place it on the table, contributing to the incremental
and cooperative ordering of the collection. A photo of the game
in the dialogue condition is shown in Figure 4.
Figure 4: Two participants sitting underneath
articulographs, playing the dialogic game.
Crucially, the task is complicated through constraints on the
cards participants are allowed to mention in their responses.
These constraints are designed to elicit the three focus structures
for utterance A1 (cf. Table 2), as described before. They are
introduced as strict communication rules within the community
of collectors. Participants are prohibited from answering with a
simple “yes” or “no” to their interlocutor’s question Q1. Instead,
they are required to refer to a different card than the one asked
about, but always one genuinely present on their interlocutor’s
stand. This can be accomplished by substituting either the word
for the object or the city mentioned in the question. For instance,
if one participant asks if they have the vase from Milano, the
other speaker can, e.g., refer to the bean from Milano, or to the
vase from Benali, or, if no suitable card is present, to the bean
from Benali. In this manner, speakers produce one of the three
intended focus structures in the target utterance A1. The two
participants take turns and have the freedom to choose which
card to ask about and which one to refer to in their response,
within the given rules. Penalty points are assigned in cases
where they refer to a false card or fail to adhere to the specified
lexical structures. In total, six rounds of the game are played in
the dialogue condition. The number of question-answer sets
produced in one round varies between dyads and rounds.
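As an illustration of this constraint, the sketch below checks whether a given A1 response is legal: it must name a card other than the one asked about, and that card must actually be present on the asker’s stand. The data structures and the example hand are hypothetical.

```python
Card = tuple[str, str]   # (object, city)

def is_legal_answer(answer: Card, asked: Card, visible_stand: set[Card]) -> bool:
    """Check an A1 response against the game rules.

    `asked` is the card mentioned in Q1; `visible_stand` holds the cards the
    answering player sees on the asker's stand. A legal answer names a card
    other than the asked one that is genuinely present on that stand;
    otherwise a penalty point is assigned.
    """
    return answer != asked and answer in visible_stand

# The asker holds three cards and asks about the vase from Milano:
stand = {("Vase", "Milano"), ("Bohne", "Milano"), ("Bohne", "Benali")}
print(is_legal_answer(("Bohne", "Milano"), ("Vase", "Milano"), stand))   # True  (object substituted)
print(is_legal_answer(("Vase", "Benali"), ("Vase", "Milano"), stand))    # False (card not on the stand)
```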
In the solo condition, where only one speaker is present in the
room, they are seated in front of a screen and engage with a
simplified digital version of the game (cf. Figure 5). In this
setup, the participant exclusively responds to questions and does
not initiate questions themselves. To present the stimuli,
OpenSesame (Mathôt, Schreij & Theeuwes 2012) is used. For
each question-answer set, three cards along with the question
Q1 are displayed on the screen. The participant produces their
response A1. Subsequently, the screen displays the question Q2,
and the speaker points towards the intended card while
producing their response A2. In total, each participant is
prompted to produce 76 answers for both A1 and A2 (4 objects
× 4 cities × 2 focus conditions × 2 renditions + 12 additional
trials), with randomisation of trial order applied per participant.
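A randomised trial list of this size could be generated as in the sketch below; the labels of the two focus conditions and the content of the 12 additional trials are assumptions made for illustration, as they are not specified further here.

```python
import random
from itertools import product

OBJECTS = ["Bohne", "Mode", "Vase", "Made"]
CITIES = ["Medina", "Manila", "Milano", "Benali"]
FOCUS = ["focus_object", "focus_city"]   # placeholder labels for the two focus conditions
RENDITIONS = 2
N_EXTRA = 12                             # additional trials; their content is assumed here

def build_trial_list(seed: int) -> list[tuple[str, str, str]]:
    """Build one participant's randomised solo-condition trial list:
    4 objects x 4 cities x 2 focus conditions x 2 renditions = 64 trials,
    plus 12 additional trials, i.e. 76 trials in total."""
    core = [trial for trial in product(OBJECTS, CITIES, FOCUS)
            for _ in range(RENDITIONS)]
    rng = random.Random(seed)            # per-participant randomisation
    extra = rng.sample(list(product(OBJECTS, CITIES, FOCUS)), k=N_EXTRA)
    trials = core + extra
    rng.shuffle(trials)
    return trials

print(len(build_trial_list(seed=1)))     # 76
```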
Figure 5: One participant sitting underneath the
articulograph, playing the game in the solo condition.
2.4. Corpus example
For our data set, 30 participants were recorded with the
described methods at IfL Phonetics, University of Cologne,
Germany. The participants ranged in age from 18 to 36 years
(mean: 24.67, SD: 4.51) and had grown up in Germany with
German as their native language. Six of the participants were
bilingual, having at least one additional native language, but
German was reported as the dominant language for all speakers.
17 of the subjects were female, 12 male and one non-binary. The
participants were naive to the purpose of the experiment and did
not possess advanced knowledge of phonetic sciences. While
some participants mentioned the ability to speak a German
dialect, they all spoke accent-free standard German during the
experiment. For each recording, two participants were paired
into a dyad with no constraints other than not being previously
familiar with each other, resulting in a total of 15 dyads.
Following the data recording, several processing steps were
required before the analysis of the data set. The video files were
synchronised with the microphone-recorded audio files using
the auditory signal from the clap, and they were trimmed to the
same length using DaVinci Resolve. The audio files were
transcribed with automatic speech recognition using OpenAI
Whisper and manually checked for errors. Then, they were
annotated based on the transcript using the Montreal Forced
Aligner (McAuliffe et al. 2017), with manual corrections and
the addition of further annotation tiers. EMA files were
processed using the ema2wav converter (Buech et al. 2022).
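The sketch below outlines the two automatic steps of this pipeline in Python; the Whisper model size and the names of the MFA dictionary and acoustic model are placeholders, since they are not reported above.

```python
import subprocess
import whisper   # openai-whisper

def transcribe(wav_path: str) -> str:
    """First-pass transcript with Whisper; the output was checked manually."""
    model = whisper.load_model("medium")               # model size is an assumption
    result = model.transcribe(wav_path, language="de")
    return result["text"]

def align(corpus_dir: str, out_dir: str) -> None:
    """Force-align the checked transcripts with the Montreal Forced Aligner."""
    subprocess.run(
        ["mfa", "align", corpus_dir,
         "german_mfa", "german_mfa",                   # dictionary and acoustic model are placeholders
         out_dir],
        check=True,
    )

# transcript = transcribe("dyad01_sweep05_speakerA.wav")   # hypothetical file name
# align("corpus/dyad01", "aligned/dyad01")
```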
An example of the multimodal data set that we recorded with
the presented methods is illustrated in Figure 6, showcasing an
exemplary question-answer set from the dialogue condition of
one dyad. Four parameters (namely lip aperture, vertical tongue,
head, and eyebrow motion) are selected from the extensive array
of possible supra-laryngeal and co-speech kinematics, aiming to
exemplify the nature of the recorded multimodal data.
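For illustration, two of these parameters can be computed from the synchronised sensor trajectories as in the sketch below; the array layout, channel names, and choice of reference sensor are our own assumptions.

```python
import numpy as np

def lip_aperture(upper_lip: np.ndarray, lower_lip: np.ndarray) -> np.ndarray:
    """Lip aperture as the Euclidean distance between the upper-lip and
    lower-lip sensors; both inputs have shape (n_samples, 3)."""
    return np.linalg.norm(upper_lip - lower_lip, axis=1)

def vertical_motion(sensor: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Vertical displacement of a sensor (e.g. tongue dorsum or eyebrow)
    relative to a reference sensor, assuming the vertical axis is the
    last column of the head-corrected position data."""
    return sensor[:, -1] - reference[:, -1]

# With position arrays for one sweep (250 Hz after downsampling):
# la  = lip_aperture(pos["ulip"], pos["llip"])              # channel names are assumptions
# tdz = vertical_motion(pos["tdorsum"], pos["forehead"])    # vertical tongue dorsum motion
# ebz = vertical_motion(pos["eyebrow"], pos["forehead"])    # vertical eyebrow motion
```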
3. Discussion and conclusion
We introduce a methodological approach for capturing
multimodal accommodation, providing practical details on the
technical setup, procedure, speech material, and task. Through
the cooperative card game DiCE (Dialogic Collecting
Expedition), lexically and prosodically controlled German
speech material is elicited within an engaging scenario, and is
recorded with dual 3D electromagnetic articulography, audio,
and video. The synchronisation of multiple data streams makes
it possible to analyse a wide range of parameters in the auditory
and visual modalities, potentially shedding new light on their
spatiotemporal interrelations. Through this novel approach, it
can be investigated in a fine-grained manner whether
multimodal patterns of focus encoding are subject to
interpersonal accommodation.
4. Acknowledgements
The authors would like to thank Theo Klinker, Tabea Thies,
Katinka Wüllner, and Elisa Herbig for their help with the
recordings. This work was supported by the German Research
Foundation (DFG) as part of the SFB1252 “Prominence in
Language” (Project-ID 281511265, project A04) and the
a.r.t.e.s. Graduate School for the Humanities Cologne.
5. References
Babel, M., & Bulatov, D. (2012). The Role of Fundamental Frequency in Phonetic Accommodation. Language and Speech, 55(2), 231–248.
Buech, P., Roessig, S., Pagel, L., Hermes, A., & Mücke, D. (2022). ema2wav: doing articulation by Praat. Proceedings of Interspeech, 18-22 September 2022, Incheon, Korea, 1352–1356.
Cummins, F. (2002). On synchronous speech. Acoustic Research Letters Online, 3(1), 7–11.
Duran, N. D., & Fusaroli, R. (2017). Conversing with a devil’s advocate: Interpersonal coordination in deception and disagreement. PLoS ONE, 12(6), e0178140, 1–25.
Hale, J., Ward, J. A., Buccheri, F., Oliver, D., & Hamilton, A. F. d. C. (2020). Are You on My Wavelength? Interpersonal Coordination in Dyadic Conversations. Journal of Nonverbal Behavior, 44, 63–83.
Ladd, R. D. (2008). Intonational Phonology. Cambridge Univ. Press.
Lee, Y., Danner, S. G., Parrell, B., Lee, S., Goldstein, L., & Byrd, D. (2018). Articulatory, acoustic, and prosodic accommodation in a cooperative maze navigation task. PLoS ONE, 13(8), e0201444.
Louwerse, M. M., Dale, R., Bard, E. G., & Jeuniaux, P. (2012). Behavior matching in multimodal communication is synchronized. Cognitive Science, 36, 1404–1426.
Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-source, graphical experiment builder for the social sciences. Behavior Research Methods, 44(2), 314–324.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: trainable text-speech alignment using Kaldi. Proceedings of Interspeech, 20-24 August, Stockholm, Sweden, 498–502.
Mol, L., Krahmer, E., Maes, A., & Swerts, M. (2012). Adaptation in gesture: Converging hands or converging minds? Journal of Memory and Language, 66, 249–264.
Mukherjee, S., Legou, T., Lancia, L., Hilt, P., Tomassini, A., Fadiga, L., D’Ausilio, A., Badino, L., & Nguyen, N. (2018). Analyzing vocal tract movements during speech accommodation. Proceedings of Interspeech, 2-6 September, Hyderabad, India, 561–565.
Nielsen, K. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39, 132–142.
Oben, B., & Brône, G. (2016). Explaining interactive alignment: A multimodal and multifactorial account. Journal of Pragmatics, 104, 32–51.
Pardo, J. S., Gibbons, R., Suppes, A., & Krauss, R. M. (2012). Phonetic convergence in college roommates. Journal of Phonetics, 40, 190–197.
Shockley, K., Santana, M.-V., & Fowler, C. A. (2003). Mutual interpersonal postural constraints are involved in cooperative conversation. Journal of Experimental Psychology: Human Perception and Performance, 29(2), 326–332.
Tiede, M., Bundgaard-Nielsen, R., Kroos, C., Gibert, G., Attina, V., Kasisopa, B., Vatikiotis-Bateson, E., & Best, C. (2010). Speech articulator movements recorded from facing talkers using two electromagnetic articulometer systems simultaneously. Proceedings of Meetings on Acoustics, 15-19 November, Cancún, Mexico, 4pSC10.
Tiede, M., & Mooshammer, C. (2013). Evidence for an articulatory component of phonetic convergence from dual electromagnetic articulometer observation of interacting talkers. Proceedings of Meetings on Acoustics, 2-7 June, Montreal, Canada, 3aSCa3.
Wagner, P., Malisz, Z., & Kopp, S. (2014). Gesture and speech in interaction: An overview. Speech Communication, 57, 209–232.