An experimental setup for capturing multimodal accommodation using dual
electromagnetic articulography, audio, and video
Lena Pagel1, Simon Roessig2, Doris Mücke1
1University of Cologne, Germany
2University of York, United Kingdom
lena.pagel@uni-koeln.de, simon.roessig@york.ac.uk, doris.muecke@uni-koeln.de
Abstract
When engaging in a conversation, interlocutors frequently
accommodate to each other in their speech patterns and co-
speech movements. However, only a small number of studies
have investigated both domains in a multimodal approach. An
additional challenge for studies is accounting for information
structure, which not only influences the production of speech
and co-speech motion in a speaker but also affects the patterns
of accommodation between speakers. Due to the high
complexity of the required experimental design, it has not yet
been comprehensively studied whether speakers accommodate
to each other in their strategies of encoding information
structure. This paper presents a methodological approach for
capturing multimodal focus marking patterns in dyads, which
allows us to address this research question. We introduce DiCE, a
cooperative game to elicit lexically and prosodically controlled
data in German, and present details of the experimental setup
involving dual EMA, audio, and video.
Keywords: electromagnetic articulography, dual EMA,
accommodation, multimodality, focus structure.
1. Introduction
Previous research has demonstrated that speakers frequently
accommodate to their interlocutors’ speech patterns and speech-
accompanying movements. There is evidence for convergence
of, e.g., head motion (Hale et al. 2020), manual gestures (Mol
et al. 2012), and postural sway (Shockley, Santana, & Fowler
2003). Interlocutors may also accommodate in terms of
intonation (Babel & Bulatov 2012), speaking rate and phrasing
(Cummins 2002), as well as acoustic properties of vowels and
consonants (Pardo et al. 2012; Nielsen 2011). Leveraging recent
technological advancements, a limited number of studies have
used dual electromagnetic articulography (dual EMA) to
elucidate accommodation in supra-laryngeal speech kinematics,
reporting results for jaw, lip, and tongue movements (Lee et al.
2018; Mukherjee et al. 2018; Tiede & Mooshammer 2013).
However, to date, only a few studies have integrated multiple
modalities of accommodation within a single experimental
setting (but cf. Duran & Fusaroli 2017; Louwerse et al.
2012; Oben & Brône 2016). To our knowledge, only one study
has analysed multimodal accommodation using dual EMA, with
results presented for a single dyad (Tiede et al. 2010).
One factor that markedly shapes a speaker’s production of
speech and co-speech movements is information structure
(Ladd 2008; Wagner, Malisz & Kopp 2014). Depending on,
e.g., the focus structure of an utterance, a particular word may
be produced with a larger F0 rise, a more distinct tongue
articulation, and/or a more pronounced head nod. Patterns of
focus marking typically show a high degree of speaker-specific
variability. This complicates the design of
studies on accommodation because comparing words that occur
under different structural circumstances may confound the
results. Furthermore, it has been shown that words are more
sensitive to interpersonal accommodation when they occur in
prosodically salient positions (Lee et al. 2018). This
underscores the potential of including controlled information
structure in experimental designs, particularly in those targeting
accommodation. It remains an open question
whether these patterns of marking information structure (or
more specifically, focus) are themselves subject to interpersonal
accommodation. We aim to address this question in the future
through an analysis of the recorded data set described below.
In this paper, we present a comprehensive methodological
approach to capturing multimodal accommodation across
various acoustic, visual, and kinematic levels of speech
production. With this, we hope to contribute valuable insights
for future studies sharing similar research goals. We introduce
DiCE (Dialogic Collecting Expedition), a cooperative card
game in German that provides a natural context to elicit speech
material controlled for segmental context of target words and
focus structure of utterances. We provide practical information
on the experimental setup with dual EMA, audio, and video.
The method makes it possible to capture multimodal accommodation of
focus encoding patterns within dyads. We have successfully
applied it in recordings of 15 dyads of German native speakers.
2. Methods
The complete game material for DiCE is publicly available for
future use at https://osf.io/9fmqh/.
2.1. Technical set-up and procedure
EMA and audio recordings are conducted using two 3D
electromagnetic articulographs (Carstens AG501 and AG501
Twin) and two head-mounted condenser microphones
(MicroMic C544 L, connected to a MicroMic MPA V L
phantom adapter). The technical setup is schematically
illustrated in Figure 1. Each of the two articulographs is
connected to its own SyBox (Carstens SyBox2); the two SyBoxes
are interlinked and connected to an interface (Tascam US4x4).
Additionally, each articulograph is linked to its own recording
laptop, and these laptops are interconnected (via a router that
does not access the internet) to allow for data transfer. The two
microphones are plugged into the same interface as the two
articulographs, which enables a temporal synchronisation of the
signal streams. The interface is connected to the recording
laptop associated with EMA1, where each recording sweep is
initiated. The EMA signal is simultaneously recorded on both
articulographs at 1250 Hz and then downsampled to 250 Hz. The
audio is recorded at 48 kHz with a bit depth of 16 bits.
Figure 1: Recording setup for EMA and audio.
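The downsampling from 1250 Hz to 250 Hz is performed by the Carstens recording software. Purely as an illustration of the operation, the following minimal sketch shows an equivalent anti-aliased decimation in Python (the sensor trace is synthetic):

```python
# Illustrative sketch only: the Carstens software performs this step natively.
# Anti-aliased downsampling of a 1250 Hz EMA position trace to 250 Hz.
import numpy as np
from scipy.signal import decimate

FS_RAW, FS_TARGET = 1250, 250        # sampling rates in Hz
FACTOR = FS_RAW // FS_TARGET         # integer decimation factor (5)

# Synthetic example trace: 2 s of a tongue-tip vertical position in mm.
t = np.arange(0, 2, 1 / FS_RAW)
tongue_tip_z = 10 + 2 * np.sin(2 * np.pi * 3 * t)

# decimate() low-pass filters before subsampling to avoid aliasing.
tongue_tip_z_250 = decimate(tongue_tip_z, FACTOR, ftype="fir")
print(tongue_tip_z_250.shape)        # (500,) samples at 250 Hz
```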
EMA sensors are attached to both speakers on the torso (both
shoulders, chest, and spine), the head (forehead, eyebrows, and
behind both ears), and the articulators (lower jaw, upper and
lower lip, both corners of the mouth, tongue tip, and tongue
dorsum), as illustrated in Figure 2.
Figure 2: Illustration of EMA sensor placements.
In addition to EMA and audio recordings, videos are captured
using three cameras strategically positioned in the room: Two
cameras are placed on the table between the two articulographs,
each directed at one participant (2x GoPro Hero9 Black), and
one camera is positioned to capture both participants together
from the side (Panasonic HC-V520). The videos are recorded at
a resolution of 1080p, a frame rate of 50 fps, and a constant
shutter speed of 1/100 s. Post-hoc synchronisation of the videos
to the EMA and audio recordings is achieved using the auditory
signal from a clapperboard at the beginning of each recording
sweep.
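As a minimal sketch of how such clap-based alignment can be computed (the file names below are hypothetical; in practice we trim the videos in DaVinci Resolve, cf. Section 2.4), the offset between a camera's audio track and the reference microphone signal can be estimated via cross-correlation:

```python
# A minimal sketch of clap-based alignment; file names are hypothetical.
import numpy as np
import soundfile as sf
from scipy.signal import correlate

ref, fs = sf.read("mic_speakerA.wav")        # reference microphone audio
cam, fs_cam = sf.read("gopro_speakerA.wav")  # audio track ripped from video
assert fs == fs_cam, "resample one stream first if the rates differ"

# Mix down to mono and keep the first 10 s, which contain the clap.
ref = np.atleast_2d(ref.T).mean(axis=0)[: 10 * fs]
cam = np.atleast_2d(cam.T).mean(axis=0)[: 10 * fs]

# The lag maximising the cross-correlation estimates the camera offset:
# a positive lag means the clap occurs later in the camera file, so that
# much material should be trimmed from the start of the video.
lag = np.argmax(correlate(cam, ref, mode="full")) - (len(ref) - 1)
print(f"trim {lag / fs:.3f} s from the start of the video")
```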
Each recording lasts between three and four hours, including
preparations and breaks. The recordings consist of two main
parts: one solo condition per participant and one dialogue
condition with both participants (cf. Figure 3). In the solo
condition, each speaker is recorded individually in a simplified
digital version of the experimental game. After both speakers
have completed their solo condition, they are introduced to each
other and cooperatively play the experimental game in the
dialogue condition.
Figure 3: Scheme of recording procedure with two
participants (A and B) in two rooms. Prep. I includes
informed consent, attachment of EMA sensors and
instructions for the solo condition; prep. II involves a
break and instructions for the dialogue condition.
2.2. Speech material
The speech material produced in the experiment is highly
controlled both lexically and prosodically. To allow for an
analysis of supra-laryngeal articulation using EMA, eight
carefully chosen target words are included, comprising four
objects and four cities (cf. Table 1). Their lexically stressed
penultimate syllable, which occurs in a controlled segmental
context, can be used for analyses. As objects, we selected
existing words in German. As cities, we selected two words for
existing cities (Medina, Manila), one word for an existing city
borrowed from Italian (Milano), and one pseudoword (Benali).
Participants in our recordings encountered no difficulties in
producing these words, nor did they report any unease.
Table 1: Target words.

objects                    cities
Bohne 'bean' [ˈboːnə]      Medina [meˈdiːna]
Mode 'fashion' [ˈmoːdə]    Manila [maˈniːla]
Vase 'vase' [ˈvaːzə]       Milano [miˈlaːno]
Made 'maggot' [ˈmaːdə]     Benali [beˈnaːli]
These target words are embedded in consistent question-answer
sets in German, which the two participants produce. Each set
contains two questions and two answers (cf. Table 2). Speakers
are instructed to consistently adhere to the lexical structure of
the carrier sentence and only replace the object and/or city in the
utterance. This approach ensures that the corpus includes a
substantial amount of lexically consistent speech material.
Table 2: Exemplary question-answer set.

Q1  Habe ich die Bohne aus Medina auf der Hand?
    (Am I holding the bean from Medina?)
A1  Du hast die Mode aus Medina auf der Hand.
    (You are holding the fashion from Medina.)
Q2  Wo? (Where?)
A2  Da. (There.)
Utterance A1 is of particular interest, since its information
structure is controlled. Based on the preceding question Q1, it
is produced with one of three possible focus structures: either
(i) the object is in corrective focus and the city is in the
background, as in the example in Table 2, (ii) the object is in
the background and the city is in corrective focus, or (iii) both
the object and city are in corrective focus.
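This mapping is deterministic: comparing the target words of Q1 and A1 determines which constituents are focused. The following sketch (a hypothetical helper, not part of our annotation pipeline) makes it explicit:

```python
# A sketch (not our annotation code) deriving the focus structure of A1
# from which target words were substituted relative to the question Q1.
def focus_structure(q_object, q_city, a_object, a_city):
    """Return which constituent(s) of A1 carry corrective focus."""
    object_focus = a_object != q_object
    city_focus = a_city != q_city
    if object_focus and city_focus:
        return "object and city in corrective focus"
    if object_focus:
        return "object in corrective focus, city in background"
    if city_focus:
        return "city in corrective focus, object in background"
    raise ValueError("repeating the questioned card is disallowed")

# The example from Table 2: only the object is replaced.
print(focus_structure("Bohne", "Medina", "Mode", "Medina"))
# -> object in corrective focus, city in background
```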
2.3. Task
The speech material is obtained through a custom-designed card
game called DiCE (Dialogic Collecting Expedition). In the
dialogue condition, the game is played cooperatively by the two
participants, while in the solo condition, a simplified digital
version is played by each participant separately in front of a
screen. In the game narrative, the subjects assume the roles of
collectors who have discovered a basement filled with valuable
items. These items are the target words, i.e. the four objects
from the four cities described above (cf. Table 1). They are
represented by 16 playing cards (4 objects × 4 cities). The
participants’ objective in the game is to organise the cards based
on the objects’ values and origins. They succeed when they have
collectively arranged the cards into four piles in the middle of
the table – one pile for each city of origin, with the four objects
in ascending order from value one to four.
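As a compact sketch of this goal state (the value assignment below is hypothetical; the actual values are defined by the game material), the win condition can be expressed as follows:

```python
# A sketch of the goal state; the object-value mapping is hypothetical.
OBJECT_VALUES = {"Bohne": 1, "Mode": 2, "Vase": 3, "Made": 4}

def collection_complete(piles):
    """True if there is one pile per city, each stacking its four
    objects in ascending order from value one to four."""
    return len(piles) == 4 and all(
        [OBJECT_VALUES[obj] for obj in pile] == [1, 2, 3, 4]
        for pile in piles.values()
    )

print(collection_complete({
    city: ["Bohne", "Mode", "Vase", "Made"]
    for city in ["Medina", "Manila", "Milano", "Benali"]
}))  # True
```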
In the dialogue condition, each participant has three cards on a
stand in front of them, positioned so that each participant sees
only the other participant's cards, not their own.
Through the question-answer sets (cf. Table 2), each participant
aims to identify their own cards. One participant asks if they are
holding a specific card, and their interlocutor replies. Then, they
ask where, and their interlocutor replies and points to the
intended card. When the participant finds a suitable card in their
hand, they place it on the table, contributing to the incremental
and cooperative ordering of the collection. A photo of the game
in the dialogue condition is shown in Figure 4.
Figure 4: Two participants sitting underneath
articulographs, playing the dialogic game.
Crucially, the task is complicated by constraints on the
cards participants are allowed to mention in their responses.
These constraints are designed to elicit the three focus structures
for utterance A1 (cf. Table 2), as described before. They are
introduced as strict communication rules within the community
of collectors. Participants are prohibited from answering with a
simple “yes” or “no” to their interlocutor’s question Q1. Instead,
they are required to refer to a different card than the one asked
about, but always one genuinely present on their interlocutor’s
stand. This can be accomplished by substituting either the word
for the object or the city mentioned in the question. For instance,
if one participant asks if they have the vase from Milano, the
other speaker can, e.g., refer to the bean from Milano, or to the
vase from Benali, or, if no suitable card is present, to the bean
from Benali. In this manner, speakers produce one of the three
intended focus structures in the target utterance A1. The two
participants take turns and have the freedom to choose which
card to ask about and which one to refer to in their response,
within the given rules. Penalty points are assigned in cases
where they refer to an incorrect card or fail to adhere to the specified
lexical structures. In total, six rounds of the game are played in
the dialogue condition. The number of question-answer sets
produced in one round varies between dyads and rounds.
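The response constraint can be summarised algorithmically: a valid answer names a card on the interlocutor's stand other than the one asked about, substituting the object, the city, or, only if neither is possible, both. A minimal sketch (hypothetical helper, not the game's implementation):

```python
# A sketch of the response rule; a Card pairs an object with a city.
Card = tuple[str, str]  # (object, city)

def valid_answers(asked: Card, stand: list[Card]) -> list[Card]:
    """Cards the speaker may name in A1, single substitutions first."""
    candidates = [c for c in stand if c != asked]
    # Exactly one feature shared with the asked card = single substitution.
    single = [c for c in candidates
              if (c[0] == asked[0]) != (c[1] == asked[1])]
    return single if single else candidates  # fall back to double substitution

stand = [("Bohne", "Milano"), ("Vase", "Benali"), ("Made", "Manila")]
print(valid_answers(("Vase", "Milano"), stand))
# -> [('Bohne', 'Milano'), ('Vase', 'Benali')]
```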
In the solo condition, where only one speaker is present in the
room, they are seated in front of a screen and engage with a
simplified digital version of the game (cf. Figure 5). In this
setup, the participant exclusively responds to questions and does
not initiate questions themselves. To present the stimuli,
OpenSesame (Mathôt, Schreij & Theeuwes 2012) is used. For
each question-answer set, three cards along with the question
Q1 are displayed on the screen. The participant produces their
response A1. Subsequently, the screen displays the question Q2,
and the speaker points towards the intended card while
producing their response A2. In total, each participant is
prompted to produce 76 answers for both A1 and A2 (4 objects
× 4 cities × 2 focus conditions × 2 renditions + 12 additional
trials), with randomisation of trial order applied per participant.
Figure 5: One participant sitting underneath the
articulograph, playing the game in the solo condition.
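The composition of the solo trial list can be sketched as follows (the experiment itself runs in OpenSesame; how the 12 additional trials are composed is an assumption here):

```python
# A sketch of the solo-condition trial list; the sampling of the 12
# additional trials is an assumption for illustration.
import itertools
import random

OBJECTS = ["Bohne", "Mode", "Vase", "Made"]
CITIES = ["Medina", "Manila", "Milano", "Benali"]
FOCUS = ["object_focus", "city_focus"]

# 4 objects x 4 cities x 2 focus conditions x 2 renditions = 64 core trials.
core = list(itertools.product(OBJECTS, CITIES, FOCUS)) * 2
extra = random.sample(core, 12)      # 12 additional trials -> 76 in total

trials = core + extra
random.shuffle(trials)               # trial order randomised per participant
print(len(trials))                   # 76
```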
2.4. Corpus example
For our data set, 30 participants were recorded with the
described methods at IfL Phonetics, University of Cologne,
Germany. The participants ranged in age from 18 to 36 years
(mean: 24.67, SD: 4.51) and had grown up in Germany with
German as their native language. Six of the participants were
bilingual, having at least one additional native language, but
German was reported as the dominant language for all speakers.
Seventeen of the subjects were female, 12 male, and one non-binary. The
participants were naive to the purpose of the experiment and did
not possess advanced knowledge of phonetic sciences. While
some participants mentioned the ability to speak a German
dialect, they all spoke accent-free standard German during the
experiment. For each recording, two participants were paired
into a dyad with no constraints other than not being previously
familiar with each other, resulting in a total of 15 dyads.
Following the data recording, several processing steps were
required before the analysis of the data set. The video files were
synchronised with the microphone-recorded audio files using
the auditory signal from the clap, and they were trimmed to the
same length using DaVinci Resolve. The audio files were
transcribed with automatic speech recognition using OpenAI
Whisper and manually checked for errors. Then, they were
annotated based on the transcript using the Montreal Forced
Aligner (McAuliffe et al. 2017), with manual corrections and
the addition of further annotation tiers. EMA files were
processed using the ema2wav converter (Buech et al. 2022).
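As a minimal sketch of the transcription and alignment steps (model size, file paths, and the German MFA model names are assumptions; all output is checked manually, as described above):

```python
# A sketch of the transcription step with OpenAI Whisper; model size and
# file paths are assumptions, and the output is corrected manually.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("dyad01_sweep03.wav", language="de")
with open("dyad01_sweep03.txt", "w", encoding="utf-8") as f:
    f.write(result["text"].strip())

# Forced alignment then runs from the shell with the Montreal Forced
# Aligner; dictionary and acoustic model names are assumptions:
#   mfa align corpus/ german_mfa german_mfa aligned/
```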
An example of the multimodal data set that we recorded with
the presented methods is illustrated in Figure 6, showcasing an
exemplary question-answer set from the dialogue condition of
one dyad. Four parameters (namely lip aperture, vertical tongue,
head, and eyebrow motion) are selected from the extensive array
of possible supra-laryngeal and co-speech kinematics, aiming to
exemplify the nature of the recorded multimodal data.
3. Discussion and conclusion
We have introduced a methodological approach for capturing
multimodal accommodation, providing practical details on the
technical setup, procedure, speech material, and task. Through
the cooperative card game DiCE (Dialogic Collecting
Expedition), lexically and prosodically controlled German
speech material is elicited within an engaging scenario, and is
recorded with dual 3D electromagnetic articulography, audio,
and video. The synchronisation of multiple data streams makes
it possible to analyse a wide range of parameters in the auditory
and visual modalities, potentially shedding new light on their
spatiotemporal interrelations. This novel approach makes it
possible to investigate in a fine-grained manner whether
multimodal patterns of focus encoding are subject to
interpersonal accommodation.
4. Acknowledgements
The authors would like to thank Theo Klinker, Tabea Thies,
Katinka Wüllner, and Elisa Herbig for their help with the
recordings. This work was supported by the German Research
Foundation (DFG) as part of the SFB1252 “Prominence in
Language” (Project-ID 281511265, project A04) and the
a.r.t.e.s. Graduate School for the Humanities Cologne.
5. References
Babel, M., & Bulatov, D. (2012). The Role of Fundamental Frequency
in Phonetic Accommodation. Language and Speech, 55(2), 231–248.
Buech, P., Roessig, S., Pagel, L., Hermes, A., & Mücke, D. (2022).
ema2wav: doing articulation by Praat. Proceedings of Interspeech,
18-22 September 2022, Incheon, Korea, 1352–1356.
Cummins, F. (2002). On synchronous speech. Acoustic Research
Letters Online, 3(1), 7–11.
Duran, N. D., & Fusaroli, R. (2017). Conversing with a devil’s
advocate: Interpersonal coordination in deception and disagreement.
PLoS ONE, 12(6), e0178140, 1–25.
Hale, J., Ward, J. A., Buccheri, F., Oliver, D., & Hamilton, A. F. d. C.
(2020). Are You on My Wavelength? Interpersonal Coordination in
Dyadic Conversations. Journal of Nonverbal Behavior, 44, 63–83.
Ladd, D. R. (2008). Intonational Phonology. Cambridge University Press.
Lee, Y., Danner, S. G., Parrell, B., Lee, S., Goldstein, L., & Byrd, D.
(2018). Articulatory, acoustic, and prosodic accommodation in a
cooperative maze navigation task. PLoS ONE, 13(8), e0201444.
Louwerse, M. M., Dale, R., Bard, E. G., & Jeuniaux, P. (2012).
Behavior matching in multimodal communication is synchronized.
Cognitive Science, 36, 1404–1426.
Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-
source, graphical experiment builder for the social sciences.
Behavior Research Methods, 44(2), 314–324.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M.
(2017). Montreal Forced Aligner: trainable text-speech alignment
using Kaldi. Proceedings of Interspeech, 20-24 August, Stockholm,
Sweden, 498–502.
Mol, L., Krahmer, E., Maes, A., & Swerts, M. (2012). Adaptation in
gesture: Converging hands or converging minds? Journal of Memory
and Language, 66, 249–264.
Mukherjee, S., Legou, T., Lancia, L., Hilt, P., Tomassini, A., Fadiga, L.,
D’Ausilio, A., Badino, L., & Nguyen, N. (2018). Analyzing vocal
tract movements during speech accommodation. Proceedings of
Interspeech, 2-6 September, Hyderabad, India, 561–565.
Nielsen, K. (2011). Specificity and abstractness of VOT imitation.
Journal of Phonetics, 39, 132–142.
Oben, B., & Brône, G. (2016). Explaining interactive alignment: A
multimodal and multifactorial account. Journal of Pragmatics, 104,
32–51.
Pardo, J. S., Gibbons, R., Suppes, A., & Krauss, R. M. (2012). Phonetic
convergence in college roommates. Journal of Phonetics, 40, 190–
197.
Shockley, K., Santana, M.-V., & Fowler, C. A. (2003). Mutual
interpersonal postural constraints are involved in cooperative
conversation. Journal of Experimental Psychology: Human
Perception and Performance, 29(2), 326–332.
Tiede, M., Bundgaard-Nielsen, R., Kroos, C., Gibert, G., Attina, V.,
Kasisopa, B., Vatikiotis-Bateson, E., & Best, C. (2010). Speech
articulator movements recorded from facing talkers using two
electromagnetic articulometer systems simultaneously. Proceedings
of Meetings on Acoustics, 15-19 November, Cancún, Mexico,
4pSC10.
Tiede, M., & Mooshammer, C. (2013). Evidence for an articulatory
component of phonetic convergence from dual electromagnetic
articulometer observation of interacting talkers. Proceedings of
Meetings on Acoustics, 2-7 June, Montreal, Canada, 3aSCa3.
Wagner, P., Malisz, Z., & Kopp, S. (2014). Gesture and speech in
interaction: An overview. Speech Communication, 57, 209–232.
Figure 6: Visualisation of recorded multimodal data for four selected parameters during one question-answer set by one
dyad (speakers A and B). Based on the question Q1 (speaker A), the answer A1 (speaker B) has a focus structure with the
target object in corrective focus and the target city in the background. Target words in A1 are marked by yellow rectangles.