Leading voices: Dialogue semantics, cognitive science, and the polyphonic structure of multimodal interaction
Andy Lücking
Université Paris Cité, Laboratoire de Linguistique Formelle (LLF)
Goethe University Frankfurt
andy.luecking@u-paris.fr
Jonathan Ginzburg
Université Paris Cité, Laboratoire de Linguistique Formelle (LLF)
yonatan.ginzburg@u-paris.fr
2022
To appear in Language and Cognition—please cite published version only
Abstract
The neuro-cognition of multimodal interaction—the embedded, embodied, predictive processing of vocal and non-vocal communicative behaviour—has developed into an important sub-field of cognitive science. It leaves a glaring lacuna, however, namely the dearth of precise investigation of the meanings of the verbal and non-verbal communication signals that constitute multimodal interaction. Cognitively construable dialogue semantics provides a detailed and context-aware notion of meaning, and thereby contributes the content-based identity conditions needed for distinguishing multimodal constituents that are defined syntactically or in terms of form. We exemplify this by means of two novel empirical examples: dissociated uses of negative polarity utterances and head shaking, and attentional clarification requests addressing speaker/hearer roles. On this view, interlocutors are described as co-active agents, thereby motivating the replacement of sequential turn organisation as a basic organizing principle with notions of leading and accompanying voices. The Multimodal Serialization Hypothesis is formulated: multimodal natural language processing is driven in part by a notion of vertical relevance—relevance of utterances occurring simultaneously—which we suggest supervenes on sequential ('horizontal') relevance—relevance of utterances succeeding each other temporally.
keywords
dialogue semantics; multimodal interaction; turn taking; overlap; clarification requests
1 Introduction
Let's face it: it's all about meaning. A phoneme is the smallest meaning-distinguishing sound, a morpheme a meaning-carrying form. Most distinctions even in syntax—long regarded as the core of linguistics—are based on semantic considerations. Now, investigating meanings poses a perplexing problem: we cannot directly encounter them, point at them, or count them, and talking about meaning itself requires meaning. There are different ways to proceed in this situation. In psycholinguistics, for instance, experimental studies are used, where meaning is probed indirectly via observable features of language users' processing of stimulus sentences. A quite different approach has been developed in philosophy and formal semantics: here the act of interpretation is objectified in terms of mathematical models, that is, 'small worlds' within which semantic representations of natural language expressions are evaluated. Both approaches exemplify research programs that target distinct levels of meaning: this has recently been discussed in terms of the Marrian (Marr, 1982) distinction between implementation and computation, or neural activity vs. behaviour (Krakauer et al., 2017), and in terms of cognitive architectures complementing algorithmic-representational models (Cooper and Peebles, 2015), among others. With regard to language, there has been a longstanding collaboration: answers to What? questions are provided by formal grammar and theoretical linguistics, while How? questions are addressed in psycholinguistics. Yet this cooperation cooled down for a while (Ferreira, 2005).
There are several reasons for the disenchantment. With regard to meaning proper—that is, semantics—we think that theoretical linguistics 'underaccomplishes' the obligation to provide cognitively potent models of meaning, given mainstream formal semantics' sentence-oriented approach. The reason is this: consider a toy world that consists of three individuals, a (Aydın), n (Nuria), and x (Xinying). A mainstream model-theoretic approach to semantics maps natural language expressions onto terms of a formal language (mostly predicate logic), which in turn are interpreted in terms of the individuals of a world (denotation or reference). The meaning of a one-place predicate like sleep, for instance, is the set ⟦sleep′⟧ assigned to the formal translation sleep′ of the verb, and in our toy model (let us assume) is {a, x} (i.e., Aydın and Xinying sleep). The meaning of the sentence Aydın sleeps is compositionally derived as sleep′(a) and is true iff (abbreviating if and only if) a ∈ ⟦sleep′⟧. However, the formulæ used in traditional formal semantics (e.g., sleep′(a)) are dispensable: they eventually get reduced to the basic notions of truth and reference (sleep′(a), for instance, is true in our toy model) and therefore have no cognitive bearing. Hence, while being formally precise, it is unclear whether an approach of this kind succeeds in 'formulat[ing] the computational problem under consideration', as Cooper and Peebles (2015, 245) put it.
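To make the toy model tangible, here is a minimal sketch in Python (entirely our illustration; no formal-semantics library is assumed): the interpretation function assigns a denotation set to sleep′, and the truth of sleep′(a) reduces to set membership.

```python
# Toy model-theoretic evaluation; all names are illustrative.
individuals = {"a", "n", "x"}            # Aydin, Nuria, Xinying

# Interpretation: the denotation of sleep' is the set of sleepers.
denotation = {"sleep": {"a", "x"}}

def true_in_model(predicate: str, argument: str) -> bool:
    """sleep'(a) is true iff a is a member of [[sleep']]."""
    return argument in denotation[predicate]

print(true_in_model("sleep", "a"))  # True: 'Aydin sleeps'
print(true_in_model("sleep", "n"))  # False: 'Nuria sleeps'
```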
Nonetheless, over the past 30 years theoretical linguistics has developed a different sort of formal model of meaning, namely dynamic update semantics—most notably Discourse Representation Theory (Kamp and Reyle, 1993)—where the construction of semantic representations is constitutive of meaning (Kamp, 1979, 409) and has cognitive (Hamm et al., 2006) and neuroscientific (Brogaard, 2019) interpretations (see also Garnham, 2010). The sentence Aydın sleeps is processed within a dynamic update semantics in such a way that a file1 for an object x (due to the proper name Aydın) is opened (if new) or continued (if known). We emphasize this detail since it reveals a dynamic shift in the notion of meaning: the meaning of an utterance updates a previous context and returns an updated context. Hence, reducing the meaning of an assertive sentence to truth and reference is replaced (or at least complemented) by its context change potential. The (new or continued) file is then populated with the conditions that x is named Aydın (if not already known) and that x is sleeping.2 A dynamic update semantics rooted in spoken language—also known as dialogue semantics—is KoS (Ginzburg, 1994, 2012). KoS [not an acronym] is formulated by means of types from a Type Theory with Records (TTR; Cooper, in press; Cooper and Ginzburg, 2015) instead of terms and expressions from an interpreted language like predicate logic. There is a straightforward model-theoretic, denotational construal of types much in the spirit of classical formal semantics (Cooper, in press), but one can also think of types as symbolic but embodied structures which are rooted in perception (as Cooper, in press, points out), which label instances of linguistic processing (Connell, 2019; Frankland and Greene, 2020), and which are associated with motor and perception activation (Bickhard, 2008; Hummel, 2011; Meteyard et al., 2012). Indeed, types can also be construed neurally (Cooper, 2019).

1 The metaphor of files and file changing is due to Heim (1982); in cognitive science the closely related notion of mental files is used (Perner et al., 2015); see also their reemergence in the philosophy of mind (Recanati, 2012).

2 This is the minimal information that is received from the sentence. One can also add that Aydın very likely is human since it is a common first or family name, and, in recent memory-oriented approaches, that the semantic value for the proper name is to be found in long-term memory (Cooper, in press; Ginzburg and Lücking, 2020).
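The file-change metaphor can be sketched in a few lines of Python (a deliberately crude illustration of the update idea, not of DRT's actual machinery): a context maps discourse referents ('files') to condition sets, and processing an utterance returns an updated context.

```python
# A sketch of file-based context update; the representation is our own.
def update(context: dict, referent: str, conditions: set) -> dict:
    """Open a file for the referent if new, continue it if known,
    and populate it with the new conditions."""
    new_context = {ref: conds.copy() for ref, conds in context.items()}
    new_context.setdefault(referent, set()).update(conditions)
    return new_context

# Processing 'Aydin sleeps': the meaning is the context change itself.
ctx = update({}, "x", {"named(x, Aydin)", "sleep(x)"})
print(ctx)   # {'x': {'named(x, Aydin)', 'sleep(x)'}}
```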
Why promote dialogue semantics, and to a cognitive science audience? Cognitive science has come to acknowledge that multimodal interaction is the 'central ecological niche' of sentence processing (Holler and Levinson, 2019, 639). The dominant view on interaction and coordination in cognitive science is a systemic view: interlocutors are observed and construed as a complex system—there is work on systemic coupling on neural, behavioral, attentional, and goal-predicting levels (Fusaroli et al., 2014; Hasson et al., 2012; Pickering and Garrod, 2013; Pickering and Garrod, 2004; Sebanz and Knoblich, 2009).

However, while a systemic view certainly provides important insights into the neural and cognitive underpinnings of alignment and communication within its ecological niche, we argue that significant lacunae remain unless it is complemented by analyses of the verbal and non-verbal signals (and their interactions) that constitute multimodal communication: the multimodal, interactive turn in cognitive science induces a renewed need for a precise formulation of the computational What? problem. Simplifying to a necessary degree, Figure 1 summarizes the semantic position within the multimodal discourse landscape. We focus on contents (cont) here and demonstrate throughout how contents depend to a very large extent on a fine-grained structured context (ctxt).
Figure 1: A dialogue semantics perspective for completing the systemic understanding of multimodal discourse. [The diagram, omitted here, contrasts the neuro-cognitive sciences' perspective on multimodal interaction—embedded, embodied processing; coupling and prediction—with the dialogue semantics perspective (the focus of this paper) on speech-signs (phon, syn, ctxt, cont) and gesture-signs (shape, ctxt, cont), both executed in processing.]

In particular we argue that a dialogue-semantics perspective makes at least three crucial contributions:

• Dialogue semantics provides a formal notion of content that is needed in order to define different kinds of cross-modal signals. From gesture studies we have the notion of multimodal ensembles (Kendon, 2004)—utterances including speech–gesture composites—and from psycholinguistics that of multimodal gestalts (Holler and Levinson, 2019)—recurrent, statistically significant multimodal actions, signals or features which are interlinked by a (common) communicative intent or meaning.3 However, recurrent ensembles or gestalts often occur with a simplification in form (Lücking et al., 2008). This raises the issue of how formally different gestalt or ensemble tokens are assigned to a common type instead of to different ones. Moreover, how should one account for communicative signals, features or utterances which do not belong to a unique ensemble or gestalt? We argue, mainly on the basis of data from head shaking (subsections 2.1 and 4.2), that an explicit semantic analysis is needed to provide the required identity conditions and, among other things, to tell apart multimodal behaviour that, with respect to its perceptual forms, deceptively looks like a unified composite utterance.

• In line with research on attentional mechanisms (Mundy and Newell, 2007), we discuss (non-)attending to interlocutors as new attentional data and argue that it can be used to explain—as far as we know—hitherto unstudied occurrences of specific types of other-repair in discourse targeting the speaker and hearer roles.

• Timing and coherence within multimodal interaction are a subject sui generis for both cognitive science and dialogue semantics: dialogue agents are co-active during more or less the whole time of an interaction—see also the analysis of Mondada (2016).4 Accordingly, the notion of turn should be replaced by the notion of leading voice. And this applies even to spoken contributions, where, despite the entrenched assumption of one speaker per turn—assumed in Conversation Analysis to be one of the essential and universal structuring notions of conversation (Levinson and Torreira, 2015)—overlap is in certain situations and with inter-subject variation an acceptable option (Bennett, 1978; Falk, 1980; Hilton, 2018; Tannen, 1984; Yuan et al., 2006). In this respect, multimodal interaction is akin to a polyphonic musical piece.5 Just as polyphonic music is organized by harmonic or contrapuntal composition techniques, polyphonic interaction is driven partly by dialogical relevance or coherence.6 Note that the terms 'leading' and 'accompanying voices' also give rise to a subjective interpretation: a speaker or gesturer may have the impression of holding the leading voice in a conversation regardless of observational evidence.

3 In fact, there is information-theoretic evidence for such gestalts at least on the level of manual co-speech gestures (Mehler and Lücking, 2012). The notion of 'local gestalts' used by Mondada (2014) seems to be a generalization of the notion of ensembles, but to lack the statistical import gained from recurrence.

4 A more conservative view seems to be embraced by Streeck and Hartge (1992), who analyze mid-turn gestures as 'contextualiz[ing] "next speech units"', including a preparation of potential transition places (p. 137). This view is reinforced in Streeck (2009, chap. 8).
Section 2 illustrates the above-mentioned challenges posed by multimodal phenomena. Section 3 then sketches a formal theory of multimodal interaction that involves: (i) semantic representations which can (and should) be construed as cognitive information state representations, (ii) partiturs (multimodal input representations), and (iii) lexical entries and conversational update rules that capture dialogical relevance, enabling incremental and predictive processing. The machinery is applied to analyze the sample observations in section 4. The formal theory may appear complex to those exposed to it for the first time or who do not endorse formal approaches, but its expressive granularity has been developed in light of many diverse dialogical phenomena, as explained in subsection 3.2. In particular, it facilitates formulating our ultimate upshot in section 5 in an explicit way. Our claim, the multimodal serialization hypothesis, is that vertical relevance—relevance of utterances occurring simultaneously—supervenes7 on horizontal relevance—relevance of utterances succeeding each other temporally. Hence, multimodality compresses interaction temporally, but is not richer in terms of semantic expressivity. In other words, and with certain caveats we will spell out, simultaneous interaction, though more efficient and perhaps more emotionally engaging and aesthetically pleasing, can always be serialized without loss of semantic information. This is a rather strong claim, and it needs to be refined right away. On the one hand, there are multimodal signals which simply cannot be separated—for instance, you cannot separate spoken utterances from their intonation: they are co-articulated. This is, however, not just due to a common channel: speech-laughter is transmitted via the acoustic channel but can be separated into speech and laughter. On the other hand, serializing multimodal input gives rise to different possible orderings. We do not claim that every ordering of the elements of a multimodal input, when put into a sequence, is equivalent—quite the contrary: we provide evidence to that effect below. But, in accordance with the claim, one of the possible orderings is semantically equivalent to the original multimodal input. Simultaneity and sequentiality in multimodal interaction can become manifest in two ways: (i) across interlocutors, and (ii) within one interlocutor. The multimodal serialization hypothesis intentionally generalizes over both manifestations (in fact, the empirical phenomena discussed in the following involve both kinds). Given these qualifications, the expressivity claim is a hypothesis that has to be explored in multimodal communication studies by cognitive science, theoretical linguistics, gesture studies, and related disciplines.

5 Thinking of conversational interaction in musical terms has been proposed by Thompson (1993), whereas Clark (1996, Chapter 2: 50) mentions string quartets as a 'mostly nonlinguistic joint activity'. In fact, string quartets were originally inspired by the 18th century French salon tradition (Hanning, 1989). Duranti's paper (Duranti, 1997) documents what he calls 'polyphony' ('normative overlap') in Samoan ceremonial greetings. Based on a convergent effect of joint musical improvisation on the alignment of body movements and periodicity across speech turns, it has recently been argued that music and linguistic interaction both belong to a common human communicative facility (Daltrozzo and Schön, 2009; Robledo et al., 2021). However, despite the fact that we use the term leading voice in the very title, we use it here solely as a metaphor for depicting the structure of multimodal communication. In particular we do not derive strong implications for the organization of dialogue (or music) from it; in fact, other comparisons such as contrapuntal structure serve similar purposes, as we discuss below.

6 These two terms are frequently used interchangeably; we use the former for consistency with earlier work in the framework utilized in this paper, KoS. Coherence has been emphasized as a fundamental principle of the alignment of manual co-speech gesture and speech by Lascarides and Stone (2009).

7 Supervenience is a non-reductionist but asymmetric mode of dependence (see e.g. Kim, 1984) which, with respect to the multimodal serialization hypothesis, can be paraphrased as follows: any difference in the set of properties of vertical coherence is accounted for by some difference in the set of properties of horizontal coherence, but not the other way round. In this sense, vertical relevance depends on horizontal relevance but is not ontologically reduced to it.
2 Observations
2.1 Head shake
Eight uses of the head shake are documented by Kendon (2002). The best known (Kendon's use I) is as a non-verbal expression of the answer particle 'No'. Thus, a head shake can be used in order to answer a polar question:

(1) a. A: (i) Do you want some coffee? / (ii) You don't want some coffee?
b. B: head shake

Depending on whether A produced a negative or a positive propositional kernel in the question, B's head shake is either a denial of the positive proposition or a confirmation of the negative one (a use not discussed by Kendon, 2002). In uses such as those documented in (1) the head shake conveys a proposition. However, the proposition expressed by the head shake is in part determined by the context in which it occurs—in (1) it can be one of two contradictory dialogue moves, a denial or a confirmation. Hence, what is needed for instances such as (1) is a notion of contextually aware content. We provide such a content in subsection 4.2 below.
With a context-aware semantic representation of denial at our disposal, we can derive predictions for head shakes in other contexts as well. Consider (2):
(2) a. I don’t believe you.
head shake
b. ? I do believe you.
head shake
While (2a) is fully coherent, (2b) (at least without additional context—examples of which we
provide in subsection 4.2) has a contradictory flavor: the head’s denial is not matched in speech.
Hence, in order to discuss apparently simple uses of head shakes one already has to draw on a
precisely formulated, contextually-aware notion of contents.
2.2 Co-activity and communicative breakdown
A well-known pattern of co-activity in spoken discourse is the interplay of monitoring and feedback signals. For instance, backchannelling signals such as nodding or vocalizations such as 'mhm' influence the development of discourse (Bavelas and Gerwing, 2011). The absence of monitoring or feedback signals leads to communicative breakdown since it raises the question of whether one is still engaged in the joint interaction. Suppose A and B are sitting on a window seat in a café. A is telling B about a near-accident she witnessed on Main Street the day before. While A has been talking, B has been continuously staring out of the window. Thus, A lacks attentional gaze signals, which in turn raises doubts about B's conversational involvement. Accordingly, A will try to clarify B's addressee role:

(3) Hey, are you with me?

A's clarification request or (other-initiated) repair8 is a natural response in the sketched situation since it is triggered by a neuro-cognitive social attention mechanism (Nummenmaa and Calder, 2009) in response to a violation of a behavioral norm. However, seen from a turn-based view, (3) is not easy to explain: A is speaking and B is listening, so the all-important roles of hearer and speaker are clearly filled—and now it is B's turn. Crucially, (3) could equally be made by B if s/he gets the impression A is rambling incoherently.

8 We assume these two terms are synonymous, the former often used in the dialogue community, the latter among Conversation Analysis researchers.
2.3 Summary
The upshot of the few phenomena we have discussed above is that multimodal interaction is:
•driven by a richly structured and fine-grained context
•which is distinct but aligned across the participants
•where the participants typically monitor each other’s co-activity.
In the following we introduce a theoretical dialogue framework which can capture these observations.
3 Polyphonic interaction: Cognitive-formal tools
3.1 Partiturs
A prerequisite for an analysis of multimodal interaction is a systematic means of telling apart the manifold verbal and non-verbal signals. We employ tiers in this respect, where a tier is built following the model of phonetics and phonology. Phonetics comprises the triple of articulatory phonetics, acoustic phonetics, and auditory phonetics. The signaling of other communication means can be construed in an analogous way. For instance, the triple facial muscles—facial display—vision defines the tier for facial expression. Tiers give rise to a uniform approach to linguistic analysis which ultimately rests on perceptual classification (cf. Cooper, in press), which we formulate in terms of TTR (a Type Theory with Records; Cooper, in press; Cooper and Ginzburg, 2015). Classification in TTR is expressed as a judgement: generally, object o is of type T. With regard to spoken utterances, a record r (a situation) providing a sound event (construed as an Individual)—r = [sevent = a, c = s1]—is correctly classified by a record type T = [sevent : Ind, c : Sign(sevent)] (i.e., r : T) iff the object labeled 'sevent' (in this case, a soundwave) belongs to the phonological repertoire of the language in question.9

9 Sign is modeled in terms of the phonology–syntax–semantics structures developed in Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994). We abstract over a speaker's knowledge of a language and the language system where it does not do any harm, as an anonymous reviewer of Language and Cognition observed. A speaker who is not aware of a certain word form (sound) will, however, not be able to provide a witness for a sign type containing that form as value of the phon feature. This, in turn, can trigger clarification interaction.
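The classification judgement r : T can be sketched as follows (a simplification for illustration: records as Python dicts, record types as dicts of checking predicates; the repertoire is hypothetical):

```python
# A sketch of a TTR-style typing judgement r : T.
def of_type(record: dict, record_type: dict) -> bool:
    """r : T holds iff every field demanded by T is witnessed in r."""
    return all(label in record and check(record[label])
               for label, check in record_type.items())

repertoire = {"no", "yes"}  # hypothetical phonological repertoire

T = {"sevent": lambda v: isinstance(v, str),  # sevent : Ind (sound event)
     "c": lambda v: v in repertoire}          # c : Sign(sevent)

r = {"sevent": "no", "c": "no"}               # a concrete sound event
print(of_type(r, T))  # True: the soundwave is in the repertoire
```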
Tiers can be likened to different instruments on a musical score: a partitur.10 Building on Cooper (2015), we represent partiturs as strings of multimodal communication events e, which are temporally ordered sequences of types. One can think of strings in terms of a flip-book: a dynamic event is cut into slices, and each slice is modeled as a record type. Such string types (Cooper, in press; Fernando, 2007) are notated in round brackets and typed in an obvious manner, where RecType is the general type of a record type:

(4) partitur ≔ [ e : ( e_speech : Phon
                       e_gesture : Trajectory
                       e_gaze : RecType
                       e_head : headMove
                       e_face : faceExpr )+ ]

The progressive unfolding of sub-events on the various tiers in time gives rise to incremental production and perception. Formally, this is indicated by the Kleene plus ('+'). (4) shows the type of multi-tier signalling; it remains silent concerning potential inherent rhythms of the individual tiers. In fact, it has been argued that different kinds of gestures exhibit a specific 'rhythmic pulse' (Tuite, 1993), as does speech, which leads to tier-specific temporal production cycles that may jointly peak in synchronized intervals (Loehr, 2007). The temporal relationship between signals on different tiers is therefore specified in a relative way, following the example set by the Behavior Markup Language (Vilhjálmsson et al., 2007). It should be noted that the sub-events on partiturs can be made as detailed as needed—from phonetic features to complete sentences or turns. A reasonably fine-grained temporal resolution for partiturs seems to be the level of syllables. Arguably, syllables constitute coherent events just as tones in a melodic phrase and movement elements in locomotion do, events to which attentional processes are rhythmically attuned in the sense of Jones and Boltz (1989). See Lücking and Ginzburg (2020) for more details on parsing on partiturs. We will make crucial use of record type representations along these lines in the following.

10 We use the Italian word partitur (and its English plural variant) since in semantics the term score is already taken due to the work of Lewis (1979).
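A partitur can be sketched as a list of time slices, one record per step, with one field per tier of (4) (the slice contents are invented for illustration; real partiturs would carry typed sub-events and relative timing):

```python
# A partitur as a flip-book: a Kleene-plus string of multi-tier slices.
partitur = [
    {"speech": "do",   "gaze": "addressee", "head": "still", "face": "neutral"},
    {"speech": "you",  "gaze": "addressee", "head": "still", "face": "neutral"},
    {"speech": "want", "gaze": "cup",       "head": "still", "face": "brow raise"},
    {"speech": None,   "gaze": "addressee", "head": "shake", "face": "neutral"},
]

def tier(partitur: list, name: str) -> list:
    """Project a single tier out of the multi-tier string."""
    return [step.get(name) for step in partitur]

print(tier(partitur, "head"))    # ['still', 'still', 'still', 'shake']
print(tier(partitur, "speech"))  # ['do', 'you', 'want', None]
```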
3.2 Cognitive states in dialogue semantics

We model cognitive states by means of dialogue agent-specific Total Cognitive States (TCS) of KoS (Ginzburg, 1994, 2012; Larsson, 2002; Purver, 2006). A TCS has two partitions, namely a private and a public one. A TCS is formally represented in (5). In a dialogue between A and B there are both A.TCS and B.TCS.11

(5) TCS ≔ [ public : DGBType
            private : Private ]

(The symbol '≔' indicates a definition relation.)

11 We restrict attention here to two-person dialogue; for discussion of the differences between two-person and multi-party dialogue and how to extend an account of the former to the latter see Ginzburg (2012, section 8.1).
Now, trivially, communication events take place in some context. The simplest model of context, going back to Montague (1974), is one which specifies the existence of a speaker addressing an addressee at a particular time. This can be captured in terms of the type in (6), which classifies situations (records) that involve the obvious entities and actions.

(6) [ spkr : Ind
      addr : Ind
      u-time : Time
      c-utt : addressing(spkr,addr,u-time) ]

However, over the last four decades it has become clearer how much more pervasive reference to context in interaction is. Indeed, arguably, this traditional formulation gets things backwards in that it seems to imply that 'context' is some distinct component one refers to. In fact, as will become clear, following Barwise and Perry (1983), we take utterances—multimodal events—to be the basic units interlocutors assign contents to given their current cognitive states, and from this generalize to obtain utterance types, the meanings/characters semanticists postulate.
The visual situation is a key component of interaction from birth (see Tomasello, 1999, Chap. 3).12 Expectations arise due to illocutionary acts—one act (querying, assertion, greeting) giving rise to the anticipation of an appropriate response (answer, acceptance, counter-greeting), also known as adjacency pairs (Schegloff, 2007). Extended interaction gives rise to shared assumptions or presuppositions (Stalnaker, 1978), whereas issues of mutual understanding that remain to be resolved across participants—questions under discussion—are a key notion in explaining coherence and various anaphoric processes (Ginzburg, 1994, 2012; Roberts, 1996). These considerations, among several additional significant ones, lead to positing a significantly richer structure to represent each participant's view of publicized context, the dialogue gameboard (DGB), whose basic make-up is given in (7):

(7) DGBType ≔ [ spkr : Ind
                addr : Ind
                utt-time : Time
                c-utt : addressing(spkr,addr,utt-time)
                facts : Set(Prop)
                vis-sit = [ foa : Ind ∨ Rec ] : RecType
                pending : List(LocProp)
                moves : List(IllocProp)
                qud : poset(Question)
                mood : Appraisal ]

12 The importance of vision in the establishment of joint attention is affirmed by studies on the development of joint attention in congenitally blind infants (Bigelow, 2003). Blind children must rely on non-visual attention-getting strategies such as hearing or touching. As a consequence, they not only develop joint attention at later stages than sighted children, they also depend on their interlocutors to establish a common focus of attention—at least until the symbolic competence of speech is developed to a sufficient degree (Bigelow, 2003). Furthermore, it has been found in ERP studies that congenitally blind subjects (but not sighted ones) recruit posterior cortical areas for the processing of information relevant to an auditory attention task, and in a temporally ordered manner (Liotti et al., 1998). The authors of the study speculate that the observed topographical changes might be due to a 'reorganization in primary visual cortex' (p. 1011). With respect to the vis-sit field in KoS, this can be seen as evidence that at least some of the visual information is replaced by information from other tiers of the partitur. Hence, a corresponding formal model can in principle be devised, accounting for interactions with congenitally blind interlocutors, an issue brought up by an anonymous reviewer of Language and Cognition.
It should be emphasized (again) that there is not a single DGB covering a dialogical episode, but a DGB for each participant. Participants' DGBs are usually coupled, that is, they develop in parallel. Participant-specific DGBs, however, make it possible to incorporate misunderstandings, negotiation, coordination, and the like in a straightforward manner in KoS. In any case, facts represents the shared assumptions of the interlocutors—identified with a set of propositions. In line with TTR's general conception of (linguistic) classification as type assignment—record types regiment records—propositions are construed as typing relations between records (situations) and record types (situation types), that is, as Austinian propositions (Austin, 1950; Barwise and Etchemendy, 1987). More formally, propositions are records of type [sit : Rec, sit-type : RecType].13 The ontology of dialogue (Ginzburg, 2012) distinguishes two special sorts of Austinian proposition: grammar types classifying phonetic events (Loc(utionary)Prop(ositions)) and speech acts classifying utterances (Illoc(utionary)Prop(ositions)). Both types are part and parcel of locutionary and illocutionary interaction: dialogue moves that are in the process of being grounded or under clarification are the elements of the pending list; already grounded moves (roughly, moves which are not contested, or agreed-upon moves) are moved to the moves list. Within moves the first element has a special status given its use to capture adjacency pair coherence, and it is referred to as LatestMove. The current question under discussion is tracked in the qud field, whose data type is a partially ordered set (poset). Vis-sit represents the visual situation of an agent, including his or her visual focus of attention (foa), which, if any (attention may be directed towards something non-visual, even non-perceptual14), can be an object (Ind), or a situation or event (which in TTR are modeled as records, i.e., entities of type Rec). Mood tracks a participant's public displays of emotion (i.e., externally observable appraisal indicators such as intonation or facial expressions, which often do but need not coincide with the participant's internal emotional state), crucial for inter alia laughter, smiling, and sighing (Ginzburg et al., 2020b), and, as we shall see, head shaking as well. The DGB structure in (7) might seem like an overly rich notion for interlocutors to keep track of. Ginzburg and Lücking (2020) show how the DGB type can be recast as a Baddeley-style (Baddeley, 2012) multicomponent working memory model interfacing with long-term memory.

13 On this view, a proposition p = [sit = s, sit-type = T] is true iff s : T—the situation s is of the type T. Note that an incongruous situation type (inquired about by an anonymous reviewer) lacks any witnessing situations and therefore in model-theoretic terms has an 'empty' extension.

14 As is arguably the case in remembering and imagination (Irish, 2020; Werning, 2020).
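For concreteness, the DGB of (7) can be rendered as a data structure (a loose Python sketch; the field types are simplified stand-ins for the TTR types above):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Set

@dataclass
class DGB:
    """One participant's dialogue gameboard, mirroring (7)."""
    spkr: str                                         # current speaker
    addr: str                                         # current addressee
    utt_time: float                                   # utterance time
    facts: Set[str] = field(default_factory=set)      # shared assumptions
    foa: Optional[Any] = None                         # vis-sit: focus of attention
    pending: List[Any] = field(default_factory=list)  # ungrounded moves
    moves: List[Any] = field(default_factory=list)    # grounded moves
    qud: List[str] = field(default_factory=list)      # questions under discussion
    mood: str = "neutral"                             # public appraisal display

# One gameboard per participant; usually coupled, but free to diverge.
a_dgb = DGB(spkr="A", addr="B", utt_time=0.0)
b_dgb = DGB(spkr="A", addr="B", utt_time=0.0)
```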
Given that our signs (lexical entries/phrasal rules) are construed as types for interaction, they refer directly to the DGB via the field dgb-params. For instance, the linguistic meaning of the head shake from (1) in subsection 2.1 patterns with the lexical entry for 'No' when used as an answer particle to a polar question (a.k.a. a 'yes-no' question) and, following Tian and Ginzburg (2016), is given in (8).

(8) [ phon : no / shape : head shake
      dgb-params : [ spkr : Ind
                     addr : Ind
                     u-time : Time
                     c1 : addressing(spkr,addr,u-time)
                     p : Prop
                     MaxQUD = p? : PolarQuestion ]
      content = Assert(spkr,addr,u-time,NoSem(p)) : IllocProp ]

When used in the context of a polar question with content p (the current question under discussion—MaxQUD—is p?), saying 'No' and/or shaking the head asserts a 'No semantics' applied to p. NoSem(p) in turn is sensitive to the polarity of the proposition to which it applies—cf. the discussion of the head shake in subsection 2.1. To this end, positive (PosProp) and negative (NegProp) propositions have to be distinguished. If a negative particle (not, no, n't, never, nothing) is among the constituents of a proposition ¬p, then ¬p is of type NegProp (¬p : NegProp). The corresponding positive proposition, the one with the negative particle removed, is p (p : PosProp). With this distinction at hand, NoSem works as follows:

(9) NoSem(p) = ¬p, if p : PosProp
             = p,  if p : NegProp

(Note that the result of NoSem(p) is always of type NegProp—p : NegProp means that p = ¬q, which NoSem leaves unchanged according to the second condition in (9).) (8) and (9) provide a precise characterization of answer-particle uses of negation and head shake and therefore make testable predictions concerning meaning in context.
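NoSem's polarity sensitivity is easy to operationalize (a sketch: propositions are rendered as strings, and a leading 'not ' crudely stands in for membership in NegProp):

```python
# A sketch of NoSem from (9).
def is_neg_prop(p: str) -> bool:
    return p.startswith("not ")

def no_sem(p: str) -> str:
    """Negate a positive proposition, affirm a negative one;
    the result is always a negative proposition."""
    return p if is_neg_prop(p) else "not " + p

# Cf. (1): 'Do you want coffee?' vs. 'You don't want coffee?' + head shake
print(no_sem("you want coffee"))       # 'not you want coffee' (denial)
print(no_sem("not you want coffee"))   # 'not you want coffee' (confirmation)
```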
The evolution of context in interaction is described in terms of conversational rules, mappings between two cognitive states, the precond(ition)s and the effects. Two rules are given in (10): a DGB that satisfies preconds can be updated by effects.

(10) a. Assert QUD-incrementation: given a proposition p and Assert(A,B,p) being the LatestMove, one can update QUD with p? as MaxQUD.

preconds : [ p : Prop
             LatestMove = Assert(spkr,addr,p) : IllocProp ]
effects : [ QUD = ⟨p?, pre.QUD⟩ : poset(Question) ]

Example: the claim p = 'Carlsen will retain his title.' is asserted by the speaker. This leads to the question p? = 'Will Carlsen retain his title?' becoming the topmost question under discussion, waiting for the addressee to accept it (in that case p will be added to the set of propositions making up facts, see (7)) or to discuss it.

b. QSPEC: this rule—a formalization of Grice's maxim of relevance—characterizes the contextual background of reactive queries and assertions: if q is MaxQUD, then subsequent to this either conversational participant may make a move constrained to be specific to q (i.e., either About or Influencing q; for a formal characterization of Qspecific, see Ginzburg, 2012, section 4; further explications are given in section 5 below).

preconds : [ QUD = ⟨q, Q⟩ : poset(Question) ]
effects : [ r : Question ∨ Prop
            R : IllocRel
            LatestMove = R(spkr,addr,r) : IllocProp
            c1 : Qspecific(r,q) ]

Example: the question r? = 'Where is my box of chocolates?' has been posed by the speaker, that is, r? is MaxQUD. Now both the assertion p = 'In the cupboard.' (LatestMove = Assert(spkr,addr,p)) and the question q? = 'Where were you snacking from it last?' (LatestMove = Ask(spkr,addr,q?)) are q-specific wrt. r?, whereas a question such as w? = 'Have you already seen the new movie?' (LatestMove = Ask(spkr,addr,w?)) is not. The latter may therefore lead to other pragmatic interpretations, such as an attempt to change topics.
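The precondition/effect format of (10) amounts to guarded updates on a gameboard. A minimal sketch (gameboards as dicts; the q-specificity test is a placeholder for the formal Qspecific relation):

```python
# A sketch of the rules in (10) as guarded gameboard updates.
def assert_qud_incrementation(dgb: dict) -> dict:
    """(10a): if LatestMove asserts p, push p? onto QUD as MaxQUD."""
    kind, p = dgb["latest_move"]
    if kind != "Assert":                      # precondition fails
        return dgb
    return {**dgb, "qud": [p + "?"] + dgb["qud"]}

def qspec(dgb: dict, move: tuple, q_specific) -> dict:
    """(10b): admit a follow-up move only if it is specific to MaxQUD."""
    if dgb["qud"] and q_specific(move, dgb["qud"][0]):
        return {**dgb, "latest_move": move}
    raise ValueError("move is not q-specific to MaxQUD")

dgb = {"latest_move": ("Assert", "Carlsen will retain his title"), "qud": []}
dgb = assert_qud_incrementation(dgb)
print(dgb["qud"])  # ['Carlsen will retain his title?']

# A reactive move is admitted only if q-specific (placeholder test here).
dgb = qspec(dgb, ("Assert", "he will not"), lambda m, q: True)
print(dgb["latest_move"])  # ('Assert', 'he will not')
```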
Within the dialogue update model of KoS, following Ginzburg et al. (2020a), QUD gets modified incrementally, that is, at a word-by-word latency (or even finer).15 Technically, this can be implemented by adopting the predictive principle of incremental interpretation in (11) on top of partitur parsing (see subsection 3.1). This says that if one projects that the currently pending utterance (the preconditions in (11)) will continue in a certain way (pending.proj in (11)), then one can actually use this prediction to update one's DGB, concretely to update LatestMove with the projected move; this will, in turn, by application of the existing conversational rules, trigger an update of QUD:16

(11) Utterance Projection ≔
preconds : [ pending.sit-type.proj = a : Type ]
effects : [ e1 : Sign
            LatestMove = [ sit = e1, sit-type = a ] : LocProp ]

We will make use of utterance projection in analyzing head shakes synchronous with speech in subsection 4.2 below and in section 5 when explicating vertical relevance. Such projective rules implement predictive processing in interaction and therefore provide a computational underpinning of a central cognitive mechanism (Litwin and Miłkowski, 2020).

15 Ginzburg et al. (2020a) are motivated by data showing that unfinished utterances can trigger updates driving, e.g., elliptical phenomena like sluicing: He could bring the ball down, but opts to cushion a header towards . . . well, who exactly? Nobody there. (From a live match blog)

16 Since there are more and less likely hypotheses concerning the continuation of an ongoing utterance, utterance projection should ultimately be formulated in a probabilistic manner using a probabilistic version of TTR (Cooper et al., 2015). Instead of a single effect, a range of probabilistically ranked predictions is acknowledged, as is common in statistical natural language parsing (e.g., Demberg et al., 2013). Incremental and predictive processing underlies grammatical frameworks such as Dynamic Syntax from the outset (Gregoromichelaki et al., 2013; Kempson et al., 2001).
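Utterance projection (11) can be sketched as promoting a predicted continuation to LatestMove before the utterance is finished (the projection itself is a toy placeholder; the probabilistic ranking of footnote 16 is omitted):

```python
# A sketch of Utterance Projection (11).
def utterance_projection(dgb: dict, projected_type: str) -> dict:
    """Promote the projected type of the pending utterance to LatestMove,
    so conversational rules (and QUD updates) can fire early."""
    locutionary_prop = {"sit": dgb["pending"], "sit_type": projected_type}
    return {**dgb, "latest_move": locutionary_prop}

dgb = {"pending": "He could bring the ball down, but opts to cushion a header towards ...",
       "latest_move": None}
dgb = utterance_projection(dgb, "Assert(spkr, addr, p)")
print(dgb["latest_move"]["sit_type"])  # the projected move now drives QUD updates
```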
4 Polyphonic interaction: Cognitive-formal analyses
The formal tools from section 3 are used to provide precise analyses of the observations from section 2: attention and communicative breakdown (subsection 4.1), and the semantics of head shake (subsection 4.2).
4.1 Conversational engagement
In two-person conversation the values of spkr and addr of a DGB are rarely in question, apart from initially (Who is speaking? Are you addressing me?), but the need to verify that the addressing condition holds is what we take to drive attention monitoring. We conceive of the two states of being engaged or disengaged in conversation as two hypotheses in a probabilistic Bayesian framework. Relevant data for the (dis-)engagement hypotheses can be found in gaze, which is an excellent predictor of conversational attention (Nummenmaa and Calder, 2009; Vertegaal et al., 2001). The quoted sources as well as the discussion in the following concern unobstructed face-to-face dialogue, that is, dialogue where participants stand or sit opposite each other and can talk freely. The findings and the assumptions derived below do not carry over to 'obstructed' discourse situations simpliciter, for instance, when interlocutors are talking while carrying a piano.

Within cognitive DGB modeling, the vis-sit field already provides an appropriate data structure for gaze. Mutual gaze can be formulated as a perspectival default condition on partiturs.17 Of course, there is no claim that mutual gazing occurs continuously. Indeed, continuous gaze is often viewed as rude or encroaching. In fact, mutual gaze tends to be short, often less than a second (Kendon, 1967).

17 Conditions or rules are perspectival if they are applicable only to particular dialogue participants; see Ginzburg et al. (2020b, §4.1.2) for a first use of 'participant sensitive' conversational rules.
Gaze is not the only attentional signaling system, however. Dialogue agents regularly provide verbal and non-verbal feedback signals (Bavelas and Gerwing, 2011). Among the verbal reactive tokens (Clancy et al., 1996) the majority are backchannels. As with gaze, a lack of backchannelling will result in communicative breakdown. In sum, there is ample evidence that gazing and backchannelling provide important data points for tracking (mutual) attention. We combine both into a probabilistic framework along the following lines:

(12) Bayes' attention hypotheses and data
a. H = {H1 = being engaged, H2 = being disengaged}
b. D = {D1 = individual gaze, D2 = mutual gaze, D3 = gaze away, D4 = backchannel, D5 = no backchannel}
We assume that gazing provides slightly more attentional evidence than backchannelling, by a proportion of 0.6 to 0.4. We derive the likelihoods for gaze under H1 from Argyle (1988, 159); the likelihoods for gaze under H2 are stipulated, as are the backchannel probabilities. Furthermore, we assume that engagement is the probabilistic default case of interaction, with a plausibility of 0.8 to 0.2:

(13) P(H1) = 0.8: P(D1|H1) = 0.36, P(D2|H1) = 0.18, P(D3|H1) = 0.06, P(D4|H1) = 0.3, P(D5|H1) = 0.1
     P(H2) = 0.2: P(D1|H2) = 0.15, P(D2|H2) = 0.05, P(D3|H2) = 0.4, P(D4|H2) = 0.05, P(D5|H2) = 0.35
If one of the kinds of data from D is observed, the posterior probability can be calculated from the probabilities in (13) by means of a Bayesian update according to Bayes' theorem: P(H|D) = P(D|H)P(H)/P(D). Let us illustrate an update triggered by an observation of individual gaze, D1. Compared to the prior probabilities of the engagement and disengagement hypotheses, D1 leads to an increase of the probability of H1 at the expense of H2. The corresponding numerical values are collected in Table 1.
Table 1: Bayesian update table

hypothesis H   prior P(H)   likelihood P(D1|H)   Bayes numerator P(D1|H)P(H)   posterior P(H|D1)
H1             0.8          0.36                 0.288                         0.906
H2             0.2          0.15                 0.03                          0.094
The change in the posteriors in comparison to the priors shows that the already more probable engagement hypothesis gains further plausibility (increasing from 0.8 to 0.9). Hence, observing individual gaze, D1, supports (the public display of) mutual attention. Bayesian updates apply iteratively: in this way, only a mixture of data observations of different kinds leads to an oscillation of H1 within a certain probability interval. This leads us to a testable hypothesis, namely that the extrema of the oscillation interval constitute thresholds of mutual attention. If the engagement posterior takes a value below the minimum of the interval, it triggers attention clarification: Are you with me? Values that exceed the maximum lead to irritation: Why are you staring at me?
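The update in Table 1, and its iterated application, can be reproduced with a few lines of Python (priors and likelihoods as given above; the clarification threshold of 0.5 is purely illustrative):

```python
# Iterated Bayesian engagement tracking per (12)-(13).
priors = {"engaged": 0.8, "disengaged": 0.2}
likelihoods = {
    "engaged":    {"D1": 0.36, "D2": 0.18, "D3": 0.06, "D4": 0.30, "D5": 0.10},
    "disengaged": {"D1": 0.15, "D2": 0.05, "D3": 0.40, "D4": 0.05, "D5": 0.35},
}

def bayes_update(belief: dict, datum: str) -> dict:
    """P(H|D) = P(D|H)P(H) / P(D), computed for both hypotheses."""
    numerators = {h: likelihoods[h][datum] * p for h, p in belief.items()}
    evidence = sum(numerators.values())
    return {h: n / evidence for h, n in numerators.items()}

belief = bayes_update(priors, "D1")        # individual gaze observed
print(round(belief["engaged"], 3))         # 0.906, as in Table 1

for datum in ["D3", "D5"]:                 # gaze away, then no backchannel
    belief = bayes_update(belief, datum)
    if belief["engaged"] < 0.5:            # illustrative threshold
        print("Attention clarification: Are you with me?")
```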
4.2 Head shake and noetics
In section 2, the exchange repeated here in (14) was introduced as an obstacle for the No-semantics of head shakes introduced in section 3.
(14) a. I don’t believe you.
head shake
b. ? I believe you.
head shake
If we make the (rather consensual) assumption that the outcomes of utterances are predicted as soon as possible (see section 3, in particular example (11)), then an explanation of (14) is straightforward: A's utterance in (14a) provides a negative proposition, ¬believe(A,B), which by NoSem the head shake affirms. On the other hand, (14b) provides a positive proposition, believe(A,B), which by the same lexical entry the head shake negates; hence a contradiction ensues.

The contradiction in (14b) can be ameliorated, however:

(15) (Context: Claims that B stole 500€)
a. B: They say I stole the money. But I didn't.
b. A: I believe you.
       head shake

In this case, one can understand A as verbally expressing his belief in B's protestation of innocence, whereas the head shake affirms the negative proposition B makes, ¬stole(B, 500€) (when related to the second sentence uttered by B), or expresses that A is upset about what 'they' did (when related to B's initial sentence). In either case, this requires us to assume that the head shake can be dissociated from speech that is simultaneous with it, an assumption argued for in some detail with respect to speech-laughter by Mazzocconi et al. (2020). Such observations are of great importance for a multimodal theory. This is because it has been claimed that multi-tier interpretation is guided by the heuristic 'if multiple signs occur simultaneously, take them as one' (Enfield, 2009, 9). Such heuristics have to be refined in consideration of the above evidence.18
Examples like the head shake in (15b)—which can be glossed 'I disapprove of p'—are therefore subsumed under a 'negative appraisal use' of negation (Tian and Ginzburg, 2016) by Lücking and Ginzburg (2021), and analyzed as a noetic act expressing a speaker's attitude towards the content of his or her speech via the DGB's mood field.19 Note, finally, that A's response in (15) can be serialized as head shake followed by speech (i.e., head shake + 'I believe you'). However, the sequence 'I believe you' + head shake seems to be a bit odd, illustrating a remark concerning the multimodal serialization hypothesis we made in section 1, namely that sequential orderings need not be equivalent. Such temporal effects need to be explored further in future studies.

18 As pointed out by an anonymous reviewer, Enfield's heuristics can be understood more loosely along the lines of 'if multiple signs occur simultaneously, interpret them in relation to one another'. Since Enfield does not provide a semantics, there remains some leeway for interpretation. The semantic and pragmatic synchrony rules stated by McNeill (1992) are more explicit in this respect ('[. . . ] speech and gesture, present the same meanings at the same time', p. 27; '[. . . ] if gestures and speech co-occur they perform the same pragmatic functions', p. 29).

19 The term 'noetic' is inspired by William James (James, 1981, Chap. XXV), who emphasized, for instance, that '[i]nstinctive reactions and emotional expressions thus shade imperceptibly into each other' (p. 1058). In this sense, noetics describes how feelings, sentiments, sensations, memories, emotions, and unconscious acts bear on and are transmitted through a feedback loop of thinking and knowledge (Krader, 2010). We believe that emphasizing the inherent integration of appraisal and content, among others, is a useful way of conceiving of attitudes in conversation.
5 Upshot: From 'horizontal' to 'vertical' relevance in multimodal dialogue
In uni-modal interaction (best exemplified perhaps by chat conducted sequentially between users across a network) conversation is constrained by relevance or coherence between successive participant moves (and ultimately across longer stretches). For reasons related to our metaphor of musical notation (cf. partiturs) we call this notion horizontal relevance.

Some examples of relevant (indicated by '✓') responses to a query and to an assertion are given in (16a,b), and irrelevant ones (indicated by '#') to both in (16c).

(16) a. A: Is that chair new? B: ✓ Yes / ✓ It's a Louis XIV replica / ✓ New?
b. A: Jill arrived late last night. B: ✓ She did not. / ✓ Why? / ✓ Jill? / ✓ To spite us.
c. B: # Tomorrow / # Please insert your card / # The train.
For conversation, the query/response relation is the one studied in greatest detail (Berninger and Garvey, 1981; Ginzburg et al., 2022; Stivers and Enfield, 2010). The basic characterization of this relation given in Ginzburg et al. (2022) is that the class of responses to a question q1 can be partitioned into three classes.

(17) a. q(uestion)-specific: responses directly about or subquestions of q1;
b. MetaCommunicative: responses directly about or subquestions of a question defined in part from the utterance of q1;
c. Evasion: responses directly about or subquestions of a question that is distinct from q1 and arises from some other component of the context:
1. Ignore (address the situation, but not the question; e.g., Anon: on the Sunday before you killed the animals, you didn't in fact feed them. Why was that? Harry: Only water. (BNC));
2. Change the topic (e.g., Nicola: Come on, let's get dressed. Which pants are you wearing? Oliver: What's he got on his mouth? (BNC));
3. Motive ('Why do you ask?');
4. Difficult to provide a response ('I don't know').

A formal account of horizontal relevance in terms of conversational rules is given in Ginzburg (2012, §§4.4.5, 6.7.1). The basic idea is that an utterance u is relevant in the current context iff u can be integrated as the (situational component of the) LatestMove via some conversational rule.
But how does the sequential notion of horizontal relevance relate to simultaneous interaction
on partiturs, that is, to vertical relevance (to stick to the basic metaphor)? We believe that vertical
relevance is supervenient on horizontal relevance. To the best of our knowledge, a careful study,
either experimental or corpus-based, of vertical dialogical relevance has yet to be undertaken,
apart from one subclass of cases involving speech, known as overlaps and interruptions, to which
we return in our discussion below. We offer an initial, partial, and impressionistic characterization
of the notion of vertical relevance in Table 2.
Table 2: Vertical relevance: possible content relations between overlapping utterances across two speakers

uA.cont | uB.cont | relation | examples
p | ¬p | Negation | B head shake/speech (Kendon, 2002)
p | ¬p | Disbelief | B laugh (Ginzburg et al., 2020b)
p | p | Agreement | B head nod/speech (Hadar et al., 1985); B low-arousal laugh
p | prob(p) < θ | Doubt | B head tilt (Heylen, 2008)
— | Understand(B,uA) | Acknowledgment | B mild nod (Hadar et al., 1985)
— | ¬Understand(B,uA) | Clarification request | B frown and head back/speech: what? (Poggi, 2001)
p | find_disgusting(B,p) | Negation, disgust | B 'Not face' with action units AU9, AU10, AU17 (Benitez-Quiroz et al., 2016); other faces are also discussed there
— | disengaged(B,uA) | Incapacity, powerlessness, indetermination, indifference, obviousness | B shrug + rotating forearms outwards with hands in 'palm up' position + mouth closed, lips pulled downwards (potentially combined with eyebrow raising and head tilt; Debras, 2017)
— | presupposes ¬WishDiscuss(B,uA) | Topic-changing, interruptions | simultaneous speech (Bennett, 1978; Hilton, 2018)
[sit = s, sit-type = T1] | [sit = s, sit-type = T2] | Shared situation assessments | (Falk, 1980; Goodwin and Goodwin, 1992)
Rel(A,B) | CounterRel(B,A) | Chordal greetings and partings | (Schegloff, 2000)
Table 2 offers a selection of signals/contents that a non-leading voice B can express simultaneously relative to a leading voice A (speaking in terms of turn-replacements, not in terms of subjectively assumed importance—cf. section 1). Note that two cases can be distinguished. The first case involves a single speaker, for whom certain signals from the multimodal utterance may take the leading voice over other ones. A natural leading voice is speech (de Ruiter, 2004). Co-leading or accompanying roles of non-verbal signals can be assigned in relation to speech. In this respect, at-issue (≈ co-leading) and non-at-issue (≈ accompanying) uses of co-verbal manual gestures have been distinguished (Ebert, 2014).

The second case concerns the distribution of voices among several interlocutors. Inhabiting a leading or an accompanying role is rooted in processes of utterance projection (11) and incremental QUD construction, as we discuss in more formal detail below. We assume that the interlocutor who is responsible for publicly constructing the initial QUD—a process which (by the first case above) can be multimodal or even nonverbal itself—has/is the leading voice. We think that the classic notion of turn holder dissolves into the notion of leading voice. Accompanying voices are characterized by monitoring the incremental QUD construction and commenting on it—in ways exemplified in Table 2. In the most trivial case this consists in providing backchannelling, but it may also involve the joint production of an utterance (in which case it could be argued that the accompanying voice becomes a co-leading voice).
The final class we mention is one that has been, in certain respects, much studied, namely simultaneous speech. This is a somewhat controversial area because whereas the 'normativity' of one speaker using speech and another producing a non-verbal signal is not in question, the normativity of the corresponding case where both participants use speech is very much in question. This is so given the notion of turn and the rule-based system which interlocutors are postulated to follow in the highly influential account of Sacks et al. (1974). This system is based on the assumption that normatively at any given time there should be a single speaker; deviations are 'performance errors', either unintentional overlaps or one interlocutor interrupting, attempting to gain the floor. The set-up we have provided does not predict any sharp contrast between non-speech/speech overlap and speech/speech overlap, though this could in principle be enforced by introducing conversational rules privileging the speech tier. Nonetheless, we do not think such a strategy is promising. Rather, there are other explanatory factors which conspire to suppress pervasive overlap. In a study of the multilingual CallHome corpus, Yuan et al. (2007) note that overlapping varies across languages, with significantly more (non-backchannel) overlaps in Japanese than in the other languages they study (Arabic, English, German, Mandarin, Spanish); they also find that males and females make more overlaps when talking to females than to males, and similarly find more overlaps when talking with familiars than with strangers. Tannen (1984) argues for the existence of distinct conversational styles, including a high-involvement style that favors a fast delivery pace, cooperative overlaps, and minimal gaps, contrasting with a dichotomous high-considerateness style. Hilton (2018) conducted a study which found statistically significant correlations between a subject's conversational style preference and their assessment of the acceptability of overlaps. All this argues against viewing the avoidance of overlap as a fundamental, systematic organizing principle.
Can we say anything systematic, based on subject matter, about cases where overlap seems to be acceptable? There is no dearth of evidence for such cases, going back to Bennett (1978), Falk (1980), Goodwin and Goodwin (1992), and indeed Schegloff (2000), who, while defending the basic intuition underlying Sacks et al. (1974), lists various cases of acceptable overlaps. We mention several subclasses: the first involves what we dub, following Goodwin and Goodwin (1992), shared situation assessments. Examples of this are given in (18a, b, d); in all three cases a single situation is being described. A second class, noted by Schegloff (2000), consists of symmetric moves like greetings, partings, and congratulations ('we won!' 'Yay!' etc.). A third class is exemplified by the attested (18c)—cases where the same question is being addressed; additional instances of this, noted by Schegloff (2000), are utterances involving self-addressed questions (Tian et al., 2017) and 'split utterances'—utterances started by A and completed by B (Goodwin, 1979; Gregoromichelaki et al., 2011; Lerner, 1988; Poesio and Rieser, 2010).
(18) a. B: y'know where they're separate and they do differently things and we're doing this and there's a y'know we operate in a vacuum
C: Mhm, yeah you choose the part you want. And you choose what you want.
(Bennett, 1978, example (2))20

b. O: You don't seem too enthusiastic about it
J: well it was a great trip yeah except that it was a foggy day and we . . .
R: it was a good trip yeah it was yeah
(Falk, 1980, example II)

c. Paul: Tell y- Tell Debbie about the dog on the ((smile intonation begins)) golf course t'day
Eileen: eh hnh hnh hah! ha! Heh Heh! *hh hh *h
Paul: hih hih
Eileen: Paul en I got ta the first green, (0.6)
Eileen: *hh An this beautiful, ((swallow))
Paul: Irish Setter ((reverently))
Eileen: Irish Setter
Debbie: Ah:::,
Eileen: Came tearin up on ta the first=
Paul: Oh it was beautiful
Eileen: =gree(h)n an tried ta steal Pau(h)l's go(h)lf ball. *hh
(Goodwin and Goodwin, 1992, example (1))

d. M: How old was he?
D: Not very old
J: Very old
D: No, not that old.

20 'I would like to think of discourse as not so much an exchange but as a shared world that is built up through various modes of mutual response over the course of time in particular interaction.' (Bennett, 1978, p. 574)
Our assumption throughout has been that vertical relevance supervenes on horizontal relevance—
what we labelled earlier the multimodal serialization hypothesis. We adopt this assumption since,
at least on the basis of Table 2, all polyphonic utterances seem to have sequential manifestations
which give rise to equivalent contents; such cases, nonetheless, do lead to distinct DGBs since
the partiturs in the two cases are distinct. On the other hand, we believe that there exist sequential
adjacency pairs that do not have polyphonic manifestations which give rise to equivalent contents:
turn-assigning moves, such as those arising by using the assignee’s name or via gaze, do not have
a polyphonic equivalent.
Assuming supervenience to hold, we derive vertical relevance from conversational rules by
applying incrementalization. In other words, given two conversational rules CR1 and CR2 that
can apply in sequence, where A holds the turn as a consequence of CR1 and the turn is exchanged
in CR2, an overlap arises if, by means of incremental interpretation, B finds herself in a DGB
to which CR2 is applicable before the move effected via CR1 is complete. To make this concrete,
A asserting p and B discussing whether p is the case can be explicated in terms of the sequence
of Assert QUD-incrementation and QSPEC (see (10)). Incrementalizing this involves B using
Assert QUD-incrementation before A has completed their utterance, which then satisfies the
preconditions of QSPEC. In such a case, as discussed above, A is the ‘leading voice’ and B is an
‘accompanying voice’.
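As a rough illustration of this control flow, consider the following sketch, a toy model under our
own simplifying assumptions (the DGB fields, the rule bodies, and the two-word projection point
are invented stand-ins for the TTR-based definitions of KoS):

from dataclasses import dataclass, field

@dataclass
class DGB:
    """Toy dialogue gameboard: current speaker, questions under discussion."""
    speaker: str = "A"
    qud: list = field(default_factory=list)

def assert_qud_incrementation(dgb: DGB, prefix: list) -> DGB:
    """Toy Assert QUD-incrementation: once A's assertion of p is projectable
    from the utterance prefix, push 'whether p' onto QUD."""
    return DGB(speaker=dgb.speaker,
               qud=["whether " + " ".join(prefix)] + dgb.qud)

def qspec_applicable(dgb: DGB) -> bool:
    """Toy QSPEC precondition: some question is maximal in QUD."""
    return bool(dgb.qud)

utterance = "Bo left the party early".split()
dgb = DGB()
for i in range(1, len(utterance) + 1):
    prefix = utterance[:i]
    # Assume (purely for illustration) that incremental interpretation
    # projects the assertion after two words have been heard:
    if i == 2:
        dgb = assert_qud_incrementation(dgb, prefix)
    if qspec_applicable(dgb) and i < len(utterance):
        print(f"After {i} words B may already address the QUD: overlap licensed")
        break

The point is only the shape of the computation: the preconditions of the second rule become
satisfiable mid-utterance, and that is exactly where an accompanying voice can enter while A
remains the leading voice.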
All this means that, to the extent that the conversational rules underlying horizontal relevance
ensure the coherence of dialogue, the same applies to dialogue with polyphonic utterances. Given
this, incrementalizing conversational rules provides a detailed model of coherence-driven, pre-
dictive processing in natural language interaction. In particular, it makes a testable prediction:
accompanying behaviour commenting on a leading voice (examples of which are collected in
Table 2) is expected to begin before the leading voice has finished its contribution.
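This prediction can be operationalized over time-aligned annotations. A minimal sketch,
assuming a hypothetical (tier, onset, offset) tuple format with made-up timings (on real corpora,
leading and accompanying events would of course have to be paired per exchange):

# Hypothetical time-aligned annotations (tier label, onset in s, offset in s):
events = [
    ("leading", 0.0, 2.4),       # A's spoken assertion
    ("accompanying", 1.6, 2.1),  # B's overlapping head shake or comment
]

leading = [e for e in events if e[0] == "leading"]
accompanying = [e for e in events if e[0] == "accompanying"]

# Prediction: each accompanying comment starts before the leading voice ends.
confirmed = sum(
    1 for (_, a_start, _) in accompanying
    for (_, _, l_end) in leading
    if a_start < l_end
)
print(f"{confirmed}/{len(accompanying)} accompanying onsets precede leading offset")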
6 Conclusions
We have outlined a unified framework for describing multimodal dialogical interaction. We
have shown that only minor adjustments to an existing dialogue framework, KoS, which provides
richly structured cognitive states and conversational rules, are needed to analyze multimodal
phenomena: (i) partiturs, representations of multimodal events, and (ii) an incremental semantic
framework.
• We demonstrated the existence of noetic head shakes whose contents are dissociated from
simultaneous speech; such dissociation had previously been demonstrated for laughter.
• We offered a testable, quantitative account of mutual gaze repair and backchannelling driven
by monitoring of participant roles: insufficient attention leading to clarification requests,
excessive attention to complaints.
• We argued that ‘no overlap’ is not a defensible norm in multimodal interaction,
including in cases where the two tiers involve speech. The intrinsically sequential notion
of turn should be replaced by notions such as leading and accompanying voice, which are
driven by vertical coherence.
On the more basic level of theory design, the observations we discussed all exemplify the
need for analytic semantic tools within the systemic landscape of cognitive science. We argued
that a dynamic dialogue semantics embodies a cognitively potent, formally precise linguistic
framework for fostering cross-talk between the disciplines.
As is frequently pointed out but cannot be overemphasized, an important goal of formalization
in linguistics is to enable subsequent researchers to see the defects of an analysis as
clearly as its merits; only then can progress be made efficiently. (Dowty, 1979, p. 322)
The issues of timing and coherence, as captured in notions such as leading voice and vertical
relevance, have been identified as specific topics within multimodal dialogue semantics.
Acknowledgments
We wish to thank Judith Holler, two anonymous reviewers for Language and Cognition, Robin
Cooper, Mark Liberman, Chiara Mazzocconi, and Hannes Rieser, for comments on earlier
versions of this paper. Portions of this paper have been presented at the 2021 Dialogue, Memory,
and Emotion workshop in Paris, at seminars in Bochum and Saarbrücken, at the Padova Summer
School on Innovative Tools in the Study of Language, and at the 2022 ESSLLI summer school
in Galway. We wish to thank audiences there for their comments.
Funding statement
This work is supported by a public grant overseen by the French National Research Agency
(ANR) as part of the program ‘Investissements d’Avenir’ (reference: ANR-10-LABX-0083). It
contributes to the IdEx Université Paris Cité – ANR-18-IDEX-0001.
References
Michael Argyle. Bodily Communication. Routledge, London and New York, 2nd edition, 1988.
John L. Austin. Truth. In Proceedings of the Aristotelian Society, Supplementary Volume XXIV,
pages 111–128, 1950. Reprinted in John L. Austin: Philosophical Papers, 2nd edition, Oxford:
Clarendon Press, 1970.
Alan Baddeley. Working memory: Theories, models, and controversies. Annual Review of
Psychology, 63:1–29, 2012. doi: 10.1146/annurev-psych-120710-100422.
Jon Barwise and John Etchemendy. The Liar. Oxford University Press, New York, 1987.
Jon Barwise and John Perry. Situations and Attitudes. Bradford Books. MIT Press, Cambridge,
1983.
Janet B. Bavelas and Jennifer Gerwing. The listener as addressee in face-to-face dialogue. International
Journal of Listening, 25(3):178–198, 2011. doi: 10.1080/10904018.2010.508675.
C. Fabian Benitez-Quiroz, Ronnie B. Wilbur, and Aleix M. Martinez. The Not Face: A
grammaticalization of facial expressions of emotion. Cognition, 150:77–84, 2016. doi:
10.1016/j.cognition.2016.02.004.
Adrian Bennett. Interruptions and the interpretation of conversation. In Annual Meeting of the
Berkeley Linguistics Society, volume 4, pages 557–575, 1978.
Ginger Berninger and Catherine Garvey. Relevant replies to questions: Answers versus evasions.
Journal of Psycholinguistic Research, 10(4):403–420, 1981.
Mark H. Bickhard. Is embodiment necessary? In Paco Calvo and Toni Gomila, editors, Handbook
of Cognitive Science: An Embodied Approach, Perspectives on Cognitive Science, chapter 2,
pages 29–40. Elsevier, San Diego, CA, 2008.
Ann E. Bigelow. The development of joint attention in blind infants. Development and
Psychopathology, 15(2):259–275, 2003. doi: 10.1017/s0954579403000142.
Berit Brogaard. What can neuroscience tell us about reference? In Barbara Abbott and Jeanette
Gundel, editors, The Oxford Handbook of Reference, pages 365–383. Oxford University Press,
Oxford, 2019. doi: 10.1093/oxfordhb/9780199687305.013.17.
Patricia M. Clancy, Sandra A. Thompson, Ryoko Suzuki, and Hongyin Tao. The conversational
use of reactive tokens in English, Japanese, and Mandarin. Journal of Pragmatics, 26(3):
355–387, 1996. doi: 10.1016/0378-2166(95)00036-4.
Herbert Clark. Using Language. Cambridge University Press, Cambridge, 1996.
Louise Connell. What have labels ever done for us? The linguistic shortcut in conceptual
processing. Language, Cognition and Neuroscience, 34(10):1308–1318, 2019. doi: 10.1080/
23273798.2018.1471512.
Richard P. Cooper and David Peebles. Beyond single-level accounts: The role of cognitive
architectures in cognitive scientific explanation. Topics in Cognitive Science, 7(2):243–258,
2015. doi: 10.1111/tops.12132.
Robin Cooper. Type theory, interaction and the perception of linguistic and musical events. In
Martin Orwin, Christine Howes, and Ruth Kempson, editors, Language, Music and Interaction,
pages 67–90. College Publications, 2015.
Robin Cooper. Representing types as neural events. Journal of Logic, Language and Information,
28(2):131–155, 2019.
Robin Cooper. From perception to communication: An analysis of meaning and action using a
theory of types with records (TTR). Oxford University Press, Oxford, in press.
Robin Cooper and Jonathan Ginzburg. Type theory with records for natural language semantics.
In Shalom Lappin and Chris Fox, editors, The Handbook of Contemporary Semantic Theory,
chapter 12, pages 375–407. Wiley-Blackwell, 2nd edition, 2015.
Robin Cooper, Simon Dobnik, Staffan Larsson, and Shalom Lappin. Probabilistic type theory
and natural language semantics. Linguistic Issues in Language Technology, 10, 2015. URL
https://aclanthology.org/2015.lilt-10.4.
Jérôme Daltrozzo and Daniele Schön. Conceptual processing in music as revealed by N400
effects on words and musical targets. Journal of Cognitive Neuroscience, 21(10):1882–1892,
2009. doi: 10.1162/jocn.2009.21113.
Jan Peter de Ruiter. On the primacy of language in multimodal communication. In Proceedings
of the Workshop on Multimodal Corpora, pages 38–41, Lisbon, 2004.
Camille Debras. The shrug: Forms and meanings of a compound enactment. Gesture, 16(1):
1–34, 2017. doi: 10.1075/gest.16.1.01deb.
Vera Demberg, Frank Keller, and Alexander Koller. Incremental, predictive parsing with
psycholinguistically motivated tree-adjoining grammar. Computational Linguistics, 39(4):1025–
1066, 2013.
David R. Dowty. Word Meaning and Montague Grammar. Reidel, Dordrecht, 1979.
Alessandro Duranti. Polyphonic discourse: Overlapping in Samoan ceremonial greetings. Text
– Interdisciplinary Journal for the Study of Discourse, 17(3):349–382, 1997.
Cornelia Ebert. The non-at-issue contributions of gestures. Workshop on Demonstration and
Demonstratives, Stuttgart, April 11–12, 2014.
Nick J. Enfield. The Anatomy of Meaning: Speech, Gesture, and Composite Utterances. Number 13
in Language, Culture and Cognition. Cambridge University Press, 2009.
Jane Falk. The conversational duet. In Annual Meeting of the Berkeley Linguistics Society,
volume 6, pages 507–514, 1980.
Tim Fernando. Observing events and situations in time. Linguistics and Philosophy, 30(5):
527–550, 2007. doi: 10.1007/s10988-008-9026-1.
Fernanda Ferreira. Psycholinguistics, formal grammars, and cognitive science. The Linguistic
Review, 22(2-4):365–380, 2005. doi: 10.1515/tlir.2005.22.2-4.365.
Steven M. Frankland and Joshua D. Greene. Concepts and compositionality: In search of
the brain’s language of thought. Annual Review of Psychology, 71(1):273–303, 2020. doi:
10.1146/annurev-psych-122216-011829.
Riccardo Fusaroli, Nivedita Gangopadhyay, and Kristian Tylén. The dialogically extended mind:
Language as skilful intersubjective engagement. Cognitive Systems Research, 29-30:31–39,
2014. doi: 10.1016/j.cogsys.2013.06.002.
Alan Garnham. Models of processing: discourse. WIREs Cognitive Science, 1(6):845–853,
2010. doi: 10.1002/wcs.69.
Jonathan Ginzburg. An update semantics for dialogue. In H. Bunt, editor, Proceedings of the 1st
International Workshop on Computational Semantics. Tilburg University, 1994.
Jonathan Ginzburg. The Interactive Stance: Meaning for Conversation. Oxford University Press,
2012.
Jonathan Ginzburg and Andy Lücking. On laughter and forgetting and reconversing: A
neurologically-inspired model of conversational context. In Proceedings of The 24th Workshop
on the Semantics and Pragmatics of Dialogue, SemDial/WatchDial, 2020.
Jonathan Ginzburg, Robin Cooper, Julian Hough, and David Schlangen. Incrementality and
HPSG: Why not? In Anne Abeillé and Olivier Bonami, editors, Constraint-Based Syntax and
Semantics: Papers in Honor of Danièle Godard. CSLI Publications, 2020a.
Jonathan Ginzburg, Chiara Mazzocconi, and Ye Tian. Laughter as language. Glossa, 5(1):104,
2020b. doi: 10.5334/gjgl.1152.
Jonathan Ginzburg, Zulipiye Yusupujiang, Chuyuan Li, Kexin Ren, Aleksandra Kucharska, and
Paweł Łupkowski. Characterizing the response space of questions: data and theory. Dialogue
and Discourse (forthcoming), 2022.
Charles Goodwin. The interactive construction of a sentence in natural conversation. In George
Psathas, editor, Everyday Language: Studies in Ethnomethodology, pages 97–121. Irvington
Publishers, New York, 1979.
Charles Goodwin and Marjorie Harness Goodwin. Assessments and the construction of context.
Rethinking context: Language as an interactive phenomenon, 11:147–190, 1992.
Eleni Gregoromichelaki, Ruth Kempson, Matthew Purver, Gregory J. Mills, Ronnie Cann,
Wilfried Meyer-Viol, and Patrick G. T. Healey. Incrementality and intention-recognition in
utterance processing. Dialogue and Discourse, 2(1):199–233, 2011. doi: 10.5087/dad.2011.109.
Eleni Gregoromichelaki, Ronnie Cann, and Ruth Kempson. On coordination in dialogue:
Subsentential speech and its implications. In Laurence Goldstein, editor, Brevity, chapter 3, pages
53–73. Oxford University Press, Oxford, UK, 2013.
Uri Hadar, Timothy J Steiner, and F Clifford Rose. Head movement during listening turns in
conversation. Journal of Nonverbal Behavior, 9(4):214–228, 1985.
Fritz Hamm, Hans Kamp, and Michiel Van Lambalgen. There is no opposition between formal
and cognitive semantics. Theoretical Linguistics, 32(1):1–40, 2006.
Barbara R Hanning. Conversation and musical style in the late eighteenth-century Parisian Salon.
Eighteenth-Century Studies, 22(4):512–528, 1989.
Uri Hasson, Asif A. Ghazanfar, Bruno Galantucci, Simon Garrod, and Christian Keysers. Brain-
to-brain coupling: a mechanism for creating and sharing a social world. Trends in Cognitive
Sciences, 16(2):114–121, 2012. doi: 10.1016/j.tics.2011.12.007.
Irene Heim. The Semantics of Definite and Indefinite Noun Phrases. PhD thesis, University of
Massachusetts Amherst, 1982.
Dirk Heylen. Listening heads. In Modeling Communication with robots and virtual humans,
pages 241–259. Springer, 2008.
Katherine Hilton. What Does an Interruption Sound Like? PhD thesis, Stanford University,
2018.
Judith Holler and Stephen C. Levinson. Multimodal language processing in human
communication. Trends in Cognitive Sciences, 23(8):639–652, 2019. doi: 10.1016/j.tics.2019.05.006.
John E. Hummel. Getting symbols out of a neural architecture. Connection Science, 23(2):
109–118, 2011. doi: 10.1080/09540091.2011.569880.
Muireann Irish. On the interaction between episodic and semantic representations – constructing
a unified account of imagination. In Anna Abraham, editor, The Cambridge Handbook of the
Imagination, Cambridge Handbooks in Psychology, pages 447–465. Cambridge University
Press, Cambridge, 2020. doi: 10.1017/9781108580298.027.
William James. The Principles of Psychology. Harvard University Press, 1981. The Harvard
edition of The Principles of Psychology, 2 vols., reprinted from the 1890 original.
Mari Riess Jones and Marilyn Boltz. Dynamic attending and responses to time. Psychological
Review, 96(3):459–491, 1989. doi: 10.1037/0033-295X.96.3.459.
Hans Kamp. Events, instants and temporal reference. In Rainer Bäuerle, Urs Egli, and Arnim
von Stechow, editors, Semantics from Different Points of View, number 6 in Springer Series in
Language and Communication, pages 376–417. Springer, Berlin, 1979.
Hans Kamp and Uwe Reyle. From Discourse to Logic. Kluwer Academic Publishers, Dordrecht,
1993.
Ruth Kempson, Wilfried Meyer-Viol, and Dov M. Gabbay. Dynamic Syntax. Blackwell
Publishers, Oxford and Malden, Mass., 2001.
Adam Kendon. Some functions of gaze-direction in social interaction. Acta Psychologica, 26
(1):22–63, 1967. doi: 10.1016/0001-6918(67)90005-4.
Adam Kendon. Some uses of the head shake. Gesture, 2(2):147–182, 2002.
Adam Kendon. Gesture: Visible Action as Utterance. Cambridge University Press, 2004.
Jaegwon Kim. Concepts of supervenience. Philosophy and Phenomenological Research, 45(2):
153–176, 1984.
Lawrence Krader. Noetics: The Science of Thinking and Knowing. Peter Lang, New York, 2010.
John W. Krakauer, Asif A. Ghazanfar, Alex Gomez-Marin, Malcolm A. MacIver, and David
Poeppel. Neuroscience needs behavior: Correcting a reductionist bias. Neuron, 93(3):480–
490, 2017. ISSN 0896-6273. doi: 10.1016/j.neuron.2016.12.041.
Staffan Larsson. Issue based Dialogue Management. PhD thesis, Gothenburg University, 2002.
Alex Lascarides and Matthew Stone. Discourse coherence and gesture interpretation. Gesture,
9(2):147–180, 2009.
Gene Howard Lerner. Collaborative turn sequences: Sentence construction and social action.
PhD thesis, University of California, Irvine, 1988.
Stephen C. Levinson and Francisco Torreira. Timing in turn-taking and its implications for
processing models of language. Frontiers in Psychology, 6:731, 2015.
David Lewis. Scorekeeping in a language game. In Rainer Bäuerle, Urs Egli, and Arnim von
Stechow, editors, Semantics from Different Points of View, number 6 in Springer Series in
Language and Communication, pages 172–187. Springer, Berlin, 1979.
Mario Liotti, Kathy Ryder, and Marty G. Woldorff. Auditory attention in the congenitally blind:
where, when and what gets reorganized? NeuroReport, 9(6):1007–1012, 1998.
Piotr Litwin and Marcin Miłkowski. Unification by fiat: Arrested development of predictive
processing. Cognitive Science, 44:e12867, 2020. doi: 10.1111/cogs.12867.
Daniel Loehr. Aspects of rhythm in gesture and speech. Gesture, 7(2):179–214, 2007.
Andy Lücking and Jonathan Ginzburg. Towards the score of communication. In Proceedings of
The 24th Workshop on the Semantics and Pragmatics of Dialogue, SemDial/WatchDial, 2020.
Andy Lücking and Jonathan Ginzburg. Saying and shaking ‘no’. In Proceedings of the 28th
International Conference on Head-Driven Phrase Structure Grammar, HPSG 2021, 2021.
Andy Lücking, Alexander Mehler, and Peter Menke. Taking fingerprints of speech-and-gesture
ensembles: Approaching empirical evidence of intrapersonal alignment in multimodal
communication. In Proceedings of the 12th Workshop on the Semantics and Pragmatics of
Dialogue, LonDial’08, pages 157–164, King’s College London, 2008.
David Marr. Vision. Freeman, San Francisco, 1982.
Chiara Mazzocconi, Ye Tian, and Jonathan Ginzburg. What is your laughter doing there: a
taxonomy of the pragmatic functions of laughter. IEEE Transactions on Affective Computing,
2020.
David McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago
Press, 1992.
Alexander Mehler and Andy Lücking. Pathways of alignment between gesture and speech:
Assessing information transmission in multimodal ensembles. In Gianluca Giorgolo and
Katya Alahverdzhieva, editors, Proceedings of the International Workshop on Formal and
Computational Approaches to Multimodal Communication under the auspices of ESSLLI,
2012.
Lotte Meteyard, Sara Rodriguez Cuadrado, Bahador Bahrami, and Gabriella Vigliocco. Coming
of age: A review of embodiment and the neuroscience of semantics. Cortex, 48(7):788–804,
2012. doi: 10.1016/j.cortex.2010.11.002.
Lorenza Mondada. The local constitution of multimodal resources for social interaction. Journal
of Pragmatics, 65:137–156, 2014. doi: 10.1016/j.pragma.2014.04.004.
Lorenza Mondada. Challenges of multimodality: Language and the body in social interaction.
Journal of Sociolinguistics, 20(3):336–366, 2016. doi: 10.1111/josl.1_12177.
Richard Montague. Pragmatics. In Richmond Thomason, editor, Formal Philosophy. Yale UP,
New Haven, 1974.
Peter Mundy and Lisa Newell. Attention, joint attention, and social cognition. Current Directions
in Psychological Science, 16(5):269–274, 2007. doi: 10.1111/j.1467-8721.2007.00518.x.
PMID: 19343102.
Lauri Nummenmaa and Andrew J. Calder. Neural mechanisms of social attention. Trends in
Cognitive Sciences, 13(3):135–143, 2009. doi: 10.1016/j.tics.2008.12.006.
Josef Perner, Michael Huemer, and Brian Leahy. Mental files and belief: A cognitive theory
of how children represent belief and its intensionality. Cognition, 145(Supplement C):77–88,
2015. doi: 10.1016/j.cognition.2015.08.006.
Martin J. Pickering and Simon Garrod. Toward a mechanistic psychology of dialogue. Behavioral
and Brain Sciences, 27(2):169–190, 2004.
Martin J. Pickering and Simon Garrod. An integrated theory of language production and
comprehension. Behavioral and Brain Sciences, 36(4):329–347, 2013. doi: 10.1017/
S0140525X12001495.
Massimo Poesio and Hannes Rieser. Completions, continuations, and coordination in dialogue.
Dialogue and Discourse, 1(1):1–89, 2010.
Isabella Poggi. Mind markers. In The Semantics and Pragmatics of Everyday Gestures. Verlag
Arno Spitz, Berlin, 2001.
Carl Pollard and Ivan A. Sag. Head-Driven Phrase Structure Grammar. CSLI Publications,
Stanford, CA, 1994.
Matthew Purver. CLARIE: Handling clarification requests in a dialogue system. Research on
Language & Computation, 4(2):259–288, 2006.
François Recanati. Mental files. Oxford University Press, 2012.
Craige Roberts. Information structure in discourse: Towards an integrated formal theory of
pragmatics. OSU Working Papers in Linguistics, pages 91–136, 1996.
Juan Pablo Robledo, Sarah Hawkins, Carlos Cornejo, Ian Cross, Daniel Party, and Esteban
Hurtado. Musical improvisation enhances interpersonal coordination in subsequent conversation:
Motor and speech evidence. PLoS ONE, 16(4):e0250166, 2021. doi: 10.1371/journal.pone.
0250166.
Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for the
organization of turn-taking for conversation. Language, 50(4):696–735, 1974.
Emanuel A. Schegloff. Overlapping talk and the organization of turn-taking for conversation.
Language in Society, 29:1–63, 2000.
Emanuel A. Schegloff. Sequence Organization in Interaction. Cambridge University Press,
Cambridge, 2007.
Natalie Sebanz and Guenther Knoblich. Prediction in joint action: What, when, and where.
Topics in Cognitive Science, 1(2):353–367, 2009. doi: 10.1111/j.1756-8765.2009.01024.x.
Robert C. Stalnaker. Assertion. In P. Cole, editor, Syntax and Semantics, Volume 9, pages
315–332. AP, New York, 1978.
Tanya Stivers and Nick J. Enfield. A coding scheme for question–response sequences in
conversation. Journal of Pragmatics, 42(10):2620–2626, 2010.
Jürgen Streeck. Gesturecraft. Number 2 in Gesture Studies. John Benjamins, 2009.
Jürgen Streeck and Ulrike Hartge. Previews: Gestures at the transition place. In Peter Auer and
Aldo Di Luzio, editors, The Contextualization of Language, pages 135–157. John Benjamins,
Amsterdam, 1992.
Deborah Tannen. Conversational style: Analyzing talk among friends. Oxford University Press,
1984.
Henry S Thompson. Conversation as musical interaction, 1993. HCRC Edinburgh unpublished
lecture.
Ye Tian and Jonathan Ginzburg. No I Am: What are you saying “No” to? In Sinn und Bedeutung
21, 2016.
Ye Tian, Takehiko Maruyama, and Jonathan Ginzburg. Self-addressed questions and filled pauses:
A cross-linguistic investigation. Journal of Psycholinguistic Research, 46(4):905–922, 2017.
Michael Tomasello. The Cultural Origins of Human Cognition. Harvard University Press,
Cambridge, MA, 1999.
Kevin Tuite. The production of gesture. Semiotica, 93(1/2):83–105, 1993.
Roel Vertegaal, Robert Slagter, Gerrit van der Veer, and Anton Nijholt. Eye gaze patterns in
conversations: There is more to conversational agents than meets the eyes. In Proceedings of
SIGCHI 2001, CHI ’01, pages 301–308, 2001. doi: 10.1145/365024.365119.
Hannes Vilhjálmsson, Nathan Cantelmo, Justine Cassell, Nicolas E. Chafai, Michael Kipp,
Stefan Kopp, Maurizio Mancini, Stacy Marsella, Andrew N. Marshall, Catherine Pelachaud,
Zsofi Ruttkay, Kristinn R. Thórisson, Herwin van Welbergen, and Rick J. van der Werf. The
Behavior Markup Language: Recent developments and challenges. In Catherine Pelachaud,
Jean-Claude Martin, Elisabeth André, Gérard Chollet, Kostas Karpouzis, and Danielle Pelé,
editors, Intelligent Virtual Agents, pages 99–111, Berlin, 2007. Springer.
Markus Werning. Predicting the past from minimal traces: Episodic memory and its distinction
from imagination and preservation. Review of Philosophy and Psychology, 11:301–333, 2020.
doi: 10.1007/s13164-020-00471-z.
Jiahong Yuan, Mark Liberman, and Christopher Cieri. Towards an integrated understanding of
speaking rate in conversation. In Proceedings of INTERSPEECH, pages 541–544, 2006.
Jiahong Yuan, Mark Liberman, and Christopher Cieri. Towards an integrated understanding of
speech overlaps in conversation. ICPhS XVI, Saarbrücken, Germany, 2007.