Turn-taking in Conversational Systems and Human-Robot
Interaction: A Review
Gabriel Skantze
Department of Speech Music and Hearing, KTH, Sweden
ARTICLE INFO
Article History:
Received 25 August 2020
Accepted 4 December 2020
Available online 16 December 2020
ABSTRACT
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen
at the same time, the participants need to coordinate who is currently speaking and when
the next person can start to speak. Humans are very good at this coordination, and typically
achieve fluent turn-taking with very small gaps and little overlap. Conversational systems
(including voice assistants and social robots), on the other hand, typically have problems
with frequent interruptions and long response delays, which has called for a substantial
body of research on how to improve turn-taking in conversational systems. In this review
article, we provide an overview of this research and give directions for future research. First,
we provide a theoretical background of the linguistic research tradition on turn-taking and
some of the fundamental concepts in theories of turn-taking. We also provide an extensive
review of multi-modal cues (including verbal cues, prosody, breathing, gaze and gestures)
that have been found to facilitate the coordination of turn-taking in human-human interac-
tion, and which can be utilised for turn-taking in conversational systems. After this, we
review work that has been done on modelling turn-taking, including end-of-turn detection,
handling of user interruptions, generation of turn-taking cues, and multi-party human-robot
interaction. Finally, we identify key areas where more research is needed to achieve fluent
turn-taking in spoken interaction between man and machine.
© 2020 The Author. Published by Elsevier Ltd. This is an open access article under the CC BY
license (http://creativecommons.org/licenses/by/4.0/)
Keywords:
Turn-taking
Dialogue systems
Social robotics
Prosody
Gaze
1. Introduction
Many human social activities require some kind of turn-taking protocol, which determines the order in which different
actions are supposed to take place, and by whom. This is obvious when, for example, playing a game of chess (where the protocol
is very simple), but it also applies to spoken interaction. Since it is difficult to speak and listen at the same time, speakers in dia-
logue have to somehow coordinate who is currently speaking and who is listening. How this turn-taking is coordinated has been
studied during the past decades in different scientific disciplines, including linguistics, phonetics, neuropsychology, and sociol-
ogy.
Turn-taking is however not only a concern for those trying to understand human communication. As conversational systems
(in various forms) are becoming ubiquitous, it is clear that turn-taking is still not handled very well in those systems. They often
tend to interrupt the user or have very long response delays, there is little timely feedback, and the flow of the conversation feels
stilted. Thus, modelling turn-taking in conversational systems is still very much an area of active research. In this review article,
we will give an overview of findings from studies of turn-taking in human-human conversation, outline the state-of-the-art in
modelling turn-taking in conversational systems and human-robot interaction, and draw conclusions about future directions for
this research field.
The first dialogue systems, such as Eliza (Weizenbaum, 1966), were text-based, and in these turn shifts were clearly indicated by the user pushing the "send" button. The same goes for the chatbots still used today, on for example websites or in messaging
apps. Similarly, some speech-based systems have relied on push-to-talk mechanisms for managing turn-taking (e.g., Hemphill
et al. 1990; Traum et al. 2007; 2012). Push-to-talk is however not very convenient if the hands are busy or when talking over dis-
tance (for example with a robot or a smart speaker). There is also a risk that the user forgets to push the button, pushes the button
after starting to speak, or releases it before the utterance is complete (Skantze and Gustafson, 2009). Such explicit signals are of
course sometimes used in human-human interaction as well, especially in simplex channel settings, for example the "over" signal used in walkie-talkie interactions. Other examples of explicit turn-taking signalling are the wake-words used in today's smart speakers and voice assistants, such as "Hey Siri" or "Alexa" (Gao et al., 2020). The wake-word gives a clear cue that the user wants
to initiate a turn (although the end of the turn has to be detected by other means). The wake-word also helps to identify the
addressee of the utterance (as being the voice assistant and not some other person), as well as allowing the user to barge-in
more easily. However, while they can be effective, the use of explicit cues is not very convenient and often leads to an interaction
that is less "conversational" (Woodruff and Aoki, 2003). In most conversational settings, we manage turn-taking efficiently without thinking about how it is actually accomplished.
When not using explicit cues, spoken dialogue systems have traditionally detected the end of the user's turn by a certain
amount of silence. Silence, however, is not a very good indicator of turn-endings, since users might pause within a turn and
thereby unintentionally trigger a response. This can be mitigated by increasing the silence threshold, but this will then lead to
more sluggish responses. Studies of human-human conversation have found that pauses within turns are on average longer than
gaps between turns (Brady, 1968; Ten Bosch et al., 2005; Edlund and Heldner, 2005), so silence is clearly not the main signal for
humans to switch turns.
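To make the problem concrete, the following is a minimal sketch (in Python) of the silence-based end-of-turn detection described above. The frame length, silence threshold and VAD interface are illustrative assumptions, not taken from any particular system.

# Minimal sketch of silence-threshold endpointing (illustrative only). A VAD is
# polled every 20ms frame; the user's turn is considered finished once the
# accumulated silence exceeds a fixed threshold.

FRAME_MS = 20                 # assumed frame length
SILENCE_THRESHOLD_MS = 800    # typical values are around 700-1000ms (see text)

def detect_end_of_turn(vad_frames):
    """vad_frames: iterable of booleans (True = speech detected in that frame).
    Returns the time (in ms) at which the end of the user's turn is declared,
    or None if it is never declared."""
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0                      # any speech resets the silence counter
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_THRESHOLD_MS:
                return (i + 1) * FRAME_MS       # declare end of turn
    return None

# Example: 1s of speech, a 500ms pause (does not trigger), more speech, then silence.
frames = [True] * 50 + [False] * 25 + [True] * 25 + [False] * 50
print(detect_end_of_turn(frames))   # end of turn declared 800ms after the last speech frame

Raising SILENCE_THRESHOLD_MS reduces the risk of responding within a pause, but at the cost of a correspondingly slower response after genuine turn endings, which is exactly the trade-off discussed above.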
The signals (or cues) by which speakers coordinate their turn-taking have been studied extensively, and have been found
across different modalities, including verbal cues, prosody, breathing, eye gaze and gestures, which we will explore in depth in Section 3 of this review. Thus, apart from the auditory channel, the visual channel (the face and body) is also important for turn-
taking. Therefore, conversational systems that involve virtual or physical agents are also very interesting from a turn-taking per-
spective, as they provide a wider repertoire of turn-taking cues, and will also be covered in this review. Human-robot interaction
(HRI) is especially interesting, as the situated nature of the interaction more easily allows for multi-party conversations, where
turn-taking is an even more complex phenomenon.
This review article is organised as follows. We will start with a brief review of the linguistic research tradition on turn-taking
and introduce some fundamental concepts (Section 2). Then we will review the cues that have been found to facilitate turn-tak-
ing across different modalities (Section 3). After this, we will review four main aspects of turn-taking in conversational systems
that have so far attracted a considerable amount of research:
How can the system identify appropriate places to take the turn or to produce a backchannel? (Section 4)
How can the system handle interruptions, overlaps and backchannels from the user? (Section 5)
How can the system generate turn-taking signals that help the user to understand whether the floor is open or not?
(Section 6)
How can the system handle multi-party and situated interaction (which is common for human-robot interaction), where
there might be several potential addressees of an utterance, and which might involve the manipulation of physical objects?
(Section 7)
Finally, in Section 8, we will identify a couple of directions where we think more research is needed.
As most research on turn-taking has been done on English (especially when it comes to computational modelling), this will also be reflected in this review, with some exceptions. If the language of study is not explicitly stated, the reader should assume
it is English.
2. Fundamental concepts
An example of how tightly coordinated turn-taking can be is shown in Fig. 1, where one person (speaker A) is describing a
route on a map to another person (speaker B). As can be seen, there are very small (or no) gaps between the turns. Already before
the first question from A is complete, B starts to answer. This is truly remarkable, given that B not only has to interpret what is being said, but also figure out a response, and then start to articulate that response (Levelt, 1989).
One of the most influential early accounts of the organisation of turn-taking is the one proposed by Sacks et al. (1974). Their model (which applies to dyadic as well as multi-party interaction) is based on a set of basic observations. To start with, the organisation is not planned in advance, but has to be coordinated in a flexible manner as the dialogue evolves. Also, they note that "overwhelmingly one party talks at a time. [...] Occurrences of more than one speaker at a time are common but brief [...] Transitions (from one turn to the next) with no gap and no overlap are common. Together with transitions characterised by slight gap or slight overlap, they make up the vast majority of transitions" (p. 700).
Based on these observations, they propose that turn-taking can be analysed using units of speech called turn-constructional
units (TCU), which are stretches of speech from one speaker during which other participants assume the role of listeners. After
each such unit, there is a transition-relevant place (TRP), where a turn-shift can (but does not have to) occur according to the
following rules:
1. The current speaker may select a next speaker (other-select), using for example gaze or an address term. In the case of
dyadic conversation, this may default to the other speaker.
2. If the current speaker does not select a next speaker, then any participant can self-select. The rst to start gains the turn.
3. If no other party self-selects, the current speaker may continue.
In order to identify TCUs and TRPs, researchers using speech technology have found it convenient to first segment the speech into Inter-pausal units (IPUs), which are stretches of audio from one speaker without any silence exceeding a certain amount (such as 200ms). These can relatively easily be identified using voice activity detection (VAD). This was noted already by Brady (1965), who used this technique to analyse turn-taking patterns with automatic methods. A turn is then typically defined as a sequence of IPUs from a speaker, which are not interrupted by IPUs from another speaker. One possible exception is very short IPUs (like "mhm") that might occur without the intention of taking the turn. We will discuss such backchannels further down. Silences between two IPUs from the same speaker are often referred to as pauses, whereas silences between IPUs from different speakers are called gaps. These concepts are illustrated in Fig. 1.
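As a concrete illustration, the sketch below segments per-speaker voice activity into IPUs and then derives pauses and gaps (with negative gaps corresponding to overlaps). The 200ms merge threshold follows the example in the text, while the input format is an assumption made for this example.

# Sketch: deriving IPUs, pauses and gaps from voice activity intervals.
# The input format (speaker -> list of (start, end) speech intervals, in seconds)
# is an illustrative assumption; the 200ms threshold follows the example above.

IPU_GAP_THRESHOLD = 0.2   # silences shorter than this are merged into one IPU

def to_ipus(speech_intervals):
    """Merge speech intervals separated by less than IPU_GAP_THRESHOLD into IPUs."""
    ipus = []
    for start, end in sorted(speech_intervals):
        if ipus and start - ipus[-1][1] < IPU_GAP_THRESHOLD:
            ipus[-1] = (ipus[-1][0], end)       # extend the previous IPU
        else:
            ipus.append((start, end))
    return ipus

def pauses_and_gaps(ipus_by_speaker):
    """Return within-speaker pauses and between-speaker gaps (negative = overlap)."""
    all_ipus = sorted((start, end, spk)
                      for spk, ipus in ipus_by_speaker.items()
                      for start, end in ipus)
    pauses, gaps = [], []
    for (s1, e1, spk1), (s2, e2, spk2) in zip(all_ipus, all_ipus[1:]):
        latency = s2 - e1
        (pauses if spk1 == spk2 else gaps).append(latency)
    return pauses, gaps

# Toy example: B starts answering just before A's IPU is finished (overlap).
vad = {"A": [(0.0, 1.5), (1.6, 2.5)], "B": [(2.4, 3.2)]}
ipus = {spk: to_ipus(intervals) for spk, intervals in vad.items()}
print(ipus)                    # A's two intervals merge into one IPU
print(pauses_and_gaps(ipus))   # gap of about -0.1s, i.e. a 100ms overlap

A histogram over such gap values is exactly the kind of turn-taking latency distribution discussed next (cf. Fig. 2).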
By operationalising turn-taking using IPUs, it is possible to analyse turn-taking patterns and statistics in larger corpora with
automatic methods (e.g. Brady 1968; Ten Bosch et al. 2005; Heldner and Edlund 2010; Levinson and Torreira 2015). One example
of this is the histogram of turn-taking latency shown in Fig. 2. As can be seen, even if gaps and overlaps are common, humans are
Fig. 1. Example of turn-taking from Map Task (Anderson et al., 1991).
Fig. 2. Turn-taking latency in the Switchboard corpus (Godfrey et al., 1995), as calculated and visualised by Levinson and Torreira (2015). Negative latencies rep-
resent overlaps and positive latencies gaps.
typically very good at keeping them short, often with just a 200ms gap (although reported medians vary across interaction styles
and cultures, as discussed in Section 2.5). It is clear that a response time of 200ms is much shorter than the silence threshold commonly used in spoken dialogue systems (often around 700-1000ms). Thus, there have to be some other mechanisms in place that help the listener to take turns with such small gaps, while minimising the amount of overlap.
2.1. Turn-taking cues
Several studies have investigated cues that can be found at the end of IPUs, which could be used by the listener to distinguish
TRPs (turn-yielding cues) from non-TRPs (turn-holding cues). Among the first to systematically study these cues were Duncan and his associates (Duncan, 1972; 1974; Duncan and Fiske, 1977). By analysing face-to-face (American English, dyadic) conversations, they identified several multi-modal cues that signal turn-completion, including phrase-final intonation, termination of hand gesticulation, and completion of a grammatical clause. It is important to stress that these cues do not have a definite effect
on the recipient, and that turn-taking is highly optional. However, what they found was that these cues had an additive effect:
the listener was more likely to take the turn as the number of turn-yielding cues increased. These studies have later been fol-
lowed by a large number of studies that have investigated the effects of individual cues, as well as the effects of combining them,
using larger datasets, automatic methods, and more thorough statistical analyses (e.g. Koiso et al. 1998; Gravano and Hirschberg
2011; Hjalmarsson 2011). In general, these studies tend to confirm the finding that turn-taking cues are additive, even if there is
also a considerable amount of redundancy. A summary of typical cues found in studies of turn-taking is listed in Table 1. We will
provide an in-depth review of these cues in Section 3.
A complicating factor in these kinds of analyses is that TRPs (in the sense proposed by Sacks et al.) cannot be directly observed in data. We can only observe actual turn-shifts, which might thus be considered to be a subset of TRPs. Furthermore, turn-shifts might also occur where there is no TRP (for example if the listener has something very important to say), as speakers are of course free to break the rules, just like they are free to produce "non-grammatical" utterances. Given this, some researchers have questioned whether it is meaningful to talk about turn-taking "rules" at all (O'Connell et al., 1990). In any case, it seems clear
that turn-shifts are more likely to occur at certain places than others. Thus, we would like to propose that TRP should not be
regarded as a binary notion, but rather as the probability of a speaker shift in a certain context (for example given a certain set of
cues).
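One way of making this probabilistic view concrete is to treat turn-yielding cues (such as those in Table 1) as binary features and estimate the probability of a speaker shift with a simple additive model, in the spirit of the additive effect reported by Duncan and of the regression models discussed later. The sketch below uses logistic regression; the cue inventory and the toy data are purely illustrative assumptions, not real corpus statistics.

# Sketch: TRP as the probability of a speaker shift given a set of binary
# turn-yielding cues. The cue set and the toy training data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

CUES = ["syntactically_complete", "final_pitch_movement", "gaze_at_listener"]

# Each row is an IPU ending, each column a cue (1 = present);
# y = 1 means a turn shift actually followed in the (made-up) data.
X = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0],
              [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# The predicted probability grows as more cues are present, mirroring the
# additive effect of turn-yielding cues described above.
for n in range(4):
    cues = [1] * n + [0] * (len(CUES) - n)
    p = model.predict_proba([cues])[0, 1]
    print(f"{n} cue(s) present -> P(turn shift) = {p:.2f}")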
In addition to the signals produced by the current speaker, it is also relevant to consider the signals produced by the next
speaker to indicate her willingness to take the turn. Duncan (1974) argues that there are certain signals which show that the
next speaker wants to take the turn (when self-selecting or when accepting an offered turn), which we will refer to as turn-ini-
tial cues. These cues include things like gazing away, inhaling or initiating a gesture (ibid.). Thus, they are similar to the turn-
holding cues discussed above, although they are produced before the onset, or just at the start, of a turn. The turn-initial cue can
also help to differentiate attempts to take the turn from backchannels, which will be discussed in Section 2.3 below.
2.2. Reaction vs. prediction
While the turn-taking cues outlined above provide some explanation as to how the listener can differentiate pauses from
gaps, they cannot provide a full account for how turn-taking is coordinated. If the listener only focused on cues at the end of the
turn, it would not really be plausible to find a typical response time of 200ms (as seen in Fig. 2). This would not give the listener
enough time to react to the cue, prepare a response and start speaking. According to psycholinguistic estimates, the response
time would rather be around 600-1500ms (Levinson and Torreira, 2015). This has led many researchers to conclude that there
must be some sort of prediction mechanism involved (Sacks et al., 1974; Levinson and Torreira, 2015; Garrod and Pickering,
2015; Ward, 2019). This is even more evident when considering the fairly large proportion of turn-shifts that occur without any
gap at all, i.e., before the IPU has even been completed (as seen in Fig. 2). An example of such a turn-shift was shown in Fig. 1.
Well before the first question is complete (before the final word "volcano" is spoken), speaker B must predict that the turn is about to end, what kind of dialogue act is being produced (a question), as well as the final word in the question, in order to pre-
pare a meaningful answer.
This predictive view is sometimes contrasted with the "signalling approach" (or "reactive" view) to turn-taking, which focuses
on the cues found at the end of the turn. However, even though the predictive and reactive accounts of turn-taking are
Table 1
Typical turn-final cues found in studies of English conversation.

             Turn-yielding cues         Turn-holding cues
Verbal       Syntactically complete     Syntactically incomplete, filled pause
Prosody      Rising or falling pitch    Flat pitch
             Lower intensity            Higher intensity
Breathing    Breathe out                Breathe in
Gaze         Looking at addressee       Looking away
Gesture      Terminated                 Non-terminated
sometimes portrayed as two opposing views, it seems like most researchers today acknowledge the need for both prediction
mechanisms (which allow the other speaker to prepare a response), as well as some turn-final cues that can confirm that the turn is actually complete and yielded (Heldner and Edlund, 2010; Levinson and Torreira, 2015). Ward (2019) makes an analogy with the "Ready-Set-Go!" signal used when starting a sprint.
The mechanisms used for prediction are more complex to study and harder to identify than the signals found at the end of the
turn. Sacks et al. (1974) argued that syntax and semantics were more important for prediction than for example intonation, as the
completion of a syntactic unit is likely to be more predictable than a prosodic unit (more on that in Section 3). Garrod and Picker-
ing (2015) argue that dialogue act prediction is an integral part of end-of-turn prediction. In their account, the addressee covertly
imitates the speaker's utterance in order to determine the underlying intention of the upcoming utterance (in terms of content
and timing). In addition, they argue that the speakers rely on entrainment of low-frequency oscillations between speech enve-
lope and brain, as originally proposed by Wilson and Wilson (2005).
2.3. Overlaps, backchannels and interruptions
As we have seen, even though dialogue proceeds predominately with one speaker at a time, there is typically a considerable
amount of overlap. It is important to note that overlap should not just be considered as "failed" turn-taking, as it often serves many important functions and contributes to a fluent interaction (Coates, 1994). A distinction can be made between competitive and cooperative overlap. In competitive overlaps, the two speakers are seen as competing for the turn, where one of them will have to "give up" the turn. In cooperative overlaps, they are seen as producing speech in a collaborative manner, and are not com-
peting for the turn.
Schegloff (2000) makes a further distinction between four types of cooperative overlaps:
Terminal overlaps: The listener predicts the end of the turn and starts speaking before it is completed, as illustrated in Fig. 1.
Continuers (or Backchannels): Brief, relatively soft vocalisations, such as "mm hm", "uh huh", or "yeah", produced by the listener to show continued attention and possibly aspects such as attitude and uncertainty (Ward, 2004). This phenomenon has been referred to as "backchannels" (Yngve, 1970), "listener responses" (Dittman and Llewellyn, 1967) and "accompaniment signals" (Kendon, 1967). In a face-to-face setting, backchannels can also be produced in the visual channel, for example by nodding or making facial expressions.
Conditional access to turn: Cases where the listener helps the speaker to construct the turn and possibly bring it to completion. This might be to just fill in a name that the speaker has forgotten, or a longer sequence that is produced jointly. This is
often referred to as sentence completion (Poesio and Rieser, 2010).
Choral talk: Simultaneous production of speech. This includes laughter or greetings done in concert.
Backchannels hold a special status when it comes to turn-taking, since they are fairly frequent but still not typically consid-
ered to constitute a turn. Thus, in automatic analyses of turn-taking based purely on VAD, they have to be accounted for some-
how. Similar to how turn shifts are found more often after certain cues, the timing of backchannels is also thought to be
associated with certain backchannel-inviting cues, where the speaker is looking for evidence of understanding from the listener
(Clark, 1996). However, since the speaker typically intends to continue after the backchannel, these cues should look a bit differ-
ent from turn-yielding cues. Thus, analogous to TRPs, we will use the term backchannel-relevant places (BRPs), which we will
come back to in Section 3.
Since the listener does not intend to take the turn when producing a backchannel (or other forms of cooperative overlaps), it is
important that the current speaker can differentiate these from attempts to take the turn. As discussed in Section 2.1 above (and
in depth by Duncan 1974), the production of turn-initial cues can help in this regard. When it comes to the vocalisations of com-
petitive or cooperative overlap, several studies have found that competitive overlaps have higher pitch and intensity (French and
Local, 1983; Yang, 2001), as well as other acoustic properties (Oertel et al., 2012b; Truong, 2013). In a face-to-face setting, eye-
brow movement and mouth opening can also be good predictors (Lee and Narayanan, 2010).
An interesting question is whether overlaps tend to occur at certain points rather than others. Dethlefs et al. (2016) investi-
gated information density as a predictor for overlaps (a word with high information density is a word that is not very probable,
given its context), as it might be easier to perceive overlapping speech if it is more predictable. They conducted a perception
experiment, where overlapping speech from a dialogue system, in the form of synthesized interruptions and backchannels, were
added to the speech from a user. When overlapping speech segments were inserted in regions with low information density, sub-
jects rated them as more natural.
Unlike cooperative overlaps, competitive overlaps need some kind of resolution mechanism (to determine who should get the
floor). According to Schegloff (2000), competitive overlapping talk is characterised by "hitches and perturbations" in the speech, which involve increases in intensity, higher pitch, changes in pace, glottal stops, or repetitions. In his corpus analysis, most com-
petitive overlaps were resolved (meaning that one of the participants gives up the turn) after one or two syllables.
It is important not to confuse overlap with interruptions. As pointed out by Bennett (1981), overlaps can be objectively iden-
tified in a corpus, whereas the notion of interruptions requires some form of interpretation, i.e., that some participant is violating the other participant's right to speak. It should also be stressed that interruptions are not the same thing as competitive overlap, as defined above, since interruptions can also occur without overlap, in the case where a speaker makes a pause (completes an IPU without
yielding the turn), and the other participant starts to speak (Gravano and Hirschberg, 2012). This makes it hard to automatically
identify interruptions in a corpus purely based on VAD patterns, and a dialogue system might interrupt the user without any
overlap involved (i.e., taking the turn after an IPU that is not a TRP). Based on manual annotation of interruptions in a task-ori-
ented dialogue corpus, Gravano and Hirschberg (2012) found that the onset of non-overlapping IPUs labeled as interruptions had
higher intensity, pitch level and speech rate.
2.4. Situated, multi-party interaction
When studying and modelling turn-taking, it is important to take the setting of the conversation into account. Intuitively, a
face-to-face setting provides a richer repertoire of cues for coordinating turn-taking than a conversation over the phone and
could therefore be expected to be more fluent. For example, seeing each other's faces allows us to perceive gaze direction and
facial expressions. However, studies that compare spoken interaction in video meetings with voice-only interactions have not
found any substantial differences when it comes to the coordination of turn-taking (O'Conaill et al., 1993; Sellen, 1995). But when comparing video conferences to physical face-to-face meetings, O'Conaill et al. (1993) found that the former had longer
conversational turns, fewer overlaps and backchannels, as well as more formal mechanisms for shifting turns. Thus, it seems like
the physical co-presence allows us to more easily pick up these visual cues and coordinate turn-taking. Especially in multi-party
interaction, video conferences and animated agents on 2D displays are not very well suited for efficient turn-taking
(Al Moubayed and Skantze, 2011). This is an important argument for why physical robots might provide better opportunities for
social interaction compared to virtual agents or voice assistants (Skantze, 2016). However, even in dyadic interactions, the physi-
cal presence of a robot has been shown to be beneficial in for example language learning, compared to a virtual character on a
screen (Leyzberg et al., 2012).
In dyadic interaction, it is always clear who is supposed to speak next when the turn is yielded. In multi-party interaction, on
the other hand, this has to be coordinated somehow. Most people have experienced how problematic this can be when we lack
clear signals for this coordination, such as in online meetings. As discussed in the beginning of Section 2, the basic model of turn-
taking proposed by Sacks et al. (1974) also accounts for multi-party settings. In this model, the current speaker may select the
next speaker, who then has the right and is obliged to take the next turn to speak, whereas no other participant is supposed to
do so. If the current speaker does not select a next speaker, any participant has the opportunity to self-select, or the current
speaker may continue. This means that turn-taking in multi-party interaction should be even harder to predict, as the transitions
are more optional.
If the current speaker selects the next speaker, a common signal is to gaze at the next speaker, which indicates the attention of
the current speaker, but it is also possible to use other means, such as the addressee's name or a pointing gesture (Auer, 2018). However, as argued by Auer (2018), addressee selection and next-speaker selection are not always the same thing. It is for example possible to address a statement to several people (by alternatingly looking at them), while selecting a specific person as the next speaker (by finally looking at that person). Nevertheless, from a computational perspective, this distinction is often not made,
and the problem of identifying the target of an utterance is often referred to as addressee detection (cf. Jovanovic et al. 2006;
Katzenmaier et al. 2004; Vinyals et al. 2012), which we will come back to in Section 7.1.
In multi-party interaction, we can also distinguish several different roles of the (potential) participants. In dyadic interaction,
each participant is currently either a speaker or addressee (the listener). However, multi-party interaction involves other types of
listeners. Among those considered to be participants in the interaction, there might also be side participants, who are neither
currently speaking nor being addressed, but who might still take the turn at any point. But there might also be overhearers in the
vicinity, who are not considered to be participants, such as bystanders (still openly present to the participants) and eavesdrop-
pers (Goffman, 1979; Clark, 1996).
2.5. Cultural and developmental aspects
Even though the general patterns of turn-taking (one speaker at a time) seem to be fairly generic across languages
(Stivers et al., 2009), there are also notable differences when it comes to which specific cues are being used, and overall distribu-
tions. For example, Stivers et al. (2009) found mean gap length to vary substantially between 10 different languages, from 7ms in
Japanese to 489ms in Danish. The frequency and placement of backchannels also seem to be different across languages. In a study
of English, Mandarin and Japanese, Clancy et al. (1996) found that backchannels are most frequent in Japanese, fairly frequent in
English, and least frequent in Mandarin. The placement of backchannels also seemed to be more aligned with grammatical com-
pletion points in English and Mandarin than in Japanese. When it comes to prosodic turn-taking cues, there are also differences,
which will be discussed in Section 3.2.
The developmental aspect of turn-taking is important, since turn-taking is a skill that gets acquired and perfected relatively
late in child language development. This learning begins by dyadic interaction with the caregiver, who is responsible for regulat-
ing most of the turn-taking (Ervin-Tripp, 1979). Later on, children also learn how to claim the floor in multi-party interaction, a skill that is mastered around the age of six. Even after that, children continue to learn how to take turns, and reduce gaps and overlaps. Whereas adults often take turns with very short gaps, children's gaps are typically longer: in some studies, average gap length has been measured at 1.5-2 s (ibid.). This gap length shortens with age, as the child learns to better pick up turn-yielding cues, project the interlocutor's turn ending, and plan their own response (ibid.). Researchers have also compared child-
child interaction with child-adult interaction (Martinez, 1987). Typically, children conversing with adults allow the adult to
regulate the interaction (they are less inclined to self-select), and then pick up these regulatory behaviours and employ them
when talking to other children.
3. Turn-taking cues
In this section, we will provide a more thorough review of the literature on turn-taking cues. As we will discuss in more depth
in the next sections, to be able to engage in a more fluent interaction, a conversational system needs to be able to both under-
stand and generate such cues.
Most of the more systematic studies on turn-taking cues have focused on cues found at the end of the turn. As discussed in
Section 2.2, it is not plausible that turn-final cues will suffice to give a full account of how turn-taking is coordinated. However, turn-final cues are of course much easier to identify in a systematic manner than cues that appear earlier and which are used for
projection of turn completion. Another problem with any corpus study is of course that it is hard to know whether cues corre-
lated with certain behaviours are in fact used as signals by the speakers (correlation does not imply causation). To sort this out,
corpus studies need to be complemented with controlled experiments. However, from a practical perspective, even if it turns out
that a certain cue is not actually used by the human listener, it is still possible that it could be utilised by a conversational system,
which has different computational constraints than the human brain.
3.1. Verbal cues: syntax, semantics and pragmatics
As conversation ultimately progresses through the exchange of meaningful contributions (dialogue acts), the verbal
aspect of spoken language (also referred to as linguistic features, i.e., the words spoken and the semantic and pragmatic
information that can be derived from those) is likely very important for regulating turn shifts. The completion of a syntactic
unit is intuitively a requirement for considering the turn as "done" (a TRP). For example, the phrase "I would like to order a..." is not syntactically complete, and the listener is likely to wait for the speaker to finish the sentence. Another argument
for the prominence of verbal cues for turn-taking is the projectability of syntactic units, which might help to explain the pre-
cise turn-taking coordination with very small gaps discussed in Section 2.2 (Sacks et al., 1974; Ford and Thompson, 1996).
Given the context in which the utterance is spoken, it might be possible to project the completion of the sentence and
thereby predict roughly how soon it will come. Such predictions are also necessary for certain types of collaborative over-
laps, such as the "choral" speech or sentence completions discussed in Section 2.3 above. Yet another argument for the
importance of verbal cues is the fact that turn-taking is a skill that is perfected relatively late in child language development
(as discussed in Section 2.5).
Ford and Thompson (1996) define an utterance to be syntactically complete if "in its discourse context, it could be interpreted as a complete clause, that is, with an overt or directly recoverable predicate" (p. 143). This includes elliptical clauses, answers to questions, and backchannel responses. Thus, syntactic completion (by this definition) does not have to be a complete
sentence. Neither is a syntactic phrase (like a nominal phrase) necessarily syntactically complete. Syntactic completion is judged
incrementally as the utterance unfolds. The following (made-up) example (from Ekstedt and Skantze, 2020) illustrates this
notion, where / marks syntactic completion points:
(1) A: yesterday we met / in the park /
B: okay / when / will you meet / again /
A: tomorrow /
As can be seen, in this account, the turn-initial adverb of time "yesterday" is not syntactically complete (as there is no overt or directly recoverable predicate), whereas "tomorrow" is, which illustrates the dependence on the dialogue context.
As pointed out by Ford and Thompson (1996), while syntactic completion might be necessary for a TRP, it is not sufficient. They make a distinction between syntactic completion and pragmatic completion. The latter is defined as having a "final intonation contour" and being "interpretable as a complete conversational action within its specific sequential context" (p. 150). In their analysis, these are the points that constitute actual TRPs. There is no precise definition of what is considered to be a complete conversational action, and the annotator is likely to rely to a fair extent on common sense. In the example above, "okay when will you meet" could constitute a valid question in itself, but is unlikely, given the preceding context. Thus, it would
be syntactically, but not pragmatically, complete. In the corpus analysis of Ford and Thompson (1996), about half of all syntactic
completions were also pragmatic completions.
We think the notion of pragmatic completion is useful, but it is not entirely clear why Ford and Thompson (1996) include pro-
sodic aspects in their definition of pragmatic completion. We argue it would be better to reserve the notion of pragmatic comple-
tion to the verbal aspect of speech. In this view, pragmatic completion points form a set of potential TRPs, to which other
modalities (prosody, gaze, etc) add further constraints to derive actual TRPs, as will be discussed in the next sections. In many
cases, pragmatic completion is not in itself sufficient for the listener to know when the speaker intends to yield the turn, as the
following example shows:
(2) I would like a hamburger / with fries / and a milkshake /
Again, it is important to stress that TRPs do not automatically imply turn shifts. In the analysis of Ford and Thompson (1996),
of all the TRPs identified through pragmatic completion (which in their account also included intonation), about half of them
involved actual turn shifts. In many cases, it also seemed like the next speaker started to speak at an earlier TRP, while the first
speaker further amended her utterance, resulting in a terminal overlap.
As we have seen, syntactic and pragmatic completion can be very hard to determine, especially since they often require the
consideration of the preceding dialogue context. Most of the more sophisticated analyses of turn-taking in the Conversation Anal-
ysis (CA) tradition have therefore relied on manual annotation. This can be problematic, as it is hard to annotate pragmatic and
syntactic completion without being influenced by the actual turn shifts in the corpus. This makes it challenging to study the role
of verbal cues in turn-taking.
When incorporating syntax into computational models of turn-taking, much simpler operationalisations have typically been
used, where the dialogue context is not considered at all. Several models have for example used the two final part-of-speech tags
(Gravano and Hirschberg, 2011; Meena et al., 2014; Koiso et al., 1998). For instance, a syntactically complete phrase is more likely
to end with a noun, but less likely to end with a conjunction or determiner. However, this can of course not account for the much
more sophisticated notion of syntactic or pragmatic completion discussed above. More recent turn-taking models have used
LSTMs to encode linguistic information, such as part-of-speech (Skantze, 2017b), words (Roddy et al., 2018a) or senones
(Masumura et al., 2018).
Although several of these studies have found that linguistic information contributes to the performance (compared to, for example, only using prosody), the performance gain is perhaps not as big as what could be expected. One explanation for this could be the lack of proper modelling of the dialogue context. Recently, the use of stronger transformer-based language models for identifying TRPs has been proposed by Ekstedt and Skantze (2020), showing substantially better performance than the use of turn-final words or LSTMs. Further analyses of the models also showed that they indeed do utilise the context of the preceding turns. The model's predictions on example (1) above are shown in Fig. 3. As can be seen, unlike the LSTM, the model accurately predicts that a turn-shift is more likely after "tomorrow" than after "yesterday", given the context. This also illustrates how TRPs can
be modelled using a more probabilistic (rather than binary) notion.
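As an illustration of how a pretrained language model can provide such context-sensitive turn-shift probabilities, the sketch below uses an off-the-shelf GPT-2 model and takes the probability that the next token starts a new line (i.e., a new speaker in the toy transcript format) as a rough proxy. This is not the actual TurnGPT model of Ekstedt and Skantze (2020), which fine-tunes the model with explicit speaker tokens; the transcript format and the newline proxy are assumptions made only for this example.

# Sketch: using a generic language model (GPT-2 via HuggingFace transformers) to
# estimate, after each token, how probable a speaker change is. A newline is used
# as a crude stand-in for a turn shift; TurnGPT instead fine-tunes the model with
# dedicated speaker-shift tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def turn_shift_proxy(transcript):
    ids = tokenizer(transcript, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits[0]                # (seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    newline_id = tokenizer("\n")["input_ids"][0]     # proxy "turn shift" token
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
    return [(tok, probs[i, newline_id].item()) for i, tok in enumerate(tokens)]

# Example (1) from Section 3.1, with one speaker per line:
dialogue = "A: yesterday we met in the park\nB: okay when will you meet again\nA: tomorrow"
for token, p in turn_shift_proxy(dialogue):
    print(f"{token!r:>15}  P(shift next) = {p:.3f}")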
Common verbal turn-holding or turn-initial cues are so-called fillers or filled pauses, such as "uh" or "um" (Ball, 1975). These often indicate that the speaker is thinking about what to say next. An interesting question is whether these vocalisations are simply a symptom of a hesitation in the speech production process, or whether they are used more deliberately, and whether different lexical and prosodic realisations of these fillers have different meanings. Clark and Fox Tree (2002) argue for the latter interpretation. In their view, speakers monitor their own speech plan for upcoming delays, and use fillers (as if they were words) to comment on these delays. Based on manual annotation of a speech corpus, they conclude that "uh" is used to signal a shorter delay, whereas "um" signals a longer delay. However, O'Connell and Kowal (2005) argue against such an interpretation. Using acoustic measurements to analyse a dialogue corpus, they did not find any relationship between the choice of "uh" or "um" and subsequent delays, suggesting that they do not have different meanings. Nevertheless, regardless of their intentional status, fillers still likely serve as important turn-holding and turn-initial cues for the listener.
3.2. Prosody
The role of prosody in turn-taking has been the subject of much interest and dispute. Prosody refers to the non-verbal aspects of speech, including intonation, loudness, speaking rate and timbre. It has been found to serve many important functions in con-
versation, including the marking of prominence, syntactic disambiguation, attitudinal reactions, uncertainty, and topic shifts
(Ward, 2019). As we saw in the discussion on pragmatic completion, Ford and Thompson (1996) included intonation in their def-
inition of the term. When it comes to intonation, studies across various languages have found that level intonation (in the middle
of the speaker's fundamental frequency range) near the end of an IPU tends to serve as a turn-holding cue, including English (Dun-
can, 1972; Local et al., 1986; Gravano and Hirschberg, 2011), German (Selting, 1996), Japanese (Koiso et al., 1998) and Swedish
(Edlund and Heldner, 2005). Complementary to this, studies of English and Japanese have found that either rising or falling pitch
can be found in turn-yielding contexts (Gravano and Hirschberg, 2011; Local et al., 1986; Koiso et al., 1998). However, studies of
Fig. 3. Turn-shift probability, as predicted with the transformer-based model TurnGPT (vs. an LSTM model). Some words are split into sub-words. Figure from
Ekstedt and Skantze (2020).
Swedish have found that while falling pitch is a turn-yielding cue, rising pitch is not clearly associated with either turn-holds or
turn-shifts (Edlund and Heldner, 2005; Hjalmarsson, 2011).
Gravano and Hirschberg (2011) also looked at intensity in English dialogue, and found that speakers tend to lower their voi-
ces when approaching potential turn boundaries, whereas turn-internal pauses had a higher intensity. Similar patterns were
found by Koiso et al. (1998) in Japanese dialogue, where low or decreasing energy was associated with turn change, and non-
decreasing energy was associated with turn hold.
Regarding duration and speaking rate, findings seem to be mixed. Duncan (1972) found "a drawl on the final syllable or on the stressed syllable of a terminal clause" to be a turn-yielding cue (in English). This is also in line with the findings of Local et al. (1986). However, Gravano and Hirschberg (2011) found that final lengthening tends to occur at all phrase-final positions, not just at turn endings. If anything, final lengthening seemed to be more prominent in turn-medial IPUs than in turn-final ones. In an analysis of Japanese task-oriented dialogue, Koiso et al. (1998) also found longer duration to be associated with turn hold. In a listening task where participants were asked to predict turn-shifts in Swedish, Hjalmarsson (2011) did not find final
lengthening to result in any consistent predictions.
Gravano and Hirschberg (2011) also examined voice quality, as measured through jitter, shimmer and noise-to-harmonics
ratio (NHR), and found these acoustic features to potentially serve as turn-taking cues. According to Ward (2019), creaky voice is
also commonly found before turn-shifts.
Several studies have also examined how prosody can serve as a backchannel-inviting cue, and how this differs from
turn-yielding cues. Ward (1996) investigated a corpus of Japanese conversations to find predictive cues for backchannels. He found that backchannels tended to come about 200ms after a region of low pitch. Gravano and Hirschberg (2011), on the other hand, found that IPUs immediately preceding backchannels showed a clear tendency towards a final rising intonation, as well as higher intensity (i.e., opposite to what they found to be a turn-yielding cue). These somewhat contradictory findings could perhaps be explained by language differences (Japanese vs. American English), or the fact that
Ward (1996) looked at overlapping backchannels in unconstrained conversations, whereas Gravano and Hirschberg (2011)
looked at backchannels coming after an IPU in task-oriented dialogue. In an experiment where participants were given
the task of dictating number sequences to each other, Skantze and Schlangen (2009) found that they segmented the longer number sequence into installments (Clark, 1996), where each installment ended with a rising pitch, which seemed to invite brief feedback from the listener (usually something similar to a backchannel, but potentially also clarification
requests), and then ended the full sequence with falling pitch.
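To illustrate how such a prosodic backchannel-inviting cue could be operationalised in a system, the sketch below implements a simple rule in the spirit of Ward's (1996) finding: propose a backchannel shortly after a sustained region of low pitch. The specific thresholds (pitch percentile, region length, delay) are illustrative assumptions, not the values reported in the cited studies.

# Sketch: rule-based detection of backchannel-relevant places (BRPs) from a pitch
# track, loosely inspired by the "low pitch region" finding of Ward (1996).
# All thresholds below are illustrative assumptions.
import numpy as np

FRAME_S = 0.01             # assumed pitch-frame length (10ms)
LOW_PITCH_PERCENTILE = 25  # "low" = below this percentile of the speaker's F0
MIN_REGION_S = 0.15        # required duration of the low-pitch region
DELAY_S = 0.2              # propose the backchannel ~200ms later (cf. the figure above)

def backchannel_relevant_places(f0):
    """f0: array of pitch values per frame (0 or NaN = unvoiced).
    Returns times (in seconds) at which a backchannel could be proposed."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[np.isfinite(f0) & (f0 > 0)]
    low_threshold = np.percentile(voiced, LOW_PITCH_PERCENTILE)
    brps, run = [], 0
    for i, value in enumerate(f0):
        if np.isfinite(value) and 0 < value < low_threshold:
            run += 1
            if run * FRAME_S >= MIN_REGION_S:
                brps.append((i + 1) * FRAME_S + DELAY_S)
                run = 0        # do not fire again for the same region
        else:
            run = 0
    return brps

# Toy pitch contour: normal pitch, then a stretch of low pitch, then normal again.
contour = [200] * 100 + [110] * 20 + [200] * 50
print(backchannel_relevant_places(contour))   # one BRP shortly after the low-pitch region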
As discussed earlier, since turn-final cues (by themselves) cannot explain rapid turn-shifts, it is not clear what role these prosodic features have for turn-taking. Ward (2019) goes so far as to call the importance of turn-final prosody for turn-taking a popular myth, and argues that "for most turn exchanges, final prosody can play no role in the turn-start decision" (p. 145). However, he also notes that turn-final prosody might still be decisive if the speaker "has a response pre-prepared and only needs to decide whether to deploy it or not" (p. 210). Heldner and Edlund (2010) argue that even though many turn-shifts occur with very little
gap (or even overlap), which might require prediction, there are still a considerable amount of turn-shifts that have a longer gap,
where prosody could still play an important role.
It is also not clear to what extent prosody provides additional information compared to verbal cues, or if they are redundant.
In an experiment by de Ruiter et al. (2006), subjects were asked to listen to a conversation and press a button when they antici-
pated a turn ending. The speech signal was manipulated to either flatten the intonational contour, or to remove verbal information by low-pass filtering. The results showed that the absence of intonational information did not reduce the subjects' prediction performance significantly, but that the subjects' performance deteriorated significantly in the absence of verbal infor-
mation. From this, they concluded that verbal information is crucial for end-of-turn prediction, but that intonational information
is neither necessary nor sufficient.
As we saw in the discussion on verbal cues in Section 3.1 above, Ford and Thompson (1996) included the requirement of a
"final intonation contour" in their definition of pragmatic completion. This is a more complex (and subjective) notion than the
prosodic features discussed above, and involves a longer prosodic gesture over an intonational unit, but it can to some extent be
related to the falling (or rising) final pitch others have identified as turn-yielding. Thus, in their account, prosody helps to filter out syntactically complete units which are also pragmatically complete (and thereby TRPs).
If prosody is specifically useful in syntactically ambiguous places, this can help to explain the findings of de Ruiter et al. (2006) mentioned above, as the stimuli used might not have involved such ambiguities. Indeed, Bögels and Torreira (2015) performed a
similar experiment, but selected the stimuli so that they contained several pragmatic completion points, and where the intona-
tion phrase boundary provided additional cues to whether they were actual TRPs. They found that subjects made better predic-
tions when the intonation was intact.
Taken together, these studies indicate that prosody can play an important role in cases where pragmatic completion in itself is
not sufficient. They do not say, however, how common these types of ambiguities are, and how important prosody is overall for
turn-taking. One way of addressing this question is to look at turn-taking prediction models and see how much prosodic features
contribute to predictive performance, compared to verbal features. Koiso et al. (1998) investigated the relationship between syn-
tax and prosody for predicting turn shifts in task-oriented Japanese dialogue. Using a C4.5 decision tree, they explored how the
predictive power of the model changed when various syntactic and prosodic features were added or removed. The results
showed that some of the individual syntactic features had very strong contributions, which was not the case for individual pro-
sodic features. However, if all prosodic features were removed from the model, the performance dropped as much as when
removing all syntactic features. In a similar way, Gravano and Hirschberg (2011) investigated American English task-oriented
dialogue and trained a multiple logistic regression model to predict turn-shifts. The most important cue in their model was tex-
tual completion, followed by voice quality, speaking rate, intensity level, pitch level and IPU duration. However, when using all
features, they did not find any significant contribution of intonation to the general predictive power of the model. One caveat in these types of studies is of course that these models do not encode and use prosody and verbal aspects in the same way as humans; especially pragmatic aspects are virtually non-existent, as we saw in Section 3.1. Therefore, the general contribution of
verbal and prosodic cues to turn-taking is still very much an open question.
Regardless of the role of prosody in turn-taking between humans, prosody might provide important cues from the per-
spective of a conversational system. Since conversational systems do not have the same computational/cognitive con-
straints and might not have to prepare the response in advance to the same extent as humans, they could make use of
turn-final cues to a larger extent. Also, whereas prosodic features can be extracted fairly reliably in a continuous fashion
(Eyben et al., 2010), verbal features rely on an Automatic Speech Recogniser (ASR), which introduces a certain delay and
ASR errors, as will be further discussed in Section 4.2 below. For these reasons, prosody might be more important for con-
versational systems than for humans. However, as we have seen, the coordinative functions of prosody seem to vary
somewhat between languages and settings.
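As an illustration of the kind of continuous prosodic feature extraction referred to above (for which dedicated toolkits such as the one described by Eyben et al., 2010, are commonly used), the sketch below extracts a turn-final pitch slope and intensity level from the last part of an IPU. The choice of library (librosa), the window length and the particular features are assumptions made for this example, not a description of any specific system.

# Sketch: simple turn-final prosodic features (pitch slope and intensity) from the
# last 500ms of an IPU. librosa is used here purely for illustration; window length
# and feature choices are illustrative assumptions.
import numpy as np
import librosa

def turn_final_prosody(wav_path, final_window_s=0.5):
    y, sr = librosa.load(wav_path, sr=16000)
    y_final = y[-int(final_window_s * sr):]            # last 500ms of the IPU

    # Fundamental frequency (F0) with pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y_final, fmin=75, fmax=400, sr=sr)
    voiced = np.flatnonzero(np.isfinite(f0))
    if len(voiced) > 1:
        # Linear fit over voiced frames: a negative slope indicates falling pitch.
        slope = np.polyfit(voiced, f0[voiced], deg=1)[0]
    else:
        slope = 0.0

    # Intensity approximated by the mean RMS energy (in dB) over the window.
    rms = librosa.feature.rms(y=y_final)[0]
    intensity_db = float(np.mean(librosa.amplitude_to_db(rms)))

    return {"f0_slope": float(slope), "intensity_db": intensity_db}

# In many of the studies above, falling pitch and lowered intensity at the end of an
# IPU would point towards a turn-yielding rather than a turn-holding cue:
# print(turn_final_prosody("user_ipu.wav"))   # hypothetical audio file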
3.3. Breathing
Breathing is intuitively linked to turn-taking, as we typically breathe in before starting to speak, which means that an (audible
and/or visible) in-breath might serve as a cue that the speaker intends to speak in the near future. In a study on breathing in con-
versation, McFarland (2001) found increased expiratory duration before speech onset at turn-shifts, which may reflect the prepa-
ration of the respiratory system for speech production.
Rochet-Capellan and Fuchs (2014) also investigated the role of breathing as a coordination cue. They found no global relation-
ship between breathing and turn-taking rates and no signs of general breathing synchronisation (temporal alignment of breath-
ing) between the participants. However, they did find a local relationship between turn-taking and breathing. When turn-taking
was successful, speech onset was generally well timed with inhalation events. When a participant took a breath and tried to take
the turn but failed, they shortened their breathing cycle. At turn-holds, the speaker also inhaled (although less so than at the
beginning of the turn), indicating that they would like to continue speaking.
Torreira et al. (2015) examined inbreaths occurring right before answering a question, and found that they typically begin
briefly after the end of questions. These inbreaths are also associated with substantially delayed answers (576ms vs. 100ms for
answers not preceded by an inbreath), as well as longer answers. This indicates that breathing is linked to the planning of the
response, and also means that breathing could be regarded as a turn-initial cue, showing that the next speaker has detected a
turn-end and is preparing a response.
Ishii et al. (2014) examined breathing in multi-party interactions and found that when a speaker is holding the turn,
she inhales more rapidly and quickly than when yielding the turn. The speaker who is about to take the turn tends to
take a deeper breath compared to listeners who are not about to speak. Thus, it is possible that breathing helps to coordi-
nate self-selection.
Most studies on breathing have used various forms of invasive measuring equipment, such as elastic strain gauges that
measure movements of the rib cage (McFarland, 2001). However, for breathing to play a role as a cue for coordination in
conversation, it needs to be audible and/or visible to the other participants. Włodarczak and Heldner (2016) examined
acoustic inhalation intensity (as picked up by a microphone) as a cue to speech initiation and found inhalations preceding
speech to be louder than those in tidal breathing and before backchannels. However, while judges have been shown to be
able to perceive respiratory pauses in speech (Wang et al., 2012), it is not known to what extent listeners in regular conver-
sational settings perceive breathing.
3.4. Gaze
In face-to-face interaction, eye gaze has been found to serve many important communicative functions, including referring to
objects, expressing intimacy, dominance, and embarrassment, as well as regulating proximity and turn-taking (Argyle and
Cook, 1976). It is of course important to note that eye gaze cannot be understood purely as a signaling device, without also con-
sidering its main function: to direct our visual attention and perceive the world around us. However, as humans, we have learned
through evolution that the eye gaze of others (and by extension their attention) provides a rich source of information for us to
coordinate our activities with each other (Tomasello et al., 2007). When two or more people engage in interaction, they typically
position themselves in a way that facilitates the monitoring of each other's gaze direction (Schneider and Goffman, 1964). It is
therefore possible that we have also learned that eye gaze (to some extent) can be directed to achieve a certain communicative
effect.
One of the first extensive analyses of the role of eye gaze in turn-taking was done by Kendon (1967), through observations of
video recordings of dyadic interactions. A general pattern he observed is that the speaker tends to look away at the beginning of
the turn, but then shift the gaze towards the listener at the end of the turn. At the same time, the listener looks at the speaker for
the most part of the turn, but looks away as the turn is being completed and as the turn shifts. If the current speaker pauses with-
out yielding the turn, she is likely to keep looking away, but then look back as the utterance is being resumed. If the listener starts
to look away as the speaker's turn is being completed, it can serve as a turn-initial cue, i.e., that she is assuming that the turn is
soon yielded and that she is preparing to take the turn. Another finding was that the listener tends to look at the speaker most of
the time, and only looks away briey, whereas the speaker shifts the gaze between the listener and to some other target with
about equal lengths. Thus, the listener in general looks more at the speaker than the other way around. Mutual gaze between the
participants is seldom maintained for longer than a second.
Similar findings have been reported in several studies (Goodwin, 1981; Oertel et al., 2012a; Jokinen et al., 2010). However, it is
also clear that even if these are general patterns, there are other perceptual, communicative and social factors involved, which
means that there is a lot of variation in them. There are also large individual differences (Cummins, 2012).
Bavelas et al. (2002) also examined gaze behaviour around backchannels in dyadic interactions, where one person was telling
a story to the other. In general, their analysis confirmed Kendon's finding that the listener looks more at the speaker than the
reverse, but they also found that the speaker looked at the listener at key points during their turn to seek a response. At these
points, the listener was very likely to respond with a verbal or non-verbal backchannel, after which the speaker quickly looked
away and continued speaking.
As discussed in Section 2.4 above, gaze also serves an important role for addressee selection as well as next-speaker selection
in multi-party interaction (Auer, 2018; Jokinen et al., 2013; Ishii et al., 2016). In this way, gaze towards a participant serves both
as a turn-yielding cue and as a signal for selecting the next speaker in multi-party interaction. When addressing several people,
the current speaker may alternately look at the co-participants they want to address, but then end the turn by looking at the
selected next speaker (ibid.). While people do not always conform to this selection, the gazed-at participant will take the turn
more often than not. If the targeted person does not respond, a sustained gaze at that person is particularly efficient to elicit a
response (ibid.). If the targeted person wants to avoid taking the turn, they can either pass on the turn by gazing at a third partici-
pant, or they can reject the offered turn by gazing away and thereby open up the conversational floor for any other participant to
self-select (Weiss, 2018). Zima et al. (2019) investigated the role of gaze for competitive overlap resolution in triadic interactions.
They found that when two speakers started to speak at the same time, the prevailing speaker averted their gaze away from
the competing speaker. The speaker who withdrew from the competition instead maintained their gaze at the prevailing
speaker. The third party often singled out the prevailing speaker during the overlap by either keeping their gaze on her or shifting their gaze towards her.
In situated interaction, there might also be objects that the speakers refer to, especially in task-oriented settings. When
referring to objects, speakers naturally attend to them. The speaker's gaze can therefore be used by the listener as a cue to
the speaker's current focus of attention, so-called joint attention (Velichkovsky, 1995). This has been shown to clearly affect
the extent to which humans otherwise gaze at each other when speaking and shifting turns (Argyle and Graham, 1976). In a
study on modelling turn-taking in three-party poster conversations, it was found that the participants almost always looked
at the shared poster and very little at each other (Kawahara et al., 2012). In a Wizard-of-Oz study on multi-party human-
robot interaction, where the participants discussed objects on a table between them, Johansson et al. (2013) found that turn
shifts often occurred without the speakers looking at each other. This might of course affect the usefulness of gaze as a turn-
taking cue in such settings.
3.5. Gestures
In the analysis of turn-taking cues by Duncan (1972), certain gestures seemed to have a very strong turn-holding effect (and
even negate other turn-yielding cues). The listener almost never attempted to take the turn while the speaker performed certain
forms of gesticulation, including a tense hand position or hand movements away from the body. The observation that the com-
pletion of hand gestures can serve as a turn-yielding cue has also been confirmed in other studies (Zellers et al., 2016).
Holler et al. (2018) investigated how bodily signals influence language processing in interaction. They found that questions
accompanied by gestures resulted in faster response times compared to questions without gestures. The response timing also
seemed to be oriented with respect to the termination of the gesture. Thus, it seems like gestures help the listener to predict
turn-endings. Sikveland and Ogden (2012) also found speakers to temporarily freeze and hold their gesture when some other
participant provided some kind of mid-turn clarification or feedback, after which their turn and gesture were resumed.
In the analysis of Streeck and Hartge (1992), gestures by the listener can be used as an indication that they want to get the
floor (early turn-incursion while avoiding overlap), and as a turn-initial cue. Mondada (2007) investigated turn-taking in a
multi-party, situated interaction, involving objects (such as a map) on a shared table. In this setting, pointing gestures (with a finger
or a pen), or stretching, towards these objects were also found to serve as a turn-initial cue, signalling an interest in self-
selecting the turn. This movement (which typically involves the whole upper body) is often initiated before the completion of
the preceding turn, and can thus be used for projection by the other participants.
3.6. Summary
As this review has shown, turn-taking cues across different modalities can be both redundant and complementary. The com-
bination of several cues can lead to more accurate recognition or prediction of the partner's intentions, which might help to
explain why many people prefer face-to-face interaction. Especially for conversational systems, where the recognition of these
subtle cues is challenging, the combination of different cues may increase robustness. As discussed before, this is an argument for
why social robots might offer better interaction opportunities than voice assistants (Skantze, 2016).
Verbal cues are arguably the most important ones for humans, and provide a stronger basis for prediction, especially when
taking the larger dialogue context into account. However, they are also very hard to model, especially in conversational systems,
partly due to the error-proneness and delay of ASR output, and partly due to lack of more sophisticated pragmatic modelling.
Thus, prosodic cues can complement verbal cues and help when there are ambiguities. Gaze can be a strong cue, especially in
multi-party interaction. However, this requires some form of embodiment that makes it natural to look at the agent in the same
way we look at each other during the conversation. Also, if the conversation involves objects in the surroundings, this might sig-
nificantly decrease the extent to which we look at each other to regulate turn-taking. The use of gestures and breathing for turn-
taking in conversational systems and human-robot interaction has so far attracted less attention, and would be worth exploring.
In the next section, we will discuss attempts at utilising these cues for determining when the system should and should not speak.
4. End-of-turn detection and prediction
Arguably the most studied aspect of turn-taking in conversational systems is how to determine when the user's turn is
yielded and the system can start to speak (i.e. the detection of TRPs). A related aspect is to determine when the system should
give a backchannel (i.e., the detection of BRPs, as discussed in Section 2.3). In this section, we will review attempts at developing
such models, and also include work done on allowing the system to project turn completion, perform sentence completion, or
other forms of overlapping speech, even if these are not very common yet. The use of more explicit cues, such as push-to-talk,
will not be covered in this review.
We have divided these approaches into three types (ordered from simpler to more advanced), illustrated in Fig. 4:
- Silence-based models. The end of the user's utterance is detected using a VAD. A silence duration threshold is used to determine when to take the turn.
- IPU-based models. Potential turn-taking points (IPUs) are detected using a VAD. Turn-taking cues in the user's speech are processed to determine whether the turn is yielded or not (potentially also considering the length of the pause).
- Continuous models. The user's speech is processed continuously to find suitable places to take the turn, but also for identifying backchannel-relevant places (BRPs), or for making projections.
Fig. 4. End-of-turn detection and prediction models.

These models are sometimes referred to as end-of-turn detection models (or simply "endpointing") and sometimes as end-of-turn prediction models. De Kok and Heylen (2009) argue that the former term should be associated with a more reactive account of turn-taking and the latter with a more "predictive" account. Furthermore, both Schlangen (2006) and De Kok and Heylen (2009) argue that silence-based models should be associated with a reactive account, whereas models that take turn-taking cues into account should be considered predictive. We think this distinction is somewhat misleading and would like to reserve
the term "prediction" for models that are associated with predictive mechanisms and projections in turn-taking (which was dis-
cussed in Section 2.2). We argue that both IPU-based models and silence-based models can be considered reactive, in the sense
that they react to past cues (whether silence-based or more sophisticated) in order to make a decision for the current point-in-
time. We will thus use the term end-of-turn detection to refer to such models and the term end-of-turn prediction for models that
predict an upcoming turn-completion (i.e., one that has not occurred yet), which is perhaps only feasible with continuous models.
4.1. Silence-based models
In most ASR systems, an end silence duration threshold is often used to determine the end of the speech segment which is to
be transformed to text, often relying on a VAD. The VAD is in turn typically based on energy and possibly spectral features (to dis-
tinguish speech from background noise) in the incoming audio. Thus, when implementing a conversational system, it is conve-
nient to use this mechanism for determining the end of the users turn. This approach is still being used in many conversational
systems, and was for example assumed to be used in the VoiceXML standard, developed by W3C (McGlashan et al., 2004). Typi-
cally, two main parameters (thresholds) can be tuned by the application developer, which regulate the turn-taking behaviour of
the system, as illustrated in Fig. 5(a) and in the code sketch following the list:
- After the system has yielded the turn, it awaits a user response, allowing for a certain silence (a gap). If this silence exceeds the no-input-timeout threshold (such as 5 s), the system should continue speaking, for example by repeating the last question.
- Once the user has started to speak, the end-silence-timeout (such as 700 ms) marks the end of the turn. As the figure shows, this allows for brief pauses (shorter than the end-silence-timeout) within the user's speech.
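To make this logic concrete, the following minimal sketch (in Python, with a generic vad.is_speech() interface and hypothetical constant values) illustrates how the two thresholds govern the decision loop; it is an illustration of the behaviour described above, not the implementation of any particular system:

```python
# Minimal sketch of a silence-based turn-taking loop (hypothetical VAD interface).
NO_INPUT_TIMEOUT = 5.0      # seconds of silence before the system re-prompts
END_SILENCE_TIMEOUT = 0.7   # seconds of silence that ends the user's turn
FRAME = 0.02                # 20 ms audio frames

def await_user_turn(vad, audio_frames):
    """Return 'no-input' if the user never starts speaking, 'end-of-turn' otherwise."""
    silence = 0.0
    user_has_spoken = False
    for frame in audio_frames:
        if vad.is_speech(frame):
            user_has_spoken = True
            silence = 0.0
        else:
            silence += FRAME
            if not user_has_spoken and silence >= NO_INPUT_TIMEOUT:
                return "no-input"      # e.g. repeat the last question
            if user_has_spoken and silence >= END_SILENCE_TIMEOUT:
                return "end-of-turn"   # hand the segment to the ASR and respond
    return "end-of-turn"
```

Tuning END_SILENCE_TIMEOUT then amounts to the trade-off discussed next: interrupting the user within pauses versus appearing unresponsive.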
This basic model might work well when the user's turns are expected to be fairly brief and when the user knows what to say
to the system. However, it often results in two kinds of problems: if the end-silence-timeout is too short, the system might inter-
rupt the user within pauses (as illustrated in Fig. 5(b)). If it is too long, the system will be perceived as unresponsive, and the
user might not understand that the system is about to respond (as illustrated in Fig. 5(c)). To minimise these problems, it is
important to carefully tune the end-silence-timeout threshold, depending on the domain of the system (Witt, 2015). It is also
possible to use different thresholds depending on the preceding system utterance or dialogue state. For example, a yes/no ques-
tion can have a shorter threshold than an open question, as the user is more likely to answer with something brief that does not
contain pauses. However, as discussed earlier, studies on human-human turn-taking have found that pauses are on average
longer than gaps. Thus, it is most often impossible to find a perfect threshold that completely alleviates these two problems, and
systems based on this simplistic model will be plagued by a certain amount of turn-taking issues (Ward et al., 2005; Raux et al.,
2006).

Fig. 5. Silence-based model of turn-taking.
4.2. IPU-based models
Given the limitations of pure silence-based models described above, several researchers have investigated how turn-taking
cues could be incorporated in the end-of-turn detection. Based on the assumption that the system should not start to speak while
the user is speaking, a common approach has been to detect the end of IPUs in the user's speech using a VAD (in this way similar
to the silence-based model outlined above, but potentially using a much shorter silence threshold, such as 200ms). After the end
of an IPU has been detected, the system uses the turn-taking cues detected from the user to determine whether there is a TRP or
not, as illustrated in Fig. 4. If these TRPs are correctly identified, the system can take the turn with very small gaps, while avoiding
interrupting the user at non-TRPs.
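The following sketch illustrates this general two-stage logic; the silence threshold, the feature set and the classifier interface are placeholders rather than any specific published model:

```python
# Sketch of an IPU-based model: a short silence threshold detects candidate
# turn-taking points, after which a cue-based classifier decides whether to take the turn.
IPU_SILENCE = 0.2  # seconds of silence that ends an inter-pausal unit (assumed value)

def on_ipu_end(features, classifier, threshold=0.5):
    """Called when the VAD has detected IPU_SILENCE of silence after user speech.

    `features` could include prosody of the final words, lexical/syntactic
    completion from the (incremental) ASR, and dialogue state; `classifier`
    is a placeholder for any model returning P(turn-yield | features).
    """
    p_yield = classifier.predict_proba(features)
    if p_yield >= threshold:
        return "take-turn"       # the IPU ended at a TRP
    return "keep-listening"      # treat the silence as a pause within the turn
```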
An early example of an IPU-based model was Bell et al. (2001), who used a rule-based approach (a semantic parser) for deter-
mining semantic completion after end-of-speech was detected and the ASR reported a result. If the utterance was determined
to be incomplete, the system refrained from taking the turn and the ASR continued to listen for more speech. A more data-driven
approach, also taking other cues into account, was proposed by Sato et al. (2002), who used a decision tree to classify pauses lon-
ger than 750 ms, based on features from semantics, syntax, dialogue state, and prosody. Their model achieved an accuracy of
83.9%, compared to the baseline of 76.2%. Similar models have been explored by Schlangen (2006) and Meena et al. (2014), using
shorter silence thresholds.
A problem with using a fixed silence threshold in this approach, after which the decision has to be made, is that it is unclear
what will happen if the IPU is misclassified as a pause. Even if a TRP is not detected immediately after an IPU, the system should
continue to consider whether there is a TRP as the silence progresses. Intuitively, the longer the silence, the more likely it is that
the user is indeed yielding the turn. Ferrer et al. (2002) trained a decision tree classifier based on prosodic features (pitch and
duration), as well as n-grams of the final words, but also conditioned their model on the length of the pause after the IPU. The
model was then applied continuously throughout the pause, with an increasing likelihood of a turn-shift (as illustrated in Fig. 4).
Raux and Eskenazi (2008) formulated the problem somewhat differently. Instead of conditioning the model on pause length, the
system used a model to predict a threshold to be used for determining when to take the turn after an IPU was detected. Thus, in
case turn-holding cues were detected, more time would be allowed for the user to start speaking again, whereas if turn-yielding
cues were detected, only a brief (or no) silence passed before the system took the turn. A similar approach, using deep learning
(LSTM), was proposed by Maier et al. (2017).
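A rough sketch of the pause-length-conditioned variant, in the spirit of the approach just described but with invented interfaces and parameter values, could look as follows:

```python
# Sketch of conditioning the decision on elapsed pause length: the same cue-based
# classifier is re-applied as the silence grows, so the turn can be taken either
# right after a clearly turn-yielding IPU or later, once the silence itself
# becomes additional evidence that the turn has been yielded.
def monitor_pause(features, classifier, step=0.05, max_pause=2.0, threshold=0.5):
    elapsed = 0.0
    while elapsed <= max_pause:
        p_yield = classifier.predict_proba(dict(features, pause_length=elapsed))
        if p_yield >= threshold:
            return elapsed        # take the turn after this much silence
        elapsed += step
        # (in a real system this loop runs in real time and is aborted
        #  as soon as the user resumes speaking)
    return None                   # no TRP detected; keep listening
```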
A general question when training IPU-based models is what target labels to use for detecting TRPs. When using a human-
human dialogue corpus, it is of course possible to use the actual turn shifts as target labels. It is however important to note that
humans do not always take the turn at TRPs, and sometimes at non-TRPs. Also, models trained on human-human dialogue might
not necessarily transfer so easily to human-computer dialogue, as they are a bit different in nature. Using human-computer dia-
logue data is also problematic if the system used to collect the data employed a more simplistic turn-taking model (as we would
not want the new model to learn from this). One approach to solve this is to use bootstrapping. First, a more simplistic model of
turn-taking is implemented in a system and interactions are recorded. Then, the data is manually annotated with suitable TRPs,
and a machine learning model is trained (cf. Raux and Eskenazi 2008; Meena et al. 2014; Johansson and Skantze 2015). Another
approach is to use a Wizard-of-Oz setup, where a hidden operator controls the system and makes the turn-taking decisions
(Johansson et al., 2016; Maier et al., 2017). As the Wizard is expected to take the turn at appropriate places, the data can be used
directly to train a model. A potential problem with this approach is that it can be hard for the Wizard to achieve very precise tim-
ing.
Another approach is to use reinforcement learning. Jonsdottir et al. (2008) showed how two artificial agents could develop
turn-taking skills by talking to each other, learning to pick up each other's prosodic cues. In the beginning, pause durations were
too short and overlaps were frequent. However, after some time, interruptions were less frequent and the general turn-taking
patterns started to resemble those of humans. Selfridge and Heeman (2010) presented an approach where turn-taking was mod-
elled as a negotiative process. In this model, each participant "bids" for the turn, based on the importance of the intended utter-
ance, and reinforcement learning was used to indirectly learn the parameters of the model. Khouzaimi et al. (2015) used
reinforcement learning to learn a turn-taking management model in a simulated environment, with the objective of minimising
the dialogue duration and maximising the task completion ratio. However, it is perhaps not clear to what extent such a model
transfers to interactions with real users. Ideally, the system should be able to learn to detect the end of the turn through interac-
tion with users.
Looking at the cues that have been found to be useful for turn-taking detection in the context of a dialogue system, different
studies have come to different conclusions, likely depending on the domain of the system and how well these turn-taking cues
are recognised and modelled. Whereas Sato et al. (2002) and Meena et al. (2014) did not find prosody to contribute significantly
to the detection, Ferrer et al. (2002) and Schlangen (2006) found both syntactic and prosodic features to improve detection per-
formance. As we saw in Sections 3.4 and 3.5, face-to-face interaction also allows for visual turn-taking cues, which were used in
the model presented in De Kok and Heylen (2009). Johansson and Skantze (2015) built an IPU-based model for multi-party
human-robot interaction, and investigated the use of words, prosody, head pose (as a proxy for gaze), dialogue context, and the
movement of objects on the table. All these cues were in themselves informative, but the best performance was achieved by com-
bining them, which is in line with findings in the linguistic literature.
A challenge when using verbal features in dialogue systems is speech recognition (ASR) errors. This is especially troublesome
when considering the syntactic or pragmatic completion of the nal part of the utterance, since ASR language models are typi-
cally trained on syntactically complete sentences, and therefore might miss important turn-holding cues at the end of the utter-
ance, such as a preposition ("I would like to go to...") or a filled pause. Some speech recognition vendors (like Google at the time
of writing this review) do not even report filled pauses (which are typically very strong turn-holding cues) in the user's speech.
Meena et al. (2014) explored the effect of ASR errors on their IPU-based model. Whereas prosody had very little contribution to
the performance when the ASR was perfect, it did contribute when ASR performance degraded.
4.3. Continuous models
Most conventional dialogue systems (either silence- or IPU-based) typically process the user's speech one utterance (or one
IPU) at a time. A VAD is used to detect the end of the utterance, after which it is processed, one module at a time, and a system
response is generated (unless the system refrains from taking the turn), as illustrated in the left pane in Fig. 6. In addition to the
delay caused by the silence threshold, the processing time of each subsequent module adds to the response delay. An alternative
approach is to process the dialogue incrementally, as shown in the right pane. This means that the input from the user is proc-
essed in increments (e.g. frame-by-frame or word-by-word), and that all modules start to process the input as soon as possible,
passing on their incremental results to subsequent modules (Schlangen and Skantze, 2009).
Incremental processing allows the system to process the user's utterance on a deeper level (also incorporating task-related
aspects), and make continuous turn-taking decisions. Thus, the system can potentially also project turn-completions, start
planning what to say next, find suitable places for overlapping backchannels, and even decide to interrupt the user. All this is
impossible with an IPU-based model. An early example of a fully incremental dialogue system was presented by Skantze and
Schlangen (2009), although it was constrained to number dictation. While the user was reading out the numbers, the system
could start to prepare responses and give very rapid feedback, based on continuous processing of the user's speech (including
prosodic turn-taking cues). One important challenge of incremental systems identied by Schlangen and Skantze (2009) is the
need for revision, as tentative hypotheses of what the user is saying might change as more speech is being processed. For exam-
ple, the word "four" might be amended with more speech, resulting in a revision to the word "forty". If subsequent modules have
already started to produce output based on the earlier hypotheses, they might also have to make revisions in their output,
potentially causing a cascade of revisions. Ultimately, if the system has already started to speak when a revision occurs, it might
have to synthesize self-repairs, as explored by Skantze and Hjalmarsson (2013).
Skantze (2017b) proposed a general continuous turn-taking model, which was trained on human-human dialogue data in a
self-supervised fashion. The audio from both speakers was processed frame-by-frame (20 frames per second), and an LSTM was
trained to predict the speech activity for the two speakers for each frame in a future 3 s window. Thus, the model was not trained
specically for end-of-turn detection. However, when applied to this task, the model performed better than more conventional
baselines, and better than human judges given the same task. The model was also able to make predictions about utterance
length at the onset of the utterance, which could potentially be useful for distinguishing attempts to take the turn from shorter
backchannels. A similar model was implemented by Ward et al. (2018) and Roddy et al. (2018a), who also looked more deeply
into the different speech features that can help the prediction. Roddy et al. (2018b) proposed an extension of the architecture,
where the acoustic and linguistic features were processed in separate LSTM subsystems with different timescales.
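A minimal sketch of such a continuous predictor, written here with PyTorch and with assumed hyperparameters (50 ms frames, a 3 s prediction window, a single LSTM layer), is given below; the architectures in the cited papers differ in their details:

```python
# Sketch (PyTorch assumed) of a continuous turn-taking predictor in the spirit of
# Skantze (2017b): for every 50 ms frame, predict the voice activity of both
# speakers over the next 3 seconds (60 future frames each).
import torch
import torch.nn as nn

class ContinuousTurnTaking(nn.Module):
    def __init__(self, n_features, hidden=128, future_frames=60, n_speakers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, future_frames * n_speakers)
        self.future_frames = future_frames
        self.n_speakers = n_speakers

    def forward(self, x):
        # x: (batch, time, n_features) -- per-frame acoustic (and possibly lexical) features
        h, _ = self.lstm(x)
        probs = torch.sigmoid(self.out(h))
        # (batch, time, n_speakers, future_frames): predicted future voice activity
        return probs.view(x.size(0), x.size(1), self.n_speakers, self.future_frames)
```

Training such a model is self-supervised: the targets are simply the observed voice activity of the two speakers, and end-of-turn decisions can then be derived by comparing the predicted activity of the two speakers over the future window.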
As will be discussed in Section 7.3, models based on incremental processing are also highly relevant for interactions involving
some form of task execution, for example where a human gives instructions to a robot, where the execution of the task can be ini-
tiated already while the instruction is being given (Hough and Schlangen, 2016; Gervits et al., 2020). Another possibility for more
sophisticated turn-taking that incremental processing allows for was shown by DeVault et al. (2009), where the system predicted
what the user was about to say in order to help the user to complete the sentence, possibly overlapping with the user's speech.
Fig. 6. Illustration of the difference between traditional non-incremental speech processing (on the left) and incremental speech processing (on the right).
4.3.1. Detecting backchannel-relevant places
Apart from detecting appropriate places to take the turn, the system may also consider where to produce backchannels, i.e.
detecting BRPs. As discussed in Sections 2.3 and 3 above, the cues that indicate backchannel-relevant places are somewhat differ-
ent from turn-yielding cues. It should also be noted that backchannels can be non-vocal, such as head nods. While it is possible to
use an IPU-based model for backchannels (cf. Truong et al. 2010), a continuous model is more appropriate, as backchannels often
are produced in overlap while the other participant is speaking.
Most backchannel models have been based on non-verbal cues, which avoids the problems of dealing with ASR in continuous
models. As mentioned in Section 3.2, Ward (1996) defined a very simple rule for backchannel-relevant places: 200ms after a
region of low pitch (defined as less than the 30th percentile pitch level), allowing for overlapping backchannels. The model was
implemented in a semi-autonomous dialogue system and informally evaluated. A more probabilistic approach was proposed by
Morency et al. (2008), where a sequential probabilistic model was trained on human-human face-to-face interactions to predict
listener backchannels (in the form of head nods), using prosody, words and eye gaze. The model is continuous, in that it outputs
a probability of a backchannel for every frame. By setting a threshold on this probability, the system can generate backchannels
at appropriate places, while the frequency of backchannels can be adjusted. The model was directly compared with the rule-
based approach of Ward (1996), and showed a significant improvement.
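A simple sketch of this kind of prosodic rule is given below; the frame step and the minimum length of the low-pitch region are assumed values, since the review only summarises the rule:

```python
# Sketch of the simple prosodic backchannel rule summarised above (after Ward, 1996):
# a backchannel-relevant place ~200 ms after a region of low pitch, where "low"
# means below the speaker's 30th-percentile pitch.
import numpy as np

def backchannel_times(pitch, frame=0.01, min_region=0.11, delay=0.2):
    """pitch: per-frame F0 values (NaN or 0 for unvoiced), frame step in seconds.
    Returns times (in seconds) at which a backchannel could be produced."""
    pitch = np.asarray(pitch, dtype=float)
    voiced = pitch[(~np.isnan(pitch)) & (pitch > 0)]
    if voiced.size == 0:
        return []
    low = np.percentile(voiced, 30)
    times, run = [], 0
    for i, f0 in enumerate(pitch):
        if not np.isnan(f0) and 0 < f0 <= low:
            run += 1
            if run * frame >= min_region:
                times.append(i * frame + delay)
                run = 0  # reset the counter so one region does not fire on every frame
        else:
            run = 0
    return times
```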
More recently, deep learning approaches have been proposed. Ruede et al. (2019) used an LSTM to predict (vocal) backchan-
nels in the Switchboard corpus, based on prosodic features as well as linguistic information (in the form of word vectors).
Hussain et al. (2019) addressed the problem of backchannel generation (including laughter) in the context of human-robot inter-
action. Instead of a supervised approach, they used deep reinforcement learning to maximise the engagement and attention of
the user.
5. Handling user interruptions and backchannels
Although dialogue systems can be implemented using a simplex channel (i.e., where only one participant can speak at a time),
it is often desirable to have a duplex channel, allowing the system to hear what the user is saying while it is speaking (i.e.
overlapping speech). A requirement for this is that the system uses echo cancellation, i.e., that the system's own voice is cancelled
out from the audio picked up by the microphone. The most common use case for this is to allow the user to "barge in" while the
system is speaking, i.e., to interrupt the system, as illustrated in Fig. 7(a). This is especially important when the system asks longer
questions or gives longer instructions which the user might have heard before or can be predicted from context. However, there
are also several caveats associated with barge-in.

Fig. 7. The handling of user interruptions, and associated problems.
One of the first to perform studies of users' barge-in behaviour was Heins et al. (1997). In their study, they found that users
attempted to barge-in without being informed about the possibility and did so frequently. Barge-in attempts tended to happen
at certain places in the system's prompt, especially at syntactic boundaries. They also noticed that user disfluencies (in the form
of stuttering or repetitions) were common in barge-in situations, which is in line with the observations of competitive overlap in
human communication discussed in Section 2.3. This can of course cause problems for further processing of the user's speech.
A typical problem when allowing barge-in is that of false barge-ins, as illustrated in example Fig. 7(b). A false barge-in might
either be triggered by non-speech audio, such as external noise or coughing, or by the user giving a backchannel without intend-
ing to take the turn. If not handled correctly, this can easily lead to confusion. One way of reducing this problem suggested by
Heins et al. (1997) is to raise the threshold for detecting user speech when this is less likely to occur and lower it when user inter-
ruptions are more likely (for example at a syntactic boundary). Apart from noise and coughing, the user might also just produce a
brief backchannel, without intending to take the turn. The system should therefore as early as possible try to predict, already at
IPU onset, whether the incoming audio is likely to constitute a longer turn or not. If not, it probably does not even have to stop
speaking. One example of such a model (trained on human-human data) was presented by Neiberg and Truong (2011), where a
prediction of a turn-start vs. a backchannel was made 100-500 ms after the speech onset, based on acoustic features. Another
example is the general model of Skantze (2017b), which was also shown to be able to make such distinctions by predicting the
length of the user's utterance.
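The decision at speech onset could be sketched as follows; the feature names and the classifier are placeholders for whatever model is used:

```python
# Sketch of an onset classifier: shortly after the user starts speaking during
# system speech, decide whether it is likely a backchannel (keep talking) or a
# genuine barge-in (pause and listen).
def handle_speech_onset(onset_features, classifier, p_backchannel_threshold=0.5):
    """Called a few hundred milliseconds after a detected user speech onset.

    `onset_features` might include energy, pitch slope, and a predicted utterance
    length estimated from the first frames of the incoming speech."""
    if classifier.predict_proba(onset_features) >= p_backchannel_threshold:
        return "continue-speaking"   # likely "mhm"/"yeah": no need to stop
    return "pause-system-speech"     # likely a real barge-in: stop and listen
```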
If the user's utterance was determined to be a true barge-in, it should still be verified at the end of the IPU. If it was not a true
barge-in, the system should ideally resume speaking, as illustrated in Fig. 7(c). As suggested by Ström and Seneff (2000), this
could for example be done using a filled pause, and then restarting from the last phrase boundary. An example of how these different
aspects are integrated is presented in Selfridge et al. (2013), where incremental speech recognition was used to make a continu-
ous decision of whether to pause, continue, or resume the system's utterance. The model was evaluated in a public spoken dia-
logue system, resulting in improved task success and efficiency compared to a more naive approach.
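The pause/resume handling can be sketched as a small controller; the text-to-speech interface and the filled pause token are placeholders, and real systems would of course need to handle threading and interruption mid-phrase:

```python
# Sketch of barge-in handling: pause on a suspected barge-in, then either yield
# the turn (true barge-in) or resume from the last phrase boundary, prefixed by
# a filled pause (cf. Ström and Seneff, 2000).
class SystemUtterance:
    def __init__(self, tts, phrases):
        self.tts = tts            # placeholder text-to-speech interface
        self.phrases = phrases    # the system utterance split at phrase boundaries
        self.index = 0            # index of the phrase currently being spoken
        self.paused = False

    def speak(self):
        self.paused = False
        while self.index < len(self.phrases) and not self.paused:
            self.tts.say(self.phrases[self.index])
            self.index += 1

    def on_suspected_barge_in(self):
        self.tts.stop()
        self.paused = True        # keep self.index at the interrupted phrase

    def on_barge_in_verified(self, is_true_barge_in):
        if is_true_barge_in:
            self.index = len(self.phrases)   # yield the turn; discard the rest
        elif self.paused:
            self.tts.say("ehm")              # filled pause, then restart the phrase
            self.speak()
```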
Another potential problem when allowing for barge-in is illustrated in Fig. 7(d). Here, the user responds to the rst system
utterance ("Now for the next question") with "Yeah", before hearing that the system has just started the next utterance ("Do you
need a visa?"). Depending on how the dialogue manager is implemented, there is a risk that the user's "Yeah" will be interpreted
as a positive reply to the question, which will be very confusing as the user has never heard the question which was unintention-
ally answered. To mitigate this problem, the system should ideally monitor its own speech production in order to evaluate the
likely context in which to interpret the users utterance.
Heins et al. (1997) also noted that users sometimes barge-in before hearing the important things that they need to hear. Thus,
it is also possible that the system should not always allow for barge-in, for example if it has something important to say.
Ström and Seneff (2000) explored how this could be signaled to the user when a barge-in attempt is detected. By raising the vol-
ume of the system's voice (in line with human behaviour during competitive overlap discussed in Section 2.3), it can signal that
the barge-in is not allowed, and correspondingly lower the voice if it is allowed.
6. Generating turn-taking cues
So far, we have mainly reviewed the processing of turn-taking cues from the user. However, it is also important to consider
the generation of turn-taking cues, so that the user knows when it is appropriate to speak and not. If the system fails to do this
correctly, the user might start to speak at the same time as the system. An example of an unclear turn-allocation is illustrated in
Fig. 8(a). If the system makes a pause (for example at a phrase boundary, or because it needs time for processing), it is important
that an appropriate turn-holding cue is produced. In the example in Fig. 8(a), an appropriate prosodic realisation (using level
intonation as discussed in Section 3.2) might be sufficient. However, conversational systems often do not have this level of con-
trol of the speech synthesizer, which might cause problems. Thus, even if turn-taking cues are not generated on purpose, users
will likely interpret these cues from the system's synthesized speech and other behaviours, such as gaze in case of an avatar or
robot. Several studies have therefore looked into how to generate appropriate behaviours in animated agents to facilitate turn-
taking (Cassell et al., 2001; Thórisson, 1999; Pelachaud et al., 1996).
To explore the effect of such turn-taking cues, Edlund and Beskow (2009) set up an experiment where two participants were
connected remotely to have a conversation, with an avatar (an animated head) representing each person on the other end. The
audio was transmitted between the participants with lip-sync added automatically to the avatar. The gaze direction of the avatar,
however, did not reflect the other participant, but was manipulated for the sake of the experiment. The results showed that
when the avatar gazed away at a potential turn-ending, the other participant was less likely to take the turn, compared to when
the agent looked at the other participant. Kunc et al. (2013) explored the effectiveness of visual and vocal turn-yielding cues in
a dialogue system using an animated agent. Their results indicated that the visual cues were more effective than vocal cues. In
another study, Skantze et al. (2014) investigated the effect of gaze, syntax and filled pauses as turn-holding cues at pauses in a
human-robot interaction setting, where the robot was instructing the user to draw a route on a map that was placed on the table
between them. When the robot looked at the map (i.e. did not gaze at the user), used a syntactically incomplete phrase, or used a
filled pause, the user was less likely to continue drawing or give feedback than when the robot looked at the user or produced a
complete phrase.
Another potentially problematic case is if there is a delay in the system's processing, which causes a delay in its response to
the user. In such cases, the user might not understand that the system is about to speak, and might continue speaking, as illus-
trated in Fig. 8(b). As discussed in Sections 2.1 and 2.3 above, humans typically use various turn-initial cues in these situations,
such as a filled pause, in-breath, or gaze aversion, to signal their willingness to take the turn.
Thus, one solution to this problem is to use more shallow (or incremental) processing to nd TRPs and decide that the system
should respond. At this point, the system can start to produce a turn-initial cue, even if the system does not know exactly what
to say yet. Then, when processing of the user's utterance is complete, the system can produce the actual response (cf. Skantze
and Hjalmarsson 2013; Skantze et al. 2015; Lala et al. 2019). To investigate the effectiveness of such cues in a human-robot inter-
action scenario, Skantze et al. (2015) systematically investigated different multi-modal turn-holding cues. Fig. 8(c) shows an
example where the user asks a question, and the system is not ready to respond immediately, and where a filled pause is used as
a turn-holding cue. To measure the effectiveness of different turn-holding cues, the probability for the user to continue speaking
in the window marked with "?" (which should be avoided) can be calculated. It was found that all investigated cues were effec-
tive: filled pause, in-breath, smile, and gaze aversion. The strongest effect was achieved by combining several cues; gaze aversion
together with a filled pause reduced the probability of the user starting to speak by half, compared to using no cues at all. This
indicates that the subtle cues humans use for coordinating turn-taking can be transferred to a human-like robot and have similar
effects, without any explicit instructions to the user.
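A minimal sketch of this strategy is given below, with a hypothetical robot interface and a slow generate_response function standing in for the deeper processing of the user's utterance:

```python
# Sketch of a turn-initial cue strategy: as soon as a TRP is detected, the turn is
# claimed with a cheap cue (filled pause, gaze aversion) while the full response
# is still being generated.
import threading

def respond_with_turn_initial_cue(robot, generate_response, user_utterance):
    result = {}

    def think():
        # slow, deeper processing of the user's utterance
        result["text"] = generate_response(user_utterance)

    worker = threading.Thread(target=think)
    worker.start()

    if worker.is_alive():      # response not ready yet: claim the turn
        robot.gaze_away()
        robot.say("ehm")       # filled pause as a turn-initial/turn-holding cue

    worker.join()              # wait for the actual response
    robot.gaze_at_user()
    robot.say(result["text"])
```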
Hjalmarsson and Oertel (2011) investigated the effect of gaze as a backchannel-inviting cue (as discussed in Section 3.4
above). They designed an experiment where participants were asked to provide feedback while listening to a story-telling virtual
agent. While the agent looked away for the most part while speaking, it gazed at the user at certain points. The results showed
that listeners were indeed more prone to give backchannels when the agent gazed at them, although there was a large variation
in their behaviour, indicating that there are other important factors involved.
7. Modelling turn-taking in multi-party and situated interaction
In Section 2.4, we discussed how turn-taking in multi-party interaction differs from dyadic interaction. Even though voice-
only interaction can be multi-party, the natural setting for such interaction is face-to-face, preferably physically situated, interac-
tion. Thus, the modelling of multi-party interaction has mostly been studied in the context of human-robot interaction (although
multi-agent, single-user interactions have also been modelled in virtual environments; Traum and Rickel 2002). Apart from the
detection and generation of turn-holding and turn-yielding cues, which is relevant for any type of spoken interaction, multi-party
interaction also involves the identification of the addressee of utterances. This means that the system has to both detect whom a
user might be addressing, and display proper behaviours when addressing a specic user, as illustrated in Fig. 9. We will discuss
both these issues in this section, as well as turn-taking in interactions that involve physical manipulation of objects.
Fig. 8. Problems with lack of turn-taking cues from the system.
7.1. Addressee detection
Several studies have looked into how to combine different cues for addressee detection in multi-party interaction (i.e., detect-
ing the addressee of a user's utterance), using machine learning (Katzenmaier et al., 2004; Vinyals et al., 2012; Jovanovic et al.,
2006). The most obvious signal is clearly the user's gaze. However, for practical systems, tracking the eye gaze of users is non-
trivial. Eye-tracking equipment typically requires some form of calibration, is often limited in terms of field-of-view, and
is sensitive to blinking and occlusion. Many systems therefore rely on head pose tracking as a proxy for gaze direction, which is
a simpler and more robust approach, but which cannot capture quick glances or track more precise gaze targets. Despite this,
studies have found head pose to be a fairly reliable indicator of visual focus of attention in multi-party interaction, given that the
targets are clearly separated (Katzenmaier et al., 2004; Stiefelhagen et al., 2002; Ba and Odobez, 2009; Johansson et al., 2013).
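As a rough illustration, the following sketch assigns the visual focus of attention from an estimated head yaw angle; the target angles and tolerance are illustrative and would depend entirely on the physical setup:

```python
# Sketch of head pose as a proxy for visual focus of attention: assign the frame
# to whichever target (robot or co-participant) is closest in yaw angle, assuming
# the targets are clearly separated.
def focus_of_attention(head_yaw_deg, targets=None, max_offset=25.0):
    """head_yaw_deg: estimated head yaw of the current speaker in degrees,
    with 0 meaning facing the robot. Returns the attended target, or None."""
    if targets is None:
        targets = {"robot": 0.0, "other_user": 60.0}  # depends on the seating geometry
    best, best_diff = None, max_offset
    for name, angle in targets.items():
        diff = abs(head_yaw_deg - angle)
        if diff <= best_diff:
            best, best_diff = name, diff
    return best
```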
In addition to gaze, some studies have also found other modalities to contribute. In multi-party interaction (involving two
humans and a robot), Shriberg et al. (2012) found that speech addressed towards the machine is louder and more clearly articu-
lated, which can be utilised to make the distinction. This is also in line with the findings of Katzenmaier et al. (2004), although
they found visual cues to be more informative.
Johansson and Skantze (2015) explored how the problem of addressee detection could be combined with end-of-turn-detec-
tion. They used an IPU-based model (see Section 4.2 above), and annotated each IPU in their dataset (a conversational game
between one robot and two humans) according to how appropriate it was for the robot to take the turn. Instead of a binary
notion, they used a scale ranging from "not appropriate" to "obliged". Cases where it is not appropriate to take the turn could
either be because the current speaker did not yield the turn, or because the turn was yielded to the other human (Fig. 9c). The
robot could also be "obliged" to take the turn, for example if a user looked at the robot and asked a direct question (Fig. 9b). In
between these, there were cases where it is possible to take the turn if needed, and cases where it is appropriate to take the
turn, but not obligatory. These were often cases where the user was attending to both the robot and the other user, or objects on
the table (Fig. 9a). Head pose (as a proxy for gaze) was a fairly informative feature, which might not be surprising, since gaze can
both serve the role as a turn-yielding signal and as a device to select the next speaker. By adding more features, such as prosody
and verbal features, the performance was improved further.
Fig. 9. Possible turn-transitions in multi-party human-robot interaction.
7.2. Addressing users and regulating the interaction
When it comes to generating visual turn-taking cues, an animated face might be sufficient for dyadic interaction, as the main
turn-regulatory function of gaze in such settings is to signal whether the agent is yielding or holding the turn (i.e., looking
towards the user or looking away). However, in multi-party interaction, an animated agent on a flat screen might not suffice, as
the users and the agent are not sharing the same physical space. This makes it impossible for the user to see exactly where the
agent is looking (in the users' physical space), a problem typically referred to as the "Mona Lisa effect" (Al Moubayed et al., 2012).
Thus, in a multi-party setting, this means that the agent cannot establish exclusive mutual gaze with one of the users, and in a sit-
uated interaction the object that is attended to cannot be inferred. Al Moubayed and Skantze (2011) compared an animated agent
on a 2D display with the Furhat robot head, which uses back-projection on a 3D mask, in a multi-party setting to explore the
impact on turn-taking. It was found that the turn-taking accuracy (i.e. how often the addressed user was the one taking the turn)
was 84% in the 3D condition, vs. 53% in the 2D condition. The response time was also faster in the 3D condition (1.38s vs. 1.85s),
which indicates that there was less confusion regarding the gaze target. It has also been shown that humans can utilise the
robot's gaze to disambiguate references to objects to achieve joint attention (Skantze et al., 2014).
Several studies have found that robots in multi-party interaction can inuence the turn-taking by actively selecting the next
speaker using gaze, which the users typically conform to (Mutlu et al., 2012; Bohus and Horvitz, 2010; Skantze et al., 2015).
This can be used to regulate the interaction, for example to increase the participation equality. One example of this is
Andrist et al. (2013), where a virtual agent was used to balance the contributions from a group of children playing a game. Such
balancing also requires a measure of dominance or participation equality. In Strohkorb et al. (2015), a model for classifying child-
ren's social dominance in group interactions is presented. Based on manual annotation of social dominance, a model was trained.
The main predictor that turned out to be useful was the children's gaze towards the robot (although there was no two-way spo-
ken interaction between the robot and the children). Another example is Nakano et al. (2015), where a dominance estimation
model (based on gaze and speech) was used to decide how the robot should regulate the interaction using gaze (in a Wizard-of-
Oz setup).
Skantze (2017a) investigated the ability of the robot to shape the interaction in a more open three-party conversation. The
results showed that the speaking time for most pairs of speakers was fairly imbalanced (with one participant speaking almost
twice as much as the other). However, in line with other studies, the robot was able to reduce the overall imbalance by address-
ing the less dominant speaker (Fig. 9f). This effect was stronger when mutual gaze was established between the robot and the
addressee. When the floor was open for self-selection (Fig. 9d-e), the imbalance instead increased.
7.3. Turn-taking in a physical world
Most models of turn-taking only consider speech and other communicative signals, such as gaze and gestures. However, in
situated interaction, humans naturally make use of what Clark (2005) refers to as "material signals". This includes the placement
of objects (or themselves) in special sites for the addressee to interpret. For example, when a customer in a shop places an item
to buy on the counter, this can be interpreted as the dialogue act "I would like to buy this item". Sometimes such actions are
accompanied with a spoken utterance (before, during or after the action takes place), but sometimes not. Thus, models of turn-
taking in a physical world should also take such actions into account. This might be especially important for human-robot inter-
action, which often involves physical tasks. For example, Hough and Schlangen (2016) explored a setting where a human gave
instructions to a robot arm to move objects on a table, where the robot did not speak at all, but where the actions of the robot
provided feedback on what the robot had understood, and where the timing of the instructions and the movements was impor-
tant for the fluidity of the interaction. Furthermore, the execution of physical actions can be done in overlap with the spoken
instruction, and might need some time for preparation, and thus the need for incremental processing and predictive mechanisms
becomes more important, as the robot can start to prepare and execute the action before the instruction is complete
(Gervits et al., 2020).
Another potential setting for human-robot interaction is that of joint assembly, where only one person at a time might be
able to perform an action (Calisgan et al., 2012), and this turn-taking has to be coordinated just like the taking of turns speaking.
Calisgan et al. (2012) performed an experiment where humans were given a joint assembly task, in order to identify the signals
used to coordinate their turn-taking. Common turn-yielding signals included putting the hands on the table, crossing the arms,
or taking a step back. Often, several cues were combined.
A related phenomenon also relevant for human-robot interaction is that of hand-overs. Moon et al. (2014) did a comparative
study on how the robot's gaze behaviour affects the efficiency with which the robot can hand over an object to a human. It was
found that the most human-like behaviour (where the robot first gazes at the location where the hand-over will take place, and
then up at the human when the human reaches for the object) was more efficient. In this condition, the human moved the hand
to the hand-over location even before the robot's hand reached there.
In case the task execution is accompanied with a spoken utterance (such as an acknowledgement), the spoken utterance may
provide information on the timing of the task execution. Skantze et al. (2014) explored the setting of a robot giving
instructions to a human subject on how to draw a route on a map. Each piece of instruction was typically followed by an acknowl-
edgement from the human. However, the lexical choice ("okay", "yeah", "mhm", etc.) and prosody of that acknowledgement var-
ied depending on whether it was produced before, while or after the corresponding route segment was drawn. Thus the form of
acknowledgement seemed to help the instruction giver to know the timing of the drawing action, and when the next
piece of instruction should follow.
When developing end-of-turn detection models (as discussed in Section 4) in interactions involving physical tasks, it is of
course also possible to include the manipulation of objects as a turn-taking cue. An example of this is the end-of-turn detection
model of Johansson and Skantze (2015), in the context of a robot playing a card-sorting game with two users, which also incorpo-
rated the movement of the cards as a feature for the data-driven turn-taking model.
8. Summary and future directions
As this review has shown, turn-taking in dialogue is a highly complex phenomenon that has attracted the interest of research-
ers from multiple disciplines. The coordination of when a turn will transition, and who will speak next, relies on multiple signals
across different modalities, including syntax, pragmatics, prosody and gaze. To some extent these cues are redundant, and to
some extent complementary, but the relative merits of different signals still need to be investigated further. Conversational sys-
tems have traditionally relied on simple silence timeouts to identify suitable turn transitions, but this is clearly not sufficient, as
pauses within turns are frequent. Several studies have shown how turn-taking cues from the user can be identied to better
detect or predict when the turn is yielded. It is also important to consider how the system can make use of such signals to help
the user understand when the floor is open or not, and who is supposed to speak next. If the agent is embodied, the visual chan-
nel adds more potential cues which can further improve the coordination of turn-taking.
However, despite this progress in computational modelling of turn-taking, it is interesting to note how simplistic turn-taking
models in most state-of-the-art conversational systems still are. Recently, there has been a surge in the research on neural mod-
els for dialogue systems and chatbots, in both academia and industry, especially regarding dialogue state tracking and end-to-
end response generation. The target channels where these models have been deployed and evaluated have typically been either
written chats, where turn-taking is trivial, or voice assistants in smart phones and smart speakers, where explicit turn-taking
using wake words has become a norm (for various reasons, as discussed in the introduction). This might help to explain why
turn-taking has not attracted the same level of interest as these other problems. We think this might change as the applications
of spoken dialogue systems move beyond question-answering and command-and-control, and towards more conversational
interaction styles, as well as social robots that exhibit a wider repertoire of turn-taking signals and interaction settings.
In the rest of this section, we will discuss some potential directions for future research on turn-taking in conversational sys-
tems.
8.1. General and adaptive models
A problem with current models of turn-taking is that they are trained and evaluated on specific (often human-human) cor-
pora, such as Switchboard (Godfrey et al., 1995) or Map Task (Anderson et al., 1991). Thus, there is of course a risk that these
models learn the patterns and peculiarities of these corpora, and it is not clear how well they actually generalise to other dialogue
styles, and especially to human-computer dialogue. As was discussed in Section 4, models that have been applied to actual dia-
logue systems have typically been trained on data of interactions with those specic systems. In addition, these models have
required some form of labelling, either through manual annotation or through Wizard-of-Oz. This is typically too costly for the
development of most dialogue systems, and might help to explain why such models are not widely used in practical systems.
Of course, not only the interaction setting and style affects turn-taking behaviour, but there are also cultural, demographic and
individual differences, as discussed in Section 2.5. Thus, it is important that future research investigates how general models can
be trained on various types of interactions, and how well such models perform on human-computer interactions of different
sorts. One approach would be to use some form of self-supervised (Skantze, 2017b) adaptation of the model, by letting the model
make predictions about the future on a certain interaction data set, if that is available. Another approach would be to use some
form of reinforcement learning (Jonsdottir et al., 2008; Selfridge and Heeman, 2010; Khouzaimi et al., 2015). Whether such mod-
els could be used to fine-tune a turn-taking model in interaction with humans is of course a very interesting question to explore.
8.2. Achieving natural timing
Most models of turn-taking in conversational systems have been based on the assumption that the system should be able to
respond as quickly as possible. However, if this goal is achieved, this might not result in very natural conversations. In human-
human dialogue, response time naturally depends on factors such as cognitive load, personality and situation
(Strömbergsson et al., 2013). While too long a response time can lead to uncomfortable silence and turn-taking problems (as dis-
cussed in Section 4.1), too quick responses can be perceived as insincere or unreflected. Thus, a certain delay might be more
appropriate for certain questions, and this delay should depend on the preceding context and the type of response. Looking at
human-human dialogue, it is clear that response time varies depending on the type of dialogue acts exchanged. In an analysis of
the Switchboard corpus, Strömbergsson et al. (2013) found that both question type and response type affected the response
time, where open and wh-questions had a longer response time than yes/no and alternative questions (300-450 ms vs.
100-180 ms). Stivers et al. (2009) found that confirmations are delivered 100-500 ms faster on average, compared to disconfirma-
tions, in a study of 10 different languages.
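One simple way to operationalise this observation would be to condition the response delay on the dialogue act pair, as sketched below; the mean values are only loosely inspired by the human-human figures cited above and would need to be estimated from data:

```python
# Sketch of context-dependent response delays: rather than always responding as
# fast as possible, sample a delay conditioned on the preceding and upcoming
# dialogue act. The values are illustrative, not measured.
import random

MEAN_DELAY = {
    ("yes/no-question", "confirm"): 0.10,
    ("yes/no-question", "disconfirm"): 0.40,
    ("wh-question", "answer"): 0.40,
    ("statement", "backchannel"): 0.05,
}

def response_delay(prev_act, next_act, jitter=0.05, default=0.30):
    mean = MEAN_DELAY.get((prev_act, next_act), default)
    return max(0.0, random.gauss(mean, jitter))
```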
A recent model that incorporates this aspect was presented by Roddy and Harte (2020), where a continuous end-of-turn
detection model uses an encoding of both the previous utterance (from the user) and the upcoming response (from the system)
to incrementally predict response timings, based on human-human dialogue data.
8.3. Taking pragmatics and utility into account
As the review on turn-taking cues in Section 3 showed, syntactic and pragmatic completion plays a major role in turn-taking.
However, pragmatic completion is very hard to model. Current approaches are often very simplistic, for example relying on the
last POS tags in the utterance. Either much stronger language models are needed, which can take the larger dialogue context into
account (as discussed in Section 3.1), or the system needs to involve deeper processing of the user's speech, where all compo-
nents of the dialogue system are involved in the decision of when to take the turn and not. This requires an incremental processing
framework, as discussed in Section 4.3.
When training turn-taking models on task-oriented dialogue data, such as Map Task, it is clear that information related to the
task is missing. In the example shown in Fig. 1, speaker B responds quickly to the first question ("have you got an extinct volcano?") because speaker B can identify the object on her map. If that had not been the case, the response time would
have been different, and this is not information that is accessible to the turn-taking model when trained purely on speech data.
In general, people do not just take the turn because there is a TRP, but because they have something to say. Thus any turn-tak-
ing model for task-oriented dialogue should also take the task into account and involve the utility of speaking. If the utility is high
(e.g. "there is a fire!"), the system might want to speak regardless of whether there is a TRP. If the utility is low, it might be less
prone to do so. However, even if it does not have something important to say, it might still be obliged to take the turn, given a strong
enough TRP (e.g. after someone has asked it "What do you think?"). Decision-theoretic models of turn-taking (which involve a
notion of utility) have been proposed by Raux and Eskenazi (2009) and Bohus and Horvitz (2011).
8.4. Predictive modelling
As discussed in Section 4.3, most turn-taking models are still reactive, in the sense that they do not try to predict the end of
the user's turn before it is actually complete. While reactive models may be sufficient for many scenarios, as conversational sys-
tems do not have the same cognitive and physiological constraints as humans, and therefore do not have to prepare their
response in advance, there are several potential applications for such models. For example, as discussed in Section 7.3, physical
actions of robots do have constraints and may need to be planned for in advance.
Beyond turn-shifts, predictive models can also be trained to predict other types of future events, such as prosody, sentence completions and dialogue acts. This could for example be used by the system to predict how the user will react to the system's own actions. This way, the system could explore different potential future scenarios and plan ahead accordingly. For example, by
simulating different prosodic realisations of an utterance, it could evaluate how the user is likely to react, and choose the most
appropriate one.
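A minimal sketch of this "plan by simulation" idea is shown below; the candidate prosodic parameters and the reaction predictor are placeholders for learned models and are purely illustrative.

```python
# Simulate candidate prosodic realisations, score the predicted user reaction, pick the best.

def predict_user_reaction(prosody):
    """Placeholder for a learned model that scores the predicted user reaction (0..1)."""
    # Illustrative stand-in: mildly prefer a rising final pitch and moderate speaking rate.
    return 0.5 + 0.3 * prosody["final_rise"] - 0.2 * abs(prosody["rate"] - 1.0)

candidates = [
    {"final_rise": 1.0, "rate": 1.0},   # clear question-like rise
    {"final_rise": 0.0, "rate": 1.0},   # flat, statement-like
    {"final_rise": 1.0, "rate": 1.4},   # rise, but noticeably faster
]

best = max(candidates, key=predict_user_reaction)
print("Chosen realisation:", best)
```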
When it comes to prediction, it is still unclear what types of predictions humans make. The model by Skantze (2017b) was based on prediction of speech activity in a 3-second window. A problem with that approach was that the predictions made
towards the end of this window were not very accurate. As argued by Sacks et al. (1974), and others, predictions are likely cen-
tered on words, where syntactic and pragmatic completion could help. However, the timing of those words is also important in
order to achieve accurate coordination.
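The following sketch illustrates the general form of such a model: at each time step, a recurrent network over acoustic features outputs a probability of speech activity for each frame in a fixed future window. The frame rate, window length and architecture are illustrative assumptions and do not reproduce the cited model.

```python
# A toy continuous predictor of future speech activity over a fixed window.
import torch
import torch.nn as nn

FRAME_RATE = 20          # assumed 50 ms frames
FUTURE_SECONDS = 3
FUTURE_FRAMES = FRAME_RATE * FUTURE_SECONDS

class SpeechActivityPredictor(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, FUTURE_FRAMES)  # one probability per future frame

    def forward(self, frames):
        # frames: (batch, time, feat_dim) acoustic features observed so far
        out, _ = self.rnn(frames)
        # At every time step, predict speech activity for the next FUTURE_FRAMES frames.
        return torch.sigmoid(self.head(out))

model = SpeechActivityPredictor()
pred = model(torch.randn(1, 200, 40))
print(pred.shape)  # torch.Size([1, 200, 60]): per-step predictions for the next 3 s
```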
Another motivation for predictive modelling is to be able to produce cooperative overlaps. While most models of turn-taking in conversational systems are still based on the goal of minimising gaps and overlaps, cooperative overlaps in human-human interaction do contribute to increased fluency and a feeling of rapport, as discussed in Section 2.3. Apart from the generation of backchannels, there is so far very little work on how to produce cooperative overlapping speech. A notable exception was DeVault et al. (2009), who developed a model for predicting what the user was about to say, in order to help the user complete the sentence, possibly overlapping with the user's speech. However, in their experience when testing the system with users, this often resulted in the agent being perceived as barging in and interrupting the user's speech. Thus, given the current state of the technology, the behaviour was deemed to be undesirable in most cases.
Generating cooperative overlapping speech, like the choral talk discussed in Section 2.3, would likely be an extremely challenging task, as it involves the incremental processing of the user's speech, the prediction of how it is likely to continue, and the incremental generation of synthesised speech with precise timing. If humans do rely on some form of coupled oscillation model, as proposed by Wilson and Wilson (2005), this could possibly be used to put the system in concert with the user.
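As a purely illustrative sketch of how a coupled-oscillator account could be operationalised, the system's readiness to speak could be represented as a phase that is continuously pulled towards the phase of the user's speech rhythm; the update rule and parameters below are assumptions, not the model of Wilson and Wilson (2005).

```python
# A toy phase-coupling update: the system's oscillator entrains to the user's rhythm.
import math

def entrain(system_phase, user_phase, coupling=0.2, natural_rate=4.0, dt=0.05):
    """Advance the system's oscillator one step, pulled towards the user's phase.
    natural_rate: the system's own rate in cycles per second (an assumed syllable rate)."""
    phase_pull = coupling * math.sin(user_phase - system_phase)
    return (system_phase + 2 * math.pi * natural_rate * dt + phase_pull) % (2 * math.pi)

# Readiness to start speaking could then be gated on the oscillator's phase,
# e.g. only launching a turn near the phase at which the user's syllables tend to start.
system_phase, user_phase = 0.0, 1.5
for _ in range(40):                      # 2 seconds of 50 ms steps
    system_phase = entrain(system_phase, user_phase)
    user_phase = (user_phase + 2 * math.pi * 4.0 * 0.05) % (2 * math.pi)

diff = (system_phase - user_phase + math.pi) % (2 * math.pi) - math.pi
print(f"Phase difference after entrainment: {abs(diff):.2f} rad")
```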
Declaration of interests
Gabriel Skantze is Professor in speech technology at the Department of Speech Music and Hearing at KTH. He is also co-
founder of the company Furhat Robotics.
Acknowledgments
This work is supported by the Swedish research council (VR) project Coordination of Attention and Turn-taking in Situated
Interaction (#2013-1403). Thanks to Nigel Ward for initial discussions on this review, and to Erik Ekstedt, Martin Johansson and
Raveesh Meena for their contributions to our work on modelling turn-taking at KTH. Thanks also to the reviewers and editor for
their very helpful remarks.
References
Al Moubayed, S., Edlund, J., Beskow, J., 2012. Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Trans. Interact. Intell. Syst. 1 (2), 1–25. https://doi.org/10.1145/2070719.2070724.
Al Moubayed, S., Skantze, G., 2011. Turn-taking control using gaze in multiparty human-computer dialogue: effects of 2D and 3D displays. Proceedings of the International Conference on Audio-Visual Speech Processing, 99–102.
Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H., Weinert, R., 1991. The HCRC map task corpus. Lang. Speech 34 (4), 351–366.
Andrist, S., Leite, I., Lehman, J., 2013. Fun and fair: influencing turn-taking in a multi-party game with a virtual agent. Proceedings of the 12th International Conference on Interaction Design and Children - IDC '13, 352–355. 10.1145/2485760.2485800.
Argyle, M., Cook, M., 1976. Gaze and Mutual Gaze. Cambridge University Press, Cambridge.
Argyle, M., Graham, J.A., 1976. The Central Europe experiment: looking at persons and looking at objects. Environ. Psychol. Nonverbal Behav. 1 (1), 6–16.
Auer, P., 2018. Gaze, Addressee Selection and Turn-taking in Three-party Interaction. In: Brône, G., Oben, B. (Eds.), Eye-tracking in Interaction: Studies on the Role of Eye Gaze in Dialogue. John Benjamins, pp. 197–232. https://doi.org/10.1075/ais.10.09aue.
Ba, S.O., Odobez, J.-M., 2009. Recognizing visual focus of attention from head pose in natural meetings. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39 (1), 16–33.
Ball, P., 1975. Listeners' responses to filled pauses in relation to floor apportionment. Br. J. Soc. Clin. Psychol. 14 (4), 423–424.
Bavelas, J.B., Coates, L., Johnson, T., 2002. Listener responses as a collaborative process: the role of eye gaze. J. Commun. 52 (September), 566–580.
Bell, L., Boye, J., Gustafson, J., 2001. Real-time Handling of Fragmented Utterances. In: Proceedings of the NAACL Workshop on Adaption in Dialogue Systems.
Bennett, A., 1981. Interruptions and the interpretation of conversation. Discourse Process. 4, 171–188.
Bögels, S., Torreira, F., 2015. Listeners use intonational phrase boundaries to project turn ends in spoken interaction. J. Phon. 52, 46–57. https://doi.org/10.1016/j.wocn.2015.04.004.
Bohus, D., Horvitz, E., 2010. Facilitating multiparty dialog with gaze, gesture, and speech. In: Proceedings of International Conference on Multimodal Interfaces, ICMI, Beijing, China.
Bohus, D., Horvitz, E., 2011. Decisions about turns in multiparty conversation: from perception to action. In: Proceedings of International Conference on Multimodal Interfaces, ICMI, pp. 153–160.
Brady, P.T., 1965. A technique for investigating on-off patterns of speech. Bell Syst. Tech. J. 44 (1), 1–22. https://doi.org/10.1002/j.1538-7305.1965.tb04135.x.
Brady, P.T., 1968. A statistical analysis of on-off patterns in 16 conversations. Bell Syst. Tech. J. 47 (1), 73–91. https://doi.org/10.1002/j.1538-7305.1968.tb00031.x.
Calisgan, E., Haddadi, A., Loos, H.F.M.V.D., Alcazar, J.A., Croft, E.A., 2012. Identifying nonverbal cues for automated human-robot turn-taking. In: Proceedings of the
IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.
Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., Yan, H., 2001. Human Conversation as a System Framework: Designing Embodied Conversational Agents. In: Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (Eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, US, pp. 29–63.
Clancy, P.M., Thompson, S.A., Suzuki, R., Tao, H., 1996. The conversational use of reactive tokens in English, Japanese, and Mandarin. J. Pragmat. 26 (3), 355–387. https://doi.org/10.1016/0378-2166(95)00036-4.
Clark, H., 1996. Using Language. Cambridge University Press, Cambridge, UK.
Clark, H., 2005. Coordinating with each other in a material world. Discourse Stud. 7 (4–5), 507–525.
Clark, H., Fox Tree, J.E., 2002. Using uh and um in spontaneous speaking. Cognition 84 (1), 73–111.
Coates, J., 1994. No gap, lots of overlap; turn-taking patterns in the talk of women friends. Researching Language and Literacy in Social Context: A Reader. Multi-
lingual Matters.
Cummins, F., 2012. Gaze and blinking in dyadic conversation: a study in coordinated behaviour among individuals. Lang Cogn. Process.. https://doi.org/10.1080/
01690965.2011.615220.
De Kok, I., Heylen, D., 2009. Multimodal end-of-turn prediction in multi-party meetings. In: Proceedings of International Conference on Multimodal Interfaces, ICMI, pp. 91–97. https://doi.org/10.1145/1647314.1647332.
Dethlefs, N., Hastie, H., Cuayahuitl, H., Yu, Y., Rieser, V., Lemon, O., 2016. Information density and overlap in spoken dialogue. Comput. Speech Lang. 37, 82–97. https://doi.org/10.1016/j.csl.2015.11.001.
DeVault, D., Sagae, K., Traum, D., 2009. Can I Finish? Learning When to Respond to Incremental Interpretation Results in Interactive Dialogue. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 11–20, London, UK.
Dittman, A.T., Llewellyn, L.G., 1967. The phonemic clause as a unit of speech decoding. J. Pers. Soc. Psychol. 6 (3), 341–349. https://doi.org/10.1037/h0024739.
Duncan, S., 1972. Some signals and rules for taking speaking turns in conversations. J. Pers. Soc. Psychol. 23 (2), 283–292.
Duncan, S., 1974. On signalling that it's your turn to speak. J. Exp. Soc. Psychol. 10 (3), 234–247.
Duncan, S., Fiske, D., 1977. Face-to-face Interaction: Research, Methods and Theory. Lawrence Erlbaum Associates, Hillsdale, New Jersey, US.
Edlund, J., Beskow, J., 2009. MushyPeek: a framework for online investigation of audiovisual dialogue phenomena. Lang. Speech 52 (2/3), 351–367. https://doi.org/10.1177/0023830909103179.
Edlund, J., Heldner, M., 2005. Exploring prosody in interaction control. Phonetica 62 (2–4), 215–226. https://doi.org/10.1159/000090099.
Ekstedt, E., Skantze, G., 2020. TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog. In: Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2981–2990. https://doi.org/10.18653/v1/2020.findings-emnlp.268.
Ervin-Tripp, S.M., 1979. Children's Verbal Turn-taking. In: Ochs, E., Schieffelin, B. (Eds.), Developmental Pragmatics. Academic Press, pp. 391–414.
Eyben, F., Wöllmer, M., Schuller, B., 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the ACM Multimedia, pp. 1459–1462, Florence, Italy.
Ferrer, L., Shriberg, E., Stolcke, A., 2002. Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody. In: Proceedings of the International Conference on Spoken Language Processing, ICSLP, pp. 2061–2064.
Ford, C., Thompson, S., 1996. Interactional units in conversation: syntactic, intonational, and pragmatic resources for the management of turns. In: Ochs, E., Schegloff, E., Thompson, A. (Eds.), Interaction and Grammar. Cambridge University Press, Cambridge, pp. 134–184.
French, P., Local, J., 1983. Turn-competitive incomings. J. Pragmat. https://doi.org/10.1016/0378-2166(83)90147-9.
Gao, Y., Mishchenko, Y., Shah, A., Matsoukas, S., Vitaladevuni, S., 2020. Towards data-efficient modeling for wake word spotting. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. IEEE, pp. 7479–7483.
Garrod, S., Pickering, M.J., 2015. The use of content and timing to predict turn transitions. Front. Psychol. 6, 751. https://doi.org/10.3389/fpsyg.2015.00751.
Gervits, F., Thielstrom, R., Roque, A., Scheutz, M., 2020. It's About Time: Turn-Entry Timing For Situated Human-Robot Dialogue. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 86–96.
Godfrey, J., Holliman, E., McDaniel, J., 1995. Switchboard: telephone speech corpus for research and development. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 517–520, San Francisco.
Goffman, E., 1979. Footing. Semiotica 25 (1–2), 1–30.
Goodwin, C., 1981. Conversational Organization: Interaction Between Speakers and Hearers. Academic Press, New York.
Gravano, A., Hirschberg, J., 2011. Turn-taking cues in task-oriented dialogue. Comput. Speech Lang. 25 (3), 601–634.
Gravano, A., Hirschberg, J., 2012. A Corpus-Based Study of Interruptions in Spoken Dialogue. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
Heins, R., Franzke, M., Durian, M., Bayya, A., 1997. Turn-taking as a design principle for barge-in spoken language systems. Int. J. Speech Technol. 2 (2), 155–164.
Heldner, M., Edlund, J., 2010. Pauses, gaps and overlaps in conversations. J. Phon. 38 (4), 555–568. https://doi.org/10.1016/j.wocn.2010.08.002.
Hemphill, C.T., Godfrey, J.J., Doddington, G.R., 1990. The ATIS spoken language systems pilot corpus. In: Proceedings of Speech and Natural Language Workshop.
Hjalmarsson, A., 2011. The additive effect of turn-taking cues in human and synthetic voice. Speech Commun. 53 (1), 23–35. https://doi.org/10.1016/j.specom.2010.08.003.
Hjalmarsson, A., Oertel, C., 2011. Gaze direction as a backchannel inviting cue in dialogue. In: Proceedings of the IVA 2012 Workshop on Realtime Conversational Virtual Agents, p. 1, Santa Cruz, CA.
Holler, J., Kendrick, K.H., Levinson, S.C., 2018. Processing language in face-to-face conversation: questions with gestures get faster responses. Psychonomic Bull. Rev. 25 (5), 1900–1908. https://doi.org/10.3758/s13423-017-1363-z.
Hough, J., Schlangen, D., 2016. Investigating Fluidity for Human-Robot Interaction with Real-time, Real-world Grounding Strategies. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 288–298. https://doi.org/10.18653/v1/W16-3637.
Hussain, N., Erzin, E., Metin Sezgin, T., Yemez, Y., 2019. Speech driven backchannel generation using deep Q-network for enhancing engagement in human-robot interaction. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. https://doi.org/10.21437/Interspeech.2019-2521.
Ishii, R., Otsuka, K., Kumano, S., Yamato, J., 2014. Analysis of respiration for prediction of "Who Will Be Next Speaker and When?" in multi-party meetings. In: Proceedings of the International Conference on Multimodal Interfaces, ICMI. https://doi.org/10.1145/2663204.2663271.
Ishii, R., Otsuka, K., Kumano, S., Yamato, J., 2016. Prediction of who will be the next speaker and when using gaze behavior in multiparty meetings. ACM Trans. Interact. Intell. Syst. https://doi.org/10.1145/2757284.
Johansson, M., Hori, T., Skantze, G., Höthker, A., Gustafson, J., 2016. Making turn-taking decisions for an active listening robot for memory training. In: Proceedings of the International Conference on Social Robotics, 9979 LNAI, pp. 940–949. https://doi.org/10.1007/978-3-319-47437-3_92.
Johansson, M., Skantze, G., 2015. Opportunities and Obligations to take turns in collaborative multi-party human-robot interaction. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 305–314.
Johansson, M., Skantze, G., Gustafson, J., 2013. Head pose patterns in multiparty human-robot team-building interactions. In: Proceedings of International Conference on Social Robotics (ICSR), 8239 LNAI. https://doi.org/10.1007/978-3-319-02675-6_35.
Jokinen, K., Furukawa, H., Nishida, M., Yamamoto, S., 2013. Gaze and turn-taking behavior in casual conversational interactions. ACM Trans. Interact. Intell. Syst. 3
(2).
Jokinen, K., Nishida, M., Yamamoto, S., 2010. On Eye-gaze and Turn-taking. In: Proceedings of the International Conference on Intelligent User Interfaces, IUI, pp. 118–123. https://doi.org/10.1145/2002333.2002352.
Jonsdottir, G.R., Thorisson, K.R., Nivel, E., 2008. Learning smooth, human-like turntaking in realtime dialogue. In: Proceedings of the Intelligent Virtual Agents, IVA.
https://doi.org/10.1007/978-3-540-85483-8_17.
Jovanovic, N., Op Den Akker, R., Nijholt, A., 2006. Addressee identification in face-to-face meetings. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, EACL, pp. 169–176.
Katzenmaier, M., Stiefelhagen, R., Schultz, T., Rogina, I., Waibel, A., 2004. Identifying the addressee in human-human-robot interactions based on head pose and speech. In: Proceedings of International Conference on Multimodal Interfaces, ICMI, PA, USA.
Kawahara, T., Iwatate, T., Takanashi, K., 2012. Prediction of turn-taking by combining prosodic and eye-gaze information in poster conversations. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
Kendon, A., 1967. Some functions of gaze direction in social interaction. Acta Psychol. (Amst) 26, 22–63.
Khouzaimi, H., Laroche, R., Lefèvre, F., 2015. Optimising turn-taking strategies with reinforcement learning. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 315–324. https://doi.org/10.18653/v1/w15-4643.
Koiso, H., Horiuchi, Y., Tutiya, S., Ichikawa, A., Den, Y., 1998. An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese map task dialogs. Lang. Speech 41, 295–321. https://doi.org/10.1177/002383099804100404.
Kunc, L., Míkovec, Z., Slavík, P., 2013. Avatar and dialog turn-yielding phenomena. Int. J. Technol. Hum. Interact. 9 (2), 66–88. https://doi.org/10.4018/jthi.2013040105.
Lala, D., Inoue, K., Kawahara, T., 2019. Smooth turn-taking by a robot using an online continuous model to generate turn-taking cues. In: Proceedings of International Conference on Multimodal Interfaces, ICMI, pp. 226–234. https://doi.org/10.1145/3340555.3353727.
Lee, C., Narayanan, S., 2010. Predicting interruptions in dyadic spoken interactions. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5250–5253.
Levelt, W., 1989. Speaking: From Intention to Articulation. MIT Press, Cambridge, Mass., USA.
Levinson, S.C., Torreira, F., 2015. Timing in turn-taking and its implications for processing models of language. Front. Psychol. 6 (JUN). https://doi.org/10.3389/
fpsyg.2015.00731.
Leyzberg, D., Spaulding, S., Toneva, M., Scassellati, B., 2012. The physical presence of a robot tutor increases cognitive learning gains. In: Proceedings of the 34th Annual Conference of the Cognitive Science Society. ISBN 978-0-9768318-8-4.
Local, J., Kelly, J., Wells, W., 1986. Towards a phonology of conversation: turn-taking in Tyneside English. J. Linguist. 22 (2), 411–437.
Maier, A., Hough, J., Schlangen, D., 2017. Towards deep end-of-turn prediction for situated spoken dialogue systems. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017, pp. 1676–1680. https://doi.org/10.21437/Interspeech.2017-1593.
Martinez, M.A., 1987. Dialogues among children and between children and their mothers. Child Dev. 58 (4), 1035–1043. https://doi.org/10.2307/1130544.
Masumura, R., Tanaka, T., Ando, A., Ishii, R., Higashinaka, R., Aono, Y., 2018. Neural dialogue context online end-of-turn detection. In: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 224–228. https://doi.org/10.18653/v1/W18-5024, Melbourne, Australia.
McFarland, D.H., 2001. Respiratory markers of conversational interaction. J. Speech Lang. Hear. Res. 44, 128–143.
McGlashan, S., Burnett, D.C., Carter, J., Danielsen, P., Ferrans, J., Hunt, A., Lucas, B., Porter, B., Rehor, K., Tryphonas, S., 2004. Voice extensible markup language (VoiceXML): version 2.0.
Meena, R., Skantze, G., Gustafson, J., 2014. Data-driven models for timing feedback responses in a map task dialogue system. Comput. Speech Lang. 28 (4), 903–922.
Mondada, L., 2007. Multimodal resources for turn-taking: pointing and the emergence of possible next speakers. Discourse Stud.. https://doi.org/10.1177/
1461445607075346.
Moon, A., Troniak, D., Gleeson, B., Pan, M., Zheng, M., Blumer, B., MacLean, K., Croft, E., 2014. Meet Me Where I'm Gazing: How Shared Attention Gaze Affects Human-robot Handover Timing. In: Proceedings of the ACM/IEEE International Conference on Human-robot Interaction, HRI, pp. 334–341. https://doi.org/10.1145/2559636.2559656, New York, NY.
Morency, L.P., de Kok, I., Gratch, J., 2008. Predicting listener backchannels: A probabilistic multimodal approach. In: Proceedings of the Intelligent Virtual Agents, IVA. Springer, Tokyo, Japan, pp. 176–190.
Mutlu, B., Kanda, T., Forlizzi, J., Hodgins, J., Ishiguro, H., 2012. Conversational gaze mechanisms for humanlike robots. ACM Trans. Interact. Intell. Syst. 1 (2), 1–12. https://doi.org/10.1145/2070719.2070725.
Nakano, Y.I., Yoshino, T., Yatsushiro, M., Takase, Y., 2015. Generating robot gaze on the basis of participation roles and dominance estimation in multiparty interaction. ACM Trans. Interact. Intell. Syst. 5 (4), 1–23. https://doi.org/10.1145/2743028.
Neiberg, D., Truong, K.P., 2011. Online detection of vocal Listener Responses with maximum latency constraints. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5836–5839. https://doi.org/10.1109/ICASSP.2011.5947688.
O'Conaill, B., Whittaker, S., Wilbur, S., 1993. Conversations over video conferences: an evaluation of the spoken aspects of video-mediated communication. Human-Comput. Interact. https://doi.org/10.1207/s15327051hci0804_4.
O'Connell, D.C., Kowal, S., 2005. Uh and um revisited: are they interjections for signaling delay? J. Psycholinguist. Res. 34 (6), 555–576. https://doi.org/10.1007/s10936-005-9164-3.
O'Connell, D.C., Kowal, S., Kaltenbacher, E., 1990. Turn-taking: a critical analysis of the research tradition. J. Psycholinguist. Res. 19 (6), 345–373. https://doi.org/10.1007/BF01068884.
Oertel, C., Wlodarczak, M., Edlund, J., Wagner, P., Gustafson, J., 2012. Gaze patterns in turn-taking. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
Oertel, C., Wlodarczak, M., Tarasov, A., Campbell, N., Wagner, P., 2012. Context cues for classification of competitive and collaborative overlaps. In: Proceedings of the International Conference on Speech Prosody, pp. 721–724.
Pelachaud, C., Badler, N., Steedman, M., 1996. Generating facial expressions for speech. Cogn. Sci. 20 (1).
Poesio, M., Rieser, H., 2010. Completions, Coordination, and Alignment in Dialogue. Dial. Discour. 1 (1). https://doi.org/10.5087/dad.2010.001.
Raux, A., Bohus, D., Langner, B., Black, A.W., Eskenazi, M., 2006. Doing research on a deployed spoken dialogue system: One year of Let's Go! experience. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Pittsburgh, PA, USA.
Raux, A., Eskenazi, M., 2008. Optimizing endpointing thresholds using dialogue features in a spoken dialogue system. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, Columbus, OH, USA.
Raux, A., Eskenazi, M., 2009. A finite-state turn-taking model for spoken dialog systems. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, pp. 629–637, Boulder, CO, USA.
Rochet-Capellan, A., Fuchs, S., 2014. Take a breath and take the turn: how breathing meets turns in spontaneous dialogue. Philosoph. Trans. R. Soc. B Biol. Sci..
https://doi.org/10.1098/rstb.2013.0399.
Roddy, M., Harte, N., 2020. Neural Generation of Dialogue Response Timings. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL. arXiv:2005.09128v1.
Roddy, M., Skantze, G., Harte, N., 2018. Investigating speech features for continuous turn-taking prediction using LSTMs. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Hyderabad, India.
Roddy, M., Skantze, G., Harte, N., 2018. Multimodal continuous turn-taking prediction using multiscale RNNs. In: Proceedings of the International Conference on Multimodal Interfaces, ICMI, pp. 186–190. https://doi.org/10.1145/3242969.3242997, New York, NY, USA.
Ruede, R., Müller, M., Stüker, S., Waibel, A., 2019. Yeah, right, uh-huh: A deep learning backchannel predictor. In: Proceedings of the Lecture Notes in Electrical Engineering. https://doi.org/10.1007/978-3-319-92108-2_25.
de Ruiter, J.-P., Mitterer, H., Enfield, N.J., 2006. Projecting the end of a speaker's turn: a cognitive cornerstone of conversation. Language (Baltim) 82 (3), 515–535. https://doi.org/10.1353/lan.2006.0130.
Sacks, H., Schegloff, E., Jefferson, G., 1974. A simplest systematics for the organization of turn-taking for conversation. Language (Baltim) 50, 696–735.
Sato, R., Higashinaka, R., Tamoto, M., Nakano, M., Aikawa, K., 2002. Learning decision trees to determine turn-taking by spoken dialogue systems. In: Proceedings of the International Conference on Spoken Language Processing, ICSLP.
Schegloff, E., 2000. Overlapping talk and the organization of turn-taking for conversation. Lang. Soc. 29 (1), 1–63.
Schlangen, D., 2006. From reaction to prediction: experiments with computational models of turn-taking. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2010–2013, Pittsburgh, PA, USA.
Schlangen, D., Skantze, G., 2009. A general, abstract model of incremental dialogue processing. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, EACL, pp. 710–718.
Schneider, L., Goffman, E., 1964. Behavior in public places: notes on the social organization of gatherings. Am. Sociol. Rev. https://doi.org/10.2307/2091496.
Selfridge, E.O., Arizmendi, I., Heeman, P.A., Williams, J.D., 2013. Continuously predicting and processing barge-in during a live spoken dialogue task. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 384–393.
Selfridge, E.O., Heeman, P.A., 2010. Importance-driven turn-bidding for spoken dialogue systems. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, pp. 177–185, Uppsala, Sweden.
Sellen, A.J., 1995. Remote conversations: the effects of mediating talk with technology. Human-Comput. Interact. https://doi.org/10.1207/s15327051hci1004_2.
Selting, M., 1996. On the interplay of syntax and prosody in the constitution of turn-constructional units and turns in conversation. Pragmatics 6, 357–388.
Shriberg, E., Stolcke, A., Hakkani-Tür, D., Heck, L., 2012. Learning when to listen: detecting system-addressed speech in human-human-computer dialog. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Portland, OR, USA.
Sikveland, R.O., Ogden, R., 2012. Holding gestures across turns. Gesture 12 (2), 166–199. https://doi.org/10.1075/gest.12.2.03sik.
Skantze, G., 2016. Real-time coordination in human-robot interaction using face and voice. AI Magazine 37 (4), 19. https://doi.org/10.1609/aimag.v37i4.2686.
Skantze, G., 2017. Predicting and regulating participation equality in human-robot conversations: effects of age and gender. In: Proceedings of the ACM/IEEE International Conference on Human-robot Interaction, HRI, pp. 196–204. https://doi.org/10.1145/2909824.3020210.
Skantze, G., 2017. Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 220–230. https://doi.org/10.18653/v1/W17-5527.
Skantze, G., Gustafson, J., 2009. Attention and interaction control in a human-human-computer dialogue setting. In: Proceedings of the Annual Meeting of the
Special Interest Group on Discourse and Dialogue, SIGDIAL.
Skantze, G., Hjalmarsson, A., 2013. Towards incremental speech generation in conversational systems. Comput. Speech Lang. 27 (1), 243–262.
Skantze, G., Hjalmarsson, A., Oertel, C., 2014. Turn-taking, feedback and joint attention in situated human-robot interaction. Speech Commun. 65, 50–66. https://doi.org/10.1016/j.specom.2014.05.005.
Skantze, G., Johansson, M., Beskow, J., 2015. Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. In: Proceedings of the ACM on International Conference on Multimodal Interaction, pp. 67–74. https://doi.org/10.1145/2818346.2820749.
Skantze, G., Schlangen, D., 2009. Incremental dialogue processing in a micro-domain. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, EACL, pp. 745–753.
Stiefelhagen, R., Yang, J., Waibel, A., 2002. Modeling focus of attention for meeting indexing based on multiple cues. IEEE Trans. Neural Networks 13 (4), 928–938.
Stivers, T., Enfield, N.J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J.-P., Yoon, K.-E., Levinson, S.C., 2009. Universals and cultural variation in turn-taking in conversation. Proc. Natl. Acad. Sci. U.S.A. 106 (26), 10587–10592. https://doi.org/10.1073/pnas.0903616106.
Streeck, J., Hartge, U., 1992. Previews: gestures at the transition place. In: Auer, P., di Luzio, P.A. (Eds.), The Contextualization of Language. Benjamins, Amsterdam, pp. 135–157.
Strohkorb, S., Leite, I., Warren, N., Scassellati, B., 2015. Classification of children's social dominance in group interactions with robots. In: Proceedings of the International Conference on Multimodal Interfaces, ICMI, pp. 227–234.
Ström, N., Seneff, S., 2000. Intelligent barge-in conversational systems. In: Proceedings of the International Conference on Spoken Language Processing, ICSLP.
Strömbergsson, S., Hjalmarsson, A., Edlund, J., House, D., 2013. Timing responses to questions in dialogue. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2584–2588.
Ten Bosch, L., Oostdijk, N., Boves, L., 2005. On temporal aspects of turn taking in conversational dialogues. Speech Commun. 47. https://doi.org/10.1016/j.specom.2005.05.009.
Thórisson, K.R., 1999. A mind model for multimodal communicative creatures and humanoids. Int. J. Appl. Artif. Intell. 13 (4–5), 449–486.
Tomasello, M., Hare, B., Lehmann, H., Call, J., 2007. Reliance on head versus eyes in the gaze following of great apes and human infants: the cooperative eye hypothesis. J. Hum. Evol. 52 (3), 314–320.
Torreira, F., Bögels, S., Levinson, S.C., 2015. Breathing for answering: the time course of response planning in conversation. Front. Psychol. 6. https://doi.org/10.3389/fpsyg.2015.00284.
Traum, D., Aggarwal, P., Artstein, R., Foutz, S., Gerten, J., Katsamanis, A., Leuski, A., Noren, D., Swartout, W., 2012. Ada and Grace: Direct interaction with museum visitors. In: Proceedings of the International Conference on Intelligent Virtual Agents. Springer, pp. 245–251.
Traum, D., Rickel, J., 2002. Embodied agents for multi-party dialogue in immersive virtual worlds. In: Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2, pp. 766–773. https://doi.org/10.1145/544862.544922, New York, NY, USA.
Traum, D., Roque, A., Leuski, A., Georgiou, P., Gerten, J., Martinovski, B., Narayanan, S., Robinson, S., Vaswani, A., 2007. Hassan: A virtual human for tactical questioning. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 71–74.
Truong, K., Poppe, R., Heylen, D., 2010. A rule-based backchannel prediction model using pitch and pause information. In: Kobayashi, T., Hirose, K., Nakamura, S. (Eds.), Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 3058–3061.
Truong, K.P., 2013. Classification of cooperative and competitive overlaps in speech using cues from the context, overlapper, and overlappee. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1404–1408.
Velichkovsky, B.M., 1995. Communicating attention: gaze position transfer in cooperative problem solving. Pragmat. Cogn. 3, 199–224.
Vinyals, O., Bohus, D., Caruana, R., 2012. Learning speaker, addressee and overlap detection models from multimodal streams. In: Proceedings of the 14th ACM International Conference on Multimodal Interaction. ACM, pp. 417–424.
Wang, Y.T., Nip, I.S., Green, J.R., Kent, R.D., Kent, J.F., Ullman, C., 2012. Accuracy of perceptual and acoustic methods for the detection of inspiratory loci in spontaneous speech. Behav. Res. Methods 44 (4), 1121–1128. https://doi.org/10.3758/s13428-012-0194-0.
Ward, N., 1996. Using prosodic clues to decide when to produce backchannel utterances. In: Proceedings of the Fourth International Conference on Spoken Language Processing, pp. 1728–1731, Philadelphia, USA.
Ward, N., 2004. Pragmatic Functions of Prosodic Features in Non-Lexical Utterances. In: Proceedings of the International Conference on Speech Prosody, pp. 325–328. 10.1.1.2.433.
Ward, N., 2019. Prosodic Patterns in English Conversation. Cambridge University Press. https://doi.org/10.1017/9781316848265.
Ward, N., Aguirre, D., Cervantes, G., Fuentes, O., 2018. Turn-Taking Predictions across Languages and Genres Using an LSTM Recurrent Neural Network. In: Proceedings of the IEEE Spoken Language Technology Workshop, SLT, pp. 831–837. https://doi.org/10.1109/SLT.2018.8639673.
Ward, N., Rivera, A., Ward, K., Novick, D., 2005. Root causes of lost time and user stress in a simple dialog system. In: Proceedings of Interspeech 2005, Lisbon, Portugal.
Weiss, C., 2018. When gaze-selected next speakers do not take the turn. J. Pragmat.. https://doi.org/10.1016/j.pragma.2018.05.016.
Weizenbaum, J., 1966. ELIZA - A computer program for the study of natural language communication between man and machine. Commun. Assoc. Comput. Mach. 9, 36–45.
Wilson, M., Wilson, T.P., 2005. An oscillator model of the timing of turn-taking. Psychon. Bull. Rev. 12 (6), 957–968.
Witt, S., 2015. Modeling user response timings in spoken dialog systems. Int. J. Speech Technol. 18 (2), 231–243. https://doi.org/10.1007/s10772-014-9265-1.
Włodarczak, M., Heldner, M., 2016. Respiratory belts and whistles: A preliminary study of breathing acoustics for turn-taking. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 510–514. https://doi.org/10.21437/Interspeech.2016-344.
Woodruff, A., Aoki, P.M., 2003. How push-to-talk makes talk less pushy. In: Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work, pp. 170–179. https://doi.org/10.1145/958160.958187, New York, NY, USA.
Yang, L.-C., 2001. Visualizing spoken discourse: prosodic form and discourse functions of interruptions. In: Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, pp. 1–10. https://doi.org/10.3115/1118078.1118106.
Yngve, V.H., 1970. On getting a word in edgewise. In: Proceedings of the Papers from the Sixth Regional Meeting of the Chicago Linguistic Society. Department of Linguistics, Chicago, pp. 567–578.
Zellers, M., House, D., Alexanderson, S., 2016. Prosody and hand gesture at turn boundaries in Swedish. In: Proceedings of the International Conference on Speech Prosody, pp. 831–835. https://doi.org/10.21437/speechprosody.2016-170.
Zima, E., Weiß, C., Brône, G., 2019. Gaze and overlap resolution in triadic interactions. J. Pragmat. 140, 49–69. https://doi.org/10.1016/j.pragma.2018.11.019.