Turn-taking latency in the Switchboard corpus (Godfrey et al., 1995), as calculated and visualised by Levinson and Torreira (2015). Negative latencies represent overlaps and positive latencies gaps.

Source publication
Article
Full-text available
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. C...

Contexts in source publication

Context 1
... operationalising turn-taking using IPUs, it is possible to analyse turn-taking patterns and statistics in larger corpora with automatic methods (e.g. Brady 1968; Ten Bosch et al. 2005; Heldner and Edlund 2010; Levinson and Torreira 2015). One example of this is the histogram of turn-taking latency shown in Fig. 2. As can be seen, even if gaps and overlaps are common, humans are ...
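To make the IPU-based operationalisation concrete, here is a minimal sketch (not the authors' code; the IPU tuples and toy values are assumptions) of how turn-taking latency can be computed from an IPU segmentation of a two-party dialogue: at every change of speaker, the latency is the next IPU's start time minus the previous IPU's end time, so negative values are overlaps and positive values are gaps.

```python
# Minimal sketch: turn-taking latencies from IPU segmentations (assumed format).
from typing import List, Tuple

IPU = Tuple[str, float, float]  # (speaker, start_s, end_s)

def turn_taking_latencies(ipus: List[IPU]) -> List[float]:
    """Latency at every change of speaker between consecutive IPUs."""
    ipus = sorted(ipus, key=lambda u: u[1])   # order by start time
    latencies = []
    for prev, curr in zip(ipus, ipus[1:]):
        if prev[0] != curr[0]:                # speaker change only
            latencies.append(curr[1] - prev[2])
    return latencies

# Toy example: B starts 0.2 s after A finishes (gap), then A overlaps B by 0.1 s.
ipus = [("A", 0.0, 1.5), ("B", 1.7, 3.0), ("A", 2.9, 4.0)]
print([round(x, 3) for x in turn_taking_latencies(ipus)])  # [0.2, -0.1]
```

A histogram over such latencies for a whole corpus yields the kind of distribution shown in Fig. 2.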
Context 2
... the turn-taking cues outlined above provide some explanation as to how the listener can differentiate pauses from gaps, they cannot provide a full account of how turn-taking is coordinated. If the listener only focused on cues at the end of the turn, the typical response time of 200 ms (as seen in Fig. 2) would not be plausible. This would not give the listener enough time to react to the cue, prepare a response and start speaking. According to psycholinguistic estimates, the response time would rather be around 600-1500 ms (Levinson and Torreira, 2015). This has led many researchers to conclude that there must be some sort of prediction mechanism involved ( ...
Context 3
... to conclude that there must be some sort of prediction mechanism involved (Sacks et al., 1974; Levinson and Torreira, 2015; Garrod and Pickering, 2015; Ward, 2019). This is even more evident when considering the fairly large proportion of turn-shifts that occur without any gap at all, i.e., before the IPU has even been completed (as seen in Fig. 2). An example of such a turn-shift was shown in Fig. 1. Well before the first question is complete (before the final word "volcano" is spoken), speaker B must predict that the turn is about to end, what kind of dialogue act is being produced (a question), as well as the final word in the question, in order to prepare a meaningful ...

Similar publications

Article
Full-text available
Robots increasingly act as our social counterparts in domains such as healthcare and retail. For these human-robot interactions (HRI) to be effective, a question arises on whether we trust robots the same way we trust humans. We investigated whether the determinants competence and warmth, known to influence interpersonal trust development, influenc...

Citations

... Across this body of work, several technical and design challenges remain unresolved, including high response latency [36,37], hallucinated content, poor alignment with user expectations [38], and limited support for multilingual and dialectal variation [39]. Furthermore, very few systems have been tested with both younger and older adult users in real-world settings. ...
... Effective turn-taking remains one of the most persistent challenges in spoken human-robot interaction [36]. In our evaluation, fixed timing strategies, such as a static silence threshold of 1.2 s, proved inadequate across both age groups. ...
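As a point of reference for the fixed-timing strategy criticised in this excerpt, the following is a minimal sketch (hypothetical, not the cited system) of a reactive end-of-turn detector that only takes the turn after a static amount of user silence, e.g. 1.2 s; its response latency can never drop below that threshold.

```python
# Minimal sketch of a reactive, fixed silence-threshold end-of-turn detector.
def reactive_endpointer(frame_is_speech, frame_s=0.02, silence_threshold_s=1.2):
    """Yield frame indices at which the system would decide to take the turn."""
    silence = 0.0
    in_user_turn = False
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            in_user_turn = True
            silence = 0.0
        elif in_user_turn:
            silence += frame_s
            if silence >= silence_threshold_s:
                yield i                       # take the turn here
                in_user_turn = False
                silence = 0.0

# 1.0 s of speech followed by 2.0 s of silence, at 20 ms frames:
frames = [True] * 50 + [False] * 100
print(list(reactive_endpointer(frames)))      # decision ~1.2 s into the silence
```

The decision only arrives 1.2 s after the user stops speaking, which is exactly the latency cost that predictive models try to avoid.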
Article
Full-text available
Large Language Models (LLMs), particularly those enhanced through Reinforcement Learning from Human Feedback, such as ChatGPT, have opened up new possibilities for natural and open-ended spoken interaction in social robotics. However, these models are not inherently designed for embodied, multimodal contexts. This paper presents a user-centred approach to integrating an LLM into a humanoid robot, designed to engage in fluid, context-aware conversation with socially isolated older adults. We describe our system architecture, which combines real-time speech processing, layered memory summarisation, persona conditioning, and multilingual voice adaptation to support personalised, socially appropriate interactions. Through iterative development and evaluation, including in-home exploratory trials with older adults (n = 7) and a preliminary study with young adults (n = 43), we investigated the technical and experiential challenges of deploying LLMs in real-world human–robot dialogue. Our findings show that memory continuity, adaptive turn-taking, and culturally attuned voice design enhance user perceptions of trust, naturalness, and social presence. We also identify persistent limitations related to response latency, hallucinations, and expectation management. This work contributes design insights and architectural strategies for future LLM-integrated robots that aim to support meaningful, emotionally resonant companionship in socially assistive settings.
... In addition, the participants engaged in back-and-forth conversations with the virtual AI agent for 5 min. This means that interpersonal interaction and linguistics-related factors, such as utterance sequences (Solomon et al., 2021) and turn-taking patterns (Levinson, 2016; Sacks et al., 1978; Skantze, 2021), could have moderated the effect of gender matching on outcomes. ...
... Theoretically, it clarifies how universal turn-taking principles unfold within a diverse landscape of demographic and individual factors. Practically, these insights are important for the development of turn-taking models in conversational systems [23]. While there have recently been many efforts to develop predictive models of turn-taking based on human-human conversational data [24,25,26], the question remains how universal these models are and to what extent they should be conditioned on individual or demographic factors. ...
... For the development of predictive models of turn-taking in conversational systems [23], these results indicate that such models should take other factors into account than just the immediately preceding context. By conditioning the models on more long-term idiosyncratic and dyad-specific patterns, the model predictions are likely to improve. ...
Preprint
Full-text available
Turn-taking in dialogue follows universal constraints but also varies significantly. This study examines how demographic (sex, age, education) and individual factors shape turn-taking using a large dataset of US English conversations (Fisher). We analyze Transition Floor Offset (TFO) and find notable interspeaker variation. Sex and age have small but significant effects: female speakers and older individuals exhibit slightly shorter offsets, while education shows no effect. Lighter topics correlate with shorter TFOs. However, individual differences have a greater impact, driven by a strong idiosyncratic and an even stronger "dyadosyncratic" component: speakers in a dyad resemble each other more than they resemble themselves in different dyads. This suggests that the dyadic relationship and joint activity are the strongest determinants of TFO, outweighing demographic influences.
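A minimal sketch (toy data, not the Fisher pipeline described in the abstract) of the kind of grouping this analysis implies: collect Transition Floor Offsets per speaker and per dyad, so that idiosyncratic and dyad-specific ("dyadosyncratic") averages can be compared.

```python
# Minimal sketch: grouping TFOs by speaker and by dyad (toy data).
from collections import defaultdict
from statistics import mean

# Each record: (dyad_id, next_speaker_id, tfo_seconds)
tfos = [
    ("d1", "s1", 0.15), ("d1", "s2", 0.25), ("d1", "s1", 0.05),
    ("d2", "s1", 0.60), ("d2", "s3", 0.55),
]

by_speaker = defaultdict(list)
by_dyad = defaultdict(list)
for dyad, speaker, tfo in tfos:
    by_speaker[speaker].append(tfo)
    by_dyad[dyad].append(tfo)

print({s: round(mean(v), 2) for s, v in by_speaker.items()})
print({d: round(mean(v), 2) for d, v in by_dyad.items()})
# Speaker s1 behaves quite differently in d1 vs d2, illustrating the
# "dyadosyncratic" pattern described in the abstract.
```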
... Although effortless for humans, turn-taking is highly challenging for robots. Most turn-taking algorithms in use today are based on a robot reacting to silence after a turn [4], resulting in interactions that are less fluid than those between humans [5,6]. Predictive turn-taking models (PTTMs) have been proposed to overcome these limitations [7,4]. ...
... Most turn-taking algorithms in use today are based on a robot reacting to silence after a turn [4], resulting in interactions that are less fluid than those between humans [5,6]. Predictive turn-taking models (PTTMs) have been proposed to overcome these limitations [7,4]. PTTMs learn human-like turn-taking from large corpora of human interaction, e.g. ...
... Most PTTMs use speech features [4]. Yet in human-human interaction, listeners make faster and more accurate turn-taking decisions when they can see and hear a speaker [10]. ...
Preprint
Full-text available
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features, to better exploit visual cues, reaching 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
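Two ingredients of this kind of evaluation can be sketched as follows (a rough illustration with synthetic signals, not the paper's code): mixing noise into speech at a target SNR, and scoring hold/shift decisions against ground truth.

```python
# Minimal sketch: mixing noise at a target SNR and scoring hold/shift accuracy.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def hold_shift_accuracy(pred_shift: np.ndarray, true_shift: np.ndarray) -> float:
    """Fraction of hold/shift decisions (0/1 labels) the model gets right."""
    return float(np.mean(pred_shift == true_shift))

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)           # 1 s of synthetic "speech" at 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)  # test input at 10 dB SNR
print(hold_shift_accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))  # 0.75
```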
... Active speaker detection (ASD) [4,27,23,18,15,26,11] aims to identify whether a visible person in a video is speaking. This task plays a critical role in various downstream applications, including speaker diarization [16,24], audiovisual speech recognition [25,2,17], and human-robot interaction [12,21,22]. To support the development of ASD models, several benchmark datasets have been proposed [20,13,4], most notably the AVA-ActiveSpeaker dataset [20], which is constructed entirely from movie content. ...
Preprint
Full-text available
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
... Turn-taking is therefore predictive: listeners plan their next turn while the speaker is still speaking (Levinson and Torreira, 2015;Garrod and Pickering, 2015). Multimodal cues including syntax, prosody and gaze support this process (Holler et al., 2016), enabling speakers to hold the floor or to shift to another speaker (Skantze, 2021). ...
... This results in dialogue that is less spontaneous than human interaction (Li et al., 2022;Woodruff and Aoki, 2003). Predictive turn-taking models (PTTMs) aim to overcome these issues (Skantze, 2021). Inspired by human turn-taking, PTTMs are neural networks trained to continually make turn-taking predictions, e.g. the probability of an upcoming shift (Skantze, 2017). ...
... Inspired by human turn-taking, PTTMs are neural networks trained to continually make turn-taking predictions, e.g. the probability of an upcoming shift (Skantze, 2017). Most work on PTTMs has been conducted using corpora of twoparty human interaction (Skantze, 2021). ...
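To make the PTTM idea in these excerpts concrete, here is a minimal sketch (PyTorch, with hypothetical feature and frame dimensions; real PTTMs are considerably more elaborate) of a network that outputs, at every frame, the probability of an upcoming turn shift.

```python
# Minimal sketch of a frame-level predictive turn-taking model (PTTM).
import torch
import torch.nn as nn

class TinyPTTM(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, frames, n_features)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h))    # (batch, frames, 1) shift probability

model = TinyPTTM()
features = torch.randn(2, 100, 40)            # 2 dialogues, 100 feature frames each
shift_prob = model(features)
print(shift_prob.shape)                       # torch.Size([2, 100, 1])
```

Because the model emits a prediction at every frame, it can signal an upcoming shift before the current turn has ended, rather than reacting to silence after it.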
Preprint
Full-text available
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group them by the duration of silence between turns. This reveals that, through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
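The grouping-by-silence-duration analysis mentioned in this abstract can be sketched roughly as follows (toy events and bin edges are assumptions, not MM-VAP's actual evaluation code): each hold/shift prediction is binned by the gap duration of the transition it refers to, and accuracy is reported per bin.

```python
# Minimal sketch: hold/shift accuracy grouped by gap duration between turns.
from collections import defaultdict

# (gap_seconds, predicted_shift, true_shift)
events = [(0.05, 1, 1), (0.12, 0, 1), (0.30, 1, 1), (0.75, 0, 0), (0.90, 1, 0)]
bins = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]

correct = defaultdict(int)
total = defaultdict(int)
for gap, pred, true in events:
    for lo, hi in bins:
        if lo <= gap < hi:
            total[(lo, hi)] += 1
            correct[(lo, hi)] += int(pred == true)

for b in bins:
    if total[b]:
        print(f"{b[0]:.2f}-{b[1]:.2f} s: {correct[b] / total[b]:.2f} ({total[b]} events)")
```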
... We define turn-taking events based on prior studies [24,25,26,27]. These events encompass turn-taking behaviors, backchannels, and interjections. ...
Preprint
Full-text available
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents, capable of adapting their behavior based on personality, remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
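The pipeline structure described in this abstract could look roughly like the following sketch; every function name, the segment layout and the returned label are placeholders (the paper's actual ASR system, annotation scheme and LLM prompts are not specified in this excerpt).

```python
# Minimal sketch: ASR output -> conversation-level annotations -> LLM label.
def transcribe(audio_path):
    """Placeholder ASR: return a list of (speaker, start_s, end_s, text)."""
    return [("A", 0.0, 1.2, "hi how are you"), ("B", 1.3, 2.0, "good thanks")]

def annotate(segments):
    """Derive simple conversation-level annotations (here: response latencies)."""
    latencies = [b[1] - a[2] for a, b in zip(segments, segments[1:]) if a[0] != b[0]]
    return {"mean_response_latency_s": sum(latencies) / max(len(latencies), 1)}

def call_llm(prompt):
    """Placeholder for an LLM call; a real system would query an API here."""
    return "extraverted"

segments = transcribe("call_001.wav")
annotations = annotate(segments)
prompt = f"Given transcript {segments} and features {annotations}, label the speaker's personality."
print(call_llm(prompt))
```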
... Unlike text-based communication, which is often static, asynchronous, and turn-based, voice naturally enables rich, dynamic, and human-like interactions. For example, we speak to draw attention and initiate dialogue (even when the other person is not looking), interrupt or overlap speech to signal urgency or redirect conversational flow, and use simple backchanneling cues like 'mmh' or 'yeah' to convey attentiveness and engagement when others speak (Skantze, 2021;Yang et al., 2022). In addition, voice carries rich vocal cues (such as tone, inflection, and rhythm) and subtle emotional nuances that other modalities cannot replicate (Bora, 2024;Schroeder and Epley, 2016). ...
Preprint
Full-text available
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
... Based on the Ethnography of Communication, a communicative sequence, as defined within the act sequence component, describes the ordering of communicative acts within the same event [6]. In our study, based on the interactive patterns of turn-taking and idea-building in group discussions [26], we identified several communicative sequence pairs, such as ask/answer, give/agree, give/disagree, and give/build on. When those communicative sequences span different events, we prompt the LLM to refer to the overall context and modify the code accordingly. ...
Preprint
Full-text available
Dialogue data has been a key source for understanding learning processes, offering critical insights into how students engage in collaborative discussions and how these interactions shape their knowledge construction. The advent of Large Language Models (LLMs) has introduced promising opportunities for advancing qualitative research, particularly in the automated coding of dialogue data. However, the inherent contextual complexity of dialogue presents unique challenges for these models, especially in understanding and interpreting complex contextual information. This study addresses these challenges by developing a novel LLM-assisted automated coding approach for dialogue data. The novelty of our proposed framework is threefold: 1) We predict the code for an utterance based on dialogue-specific characteristics (communicative acts and communicative events) using separate prompts following the role-prompt and chain-of-thought methods; 2) We engaged multiple LLMs, including GPT-4-turbo, GPT-4o, and DeepSeek, in collaborative code prediction; 3) We leveraged the interrelation between events and acts to implement consistency checking using GPT-4o. In particular, our contextual consistency checking provided a substantial accuracy improvement. We also found the accuracy of act predictions was consistently higher than that of event predictions. This study contributes a new methodological framework for enhancing the precision of automated coding of dialogue data as well as offers a scalable solution for addressing the contextual challenges inherent in dialogue analysis.
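A rough sketch of the framework's control flow (the prompts, the placeholder ask_llm function and the consistency rule are all hypothetical; the actual study queries GPT-4-turbo, GPT-4o and DeepSeek): the act and the event are predicted with separate prompts, and an inconsistent pair triggers a re-examination prompt.

```python
# Minimal sketch: separate act/event prompts with a consistency check.
def ask_llm(prompt: str) -> str:
    """Placeholder; a real pipeline would call an LLM API here."""
    return {"act": "ask", "event": "question-answer"}.get(prompt.split()[0], "give")

def code_utterance(utterance: str) -> dict:
    act = ask_llm(f"act Which communicative act is this utterance? {utterance}")
    event = ask_llm(f"event Which communicative event does it belong to? {utterance}")
    # Consistency check: an 'ask' act should sit inside a question-answer event.
    consistent = not (act == "ask" and event != "question-answer")
    if not consistent:
        event = ask_llm(f"event Re-examine the event given the act '{act}': {utterance}")
    return {"act": act, "event": event, "consistent": consistent}

print(code_utterance("What do you think about this idea?"))
```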
... In the last two decades there has been an increasing interest in investigating speech phenomena of spontaneous conversations, both in speech science and speech technology (cf., e.g., Arnold et al., 2017; Skantze, 2021; Lopez et al., 2022). In speech technology, the motivations to develop better models for spontaneous speech are, among others, to make automatic speech recognition more robust to variation, to make speech synthesis sound more natural and to make speaking robots more social. ...
Preprint
Full-text available
This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, be suitable for subsequent phonetic analysis (thus time-aligned annotations were made in Praat), and be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of the segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements are often either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system, will contribute to a stronger cross-fertilization between these disciplines.
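For the inter-rater agreement analysis mentioned here, a minimal sketch (toy labels, not the GRASS annotations) of Cohen's kappa over two annotators' labels for the same set of candidate points:

```python
# Minimal sketch: Cohen's kappa between two annotators' labels (toy data).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels at six candidate points, e.g. boundary type decisions.
a = ["hold", "shift", "hold", "hold", "shift", "hold"]
b = ["hold", "shift", "hold", "shift", "shift", "hold"]
print(round(cohens_kappa(a, b), 2))  # 0.67, i.e. substantial agreement
```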