Fig 9 - available via license: Creative Commons Attribution 4.0 International
Source publication
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. C...
Contexts in source publication
Context 1
... of turn-holding and turn-yielding cues, which is relevant for any type of spoken interaction, multi-party interaction also involves the identification of the addressee of utterances. This means that the system has to both detect whom a user might be addressing, and display proper behaviours when addressing a specific user, as illustrated in Fig. 9. We will discuss both these issues in this section, as well as turn-taking in interactions that involve physical manipulation of objects. ...
Context 2
... and two humans) according to how appropriate it was for the robot to take the turn. Instead of a binary notion, they used a scale ranging from "not appropriate" to "obliged". Cases where it was not appropriate to take the turn could arise either because the current speaker had not yielded the turn, or because the turn had been yielded to the other human (Fig. 9c). The robot could also be "obliged" to take the turn, for example if a user looked at the robot and asked a direct question (Fig. 9b). In between these, there were cases where it was possible to take the turn "if needed", and cases where it was appropriate, but not obligatory, to take the turn. These were often cases where the user was ...
Context 3
... ranging from "not appropriate" to "obliged". Cases where it was not appropriate to take the turn could arise either because the current speaker had not yielded the turn, or because the turn had been yielded to the other human (Fig. 9c). The robot could also be "obliged" to take the turn, for example if a user looked at the robot and asked a direct question (Fig. 9b). In between these, there were cases where it was possible to take the turn "if needed", and cases where it was appropriate, but not obligatory, to take the turn. These were often cases where the user was attending to both the robot and the other user, or to objects on the table (Fig. 9a). Head pose (as a proxy for gaze) was a fairly informative ...
Context 4
... example if a user looked at the robot and asked a direct question (Fig. 9b). In between these, there were cases where it was possible to take the turn "if needed", and cases where it was appropriate, but not obligatory, to take the turn. These were often cases where the user was attending to both the robot and the other user, or to objects on the table (Fig. 9a). Head pose (as a proxy for gaze) was a fairly informative feature, which might not be surprising, since gaze can serve both as a turn-yielding signal and as a device for selecting the next speaker. Adding more features, such as prosody and verbal features, improved the performance further. ...
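The graded notion of turn-taking appropriateness described in these contexts can be made concrete with a small sketch. The Python fragment below is purely illustrative: the class names, features, and rules are our own assumptions, not the model from the publication. It encodes the scale from "not appropriate" to "obliged" and a toy rule over head-pose and verbal cues, whereas the work discussed above learns this mapping statistically from multimodal features.

from dataclasses import dataclass
from enum import IntEnum

class TurnAppropriateness(IntEnum):
    NOT_APPROPRIATE = 0   # speaker holds the turn, or the turn was yielded to the other human
    IF_NEEDED = 1         # the robot may take the turn if it has something to contribute
    APPROPRIATE = 2       # taking the turn is fine, but not obligatory
    OBLIGED = 3           # e.g. a user gazes at the robot and asks a direct question

@dataclass
class TurnFeatures:
    gaze_at_robot: bool              # head pose used as a proxy for gaze
    gaze_at_other_user: bool
    direct_question_to_robot: bool
    turn_yielded: bool               # e.g. prosodic and syntactic completion

def rate_turn_opportunity(f: TurnFeatures) -> TurnAppropriateness:
    """Toy rule-based stand-in for the statistical model discussed above."""
    if not f.turn_yielded or f.gaze_at_other_user:
        return TurnAppropriateness.NOT_APPROPRIATE
    if f.gaze_at_robot and f.direct_question_to_robot:
        return TurnAppropriateness.OBLIGED
    if f.gaze_at_robot:
        return TurnAppropriateness.APPROPRIATE
    return TurnAppropriateness.IF_NEEDED

print(rate_turn_opportunity(TurnFeatures(True, False, True, True)).name)  # OBLIGED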
Context 5
... shape the interaction in a more open three-party conversation. The results showed that the speaking time for most pairs of speakers was fairly imbalanced (with one participant speaking almost twice as much as the other). However, in line with other studies, the robot was able to reduce the overall imbalance by addressing the less dominant speaker (Fig. 9f). This effect was stronger when mutual gaze was established between the robot and the addressee. When the floor was open for self-selection (Fig. 9d-e), the imbalance instead ...
Context 6
... (with one participant speaking almost twice as much as the other). However, in line with other studies, the robot was able to reduce the overall imbalance by addressing the less dominant speaker (Fig. 9f). This effect was stronger when mutual gaze was established between the robot and the addressee. When the floor was open for self-selection (Fig. 9d-e), the imbalance instead ...
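As a rough sketch of the balancing strategy described in these contexts, a system could keep a running tally of each participant's speaking time and address its next turn (with mutual gaze) to the participant who has spoken the least. The snippet below is our own minimal illustration under that assumption; it is not the controller used in the cited study, and the names are hypothetical.

from collections import defaultdict

class SpeakingTimeBalancer:
    """Tracks accumulated speaking time and proposes the least dominant speaker as addressee."""

    def __init__(self, participants):
        self.speaking_time = defaultdict(float, {p: 0.0 for p in participants})

    def observe_speech(self, speaker: str, duration_s: float) -> None:
        # Accumulate voice-activity duration attributed to this participant.
        self.speaking_time[speaker] += duration_s

    def next_addressee(self) -> str:
        # Address the participant who has held the floor the least so far.
        return min(self.speaking_time, key=self.speaking_time.get)

balancer = SpeakingTimeBalancer(["user_A", "user_B"])
balancer.observe_speech("user_A", 42.0)
balancer.observe_speech("user_B", 23.5)
print(balancer.next_addressee())  # user_B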
Similar publications
Robots increasingly act as our social counterparts in domains such as healthcare and retail. For these human-robot interactions (HRI) to be effective, a question arises on whether we trust robots the same way we trust humans. We investigated whether the determinants competence and warmth, known to influence interpersonal trust development, influenc...
Citations
... Yet, more empirical work is needed to explore the nuances of group interaction, especially since social interactions frequently occur in group contexts [8]. These scenarios demand strategies that can accommodate multiple individuals simultaneously, particularly in areas such as turn-taking [9], [10], gaze behavior [11], open-domain dialogue [12], and speaker diarization [13]. As a result, adaptive conversation design has gained increasing attention, driven by advances in generative AI that support more dynamic MPIs [13]-[15]. ...
... Across this body of work, several technical and design challenges remain unresolved, including high response latency [36,37], hallucinated content, poor alignment with user expectations [38], and limited support for multilingual and dialectal variation [39]. Furthermore, very few systems have been tested with both younger and older adult users in real-world settings. ...
... Effective turn-taking remains one of the most persistent challenges in spoken human-robot interaction [36]. In our evaluation, fixed timing strategies, such as a static silence threshold of 1.2 s, proved inadequate across both age groups. ...
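The reactive strategy criticised here is easy to sketch, which also makes its limitation clear: the system declares the turn over only after a fixed stretch of silence, so hesitations longer than the threshold trigger interruptions, while genuine turn ends are detected with at least the threshold's delay. The snippet below is a generic illustration of this silence-threshold approach under assumed frame sizes and interfaces, not the cited system's implementation.

def detect_end_of_turn(voice_activity, frame_ms=20, silence_threshold_s=1.2):
    """Return the frame index at which the turn is reactively declared over, or None.

    voice_activity: iterable of booleans, one per frame, True when speech is detected.
    """
    needed_silent_frames = int(silence_threshold_s * 1000 / frame_ms)
    silent_run = 0
    for i, is_speech in enumerate(voice_activity):
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= needed_silent_frames:
            return i  # end of turn is only signalled after the full 1.2 s of silence
    return None  # turn still considered ongoing

# Example: 1 s of speech followed by silence; the turn end is reported 1.2 s after
# the speaker actually stopped, which is the latency predictive models try to avoid.
frames = [True] * 50 + [False] * 80
print(detect_end_of_turn(frames))  # 109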
Large Language Models (LLMs), particularly those enhanced through Reinforcement Learning from Human Feedback, such as ChatGPT, have opened up new possibilities for natural and open-ended spoken interaction in social robotics. However, these models are not inherently designed for embodied, multimodal contexts. This paper presents a user-centred approach to integrating an LLM into a humanoid robot, designed to engage in fluid, context-aware conversation with socially isolated older adults. We describe our system architecture, which combines real-time speech processing, layered memory summarisation, persona conditioning, and multilingual voice adaptation to support personalised, socially appropriate interactions. Through iterative development and evaluation, including in-home exploratory trials with older adults (n = 7) and a preliminary study with young adults (n = 43), we investigated the technical and experiential challenges of deploying LLMs in real-world human–robot dialogue. Our findings show that memory continuity, adaptive turn-taking, and culturally attuned voice design enhance user perceptions of trust, naturalness, and social presence. We also identify persistent limitations related to response latency, hallucinations, and expectation management. This work contributes design insights and architectural strategies for future LLM-integrated robots that aim to support meaningful, emotionally resonant companionship in socially assistive settings.
... In addition, the participants engaged in back-and-forth conversations with the virtual AI agent for 5 min. This means that interpersonal interaction and linguistics-related factors, such as utterance sequences (Solomon et al., 2021) and turn-taking patterns (Levinson, 2016; Sacks et al., 1978; Skantze, 2021), could have moderated the effect of gender matching on outcomes. ...
... Theoretically, it clarifies how universal turn-taking principles unfold within a diverse landscape of demographic and individual factors. Practically, these insights are important for the development of turn-taking models in conversational systems [23]. While there have recently been many efforts to develop predictive models of turn-taking based on human-human conversational data [24,25,26], the question remains how universal these models are and to what extent they should be conditioned on individual or demographic factors. ...
... For the development of predictive models of turn-taking in conversational systems [23], these results indicate that such models should take other factors into account than just the immediately preceding context. Conditioning the models on more long-term idiosyncratic and dyad-specific patterns is likely to improve their predictions. ...
Turn-taking in dialogue follows universal constraints but also varies significantly. This study examines how demographic (sex, age, education) and individual factors shape turn-taking using a large dataset of US English conversations (Fisher). We analyze Transition Floor Offset (TFO) and find notable interspeaker variation. Sex and age have small but significant effects: female speakers and older individuals exhibit slightly shorter offsets, while education shows no effect. Lighter topics correlate with shorter TFOs. However, individual differences have a greater impact, driven by a strong idiosyncratic and an even stronger "dyadosyncratic" component: speakers in a dyad resemble each other more than they resemble themselves in different dyads. This suggests that the dyadic relationship and joint activity are the strongest determinants of TFO, outweighing demographic influences.
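For readers unfamiliar with the measure, the Transition Floor Offset analysed in this study is the time between the end of one speaker's turn and the start of the next speaker's turn, with negative values indicating overlap. A minimal sketch of the computation from time-aligned turns is shown below; the tuple-based turn format is our own simplification, not the Fisher corpus format.

def transition_floor_offsets(turns):
    """turns: list of (speaker, start_s, end_s) tuples, sorted by start time."""
    offsets = []
    for prev, curr in zip(turns, turns[1:]):
        if curr[0] != prev[0]:                           # only count changes of speaker
            offsets.append(round(curr[1] - prev[2], 3))  # gap (> 0) or overlap (< 0) in seconds
    return offsets

turns = [("A", 0.0, 2.1), ("B", 2.3, 5.0), ("A", 4.8, 7.2)]
print(transition_floor_offsets(turns))  # [0.2, -0.2]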
... Although effortless for humans, turn-taking is highly challenging for robots. Most turn-taking algorithms in use today are based on a robot reacting to silence after a turn [4], resulting in interactions that are less fluid than those between humans [5,6]. Predictive turn-taking models (PTTMs) have been proposed to overcome these limitations [7,4]. ...
... Most turn-taking algorithms in use today are based on a robot reacting to silence after a turn [4], resulting in interactions that are less fluid than those between humans [5,6]. Predictive turn-taking models (PTTMs) have been proposed to overcome these limitations [7,4]. PTTMs learn human-like turn-taking from large corpora of human interaction, e.g. ...
... Most PTTMs use speech features [4]. Yet in human-human interaction, listeners make faster and more accurate turn-taking decisions when they can see and hear a speaker [10]. ...
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features to better exploit visual cues, with 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
... Active speaker detection (ASD) [4,27,23,18,15,26,11] aims to identify whether a visible person in a video is speaking. This task plays a critical role in various downstream applications, including speaker diarization [16,24], audiovisual speech recognition [25,2,17], and human-robot interaction [12,21,22]. To support the development of ASD models, several benchmark datasets have been proposed [20,13,4], most notably the AVA-ActiveSpeaker dataset [20], which is constructed entirely from movie content. ...
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
... Turn-taking is therefore predictive: listeners plan their next turn while the speaker is still speaking (Levinson and Torreira, 2015; Garrod and Pickering, 2015). Multimodal cues including syntax, prosody and gaze support this process (Holler et al., 2016), enabling speakers to hold the floor or to shift to another speaker (Skantze, 2021). ...
... This results in dialogue that is less spontaneous than human interaction (Li et al., 2022; Woodruff and Aoki, 2003). Predictive turn-taking models (PTTMs) aim to overcome these issues (Skantze, 2021). Inspired by human turn-taking, PTTMs are neural networks trained to continually make turn-taking predictions, e.g. the probability of an upcoming shift (Skantze, 2017). ...
... Inspired by human turn-taking, PTTMs are neural networks trained to continually make turn-taking predictions, e.g. the probability of an upcoming shift (Skantze, 2017). Most work on PTTMs has been conducted using corpora of two-party human interaction (Skantze, 2021). ...
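To make the PTTM idea referred to in these excerpts concrete, the toy model below (our own sketch, assuming PyTorch and arbitrary dimensions, and not VAP or any published architecture) consumes a stream of per-frame features and outputs, at every frame, the probability that a speaker shift is upcoming.

import torch
import torch.nn as nn

class ToyPTTM(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # one shift probability per frame

    def forward(self, frames):                 # frames: (batch, time, feat_dim)
        hidden, _ = self.rnn(frames)
        return torch.sigmoid(self.head(hidden)).squeeze(-1)   # (batch, time)

model = ToyPTTM()
features = torch.randn(1, 100, 40)             # 100 frames of toy acoustic features
shift_probability = model(features)            # continuous per-frame P(upcoming shift)
print(shift_probability.shape)                 # torch.Size([1, 100])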
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that, through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
... We define turn-taking events based on prior studies [24,25,26,27]. These events encompass turn-taking behaviors, backchannels, and interjections. ...
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents, capable of adapting behavior based on personalities, remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
... Unlike text-based communication, which is often static, asynchronous, and turn-based, voice naturally enables rich, dynamic, and human-like interactions. For example, we speak to draw attention and initiate dialogue (even when the other person is not looking), interrupt or overlap speech to signal urgency or redirect conversational flow, and use simple backchanneling cues like 'mmh' or 'yeah' to convey attentiveness and engagement when others speak (Skantze, 2021;Yang et al., 2022). In addition, voice carries rich vocal cues (such as tone, inflection, and rhythm) and subtle emotional nuances that other modalities cannot replicate (Bora, 2024;Schroeder and Epley, 2016). ...
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation, where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
... Based on the Ethnography of Communication, a communicative sequence, as defined within the act sequence component, describes the ordering of communicative acts within the same event [6]. In our study, based on the interactive patterns of turn-taking and idea-building in group discussions [26], we identified several pairs of communicative sequences, such as ask/answer, give/agree, give/disagree, and give/build on. When those communicative sequences have different events, we prompt the LLM to refer to the overall context and modify the code accordingly. ...
Dialogue data has been a key source for understanding learning processes, offering critical insights into how students engage in collaborative discussions and how these interactions shape their knowledge construction. The advent of Large Language Models (LLMs) has introduced promising opportunities for advancing qualitative research, particularly in the automated coding of dialogue data. However, the inherent contextual complexity of dialogue presents unique challenges for these models, especially in understanding and interpreting complex contextual information. This study addresses these challenges by developing a novel LLM-assisted automated coding approach for dialogue data. The novelty of our proposed framework is threefold: 1) We predict the code for an utterance based on dialogue-specific characteristics (communicative acts and communicative events) using separate prompts, following the role-prompting and chain-of-thought methods; 2) We engaged multiple LLMs, including GPT-4-turbo, GPT-4o, and DeepSeek, in collaborative code prediction; 3) We leveraged the interrelation between events and acts to implement consistency checking using GPT-4o. In particular, our contextual consistency checking provided a substantial accuracy improvement. We also found that the accuracy of act predictions was consistently higher than that of event predictions. This study contributes a new methodological framework for enhancing the precision of automated coding of dialogue data as well as offering a scalable solution for addressing the contextual challenges inherent in dialogue analysis.
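As a closing illustration of the act/event consistency checking described in this abstract, the sketch below shows only the control flow we infer from the description; the classifier functions are crude stand-ins rather than calls to any real LLM, and all names are our own.

def predict_act(utterance: str, context: list[str]) -> str:
    """Stand-in for the role-prompted, chain-of-thought communicative-act classifier."""
    return "ask" if utterance.rstrip().endswith("?") else "give"

def predict_event(utterance: str, context: list[str]) -> str:
    """Stand-in for the communicative-event classifier."""
    return "discussion"

def consistency_check(act: str, event: str, context: list[str]) -> tuple[str, str]:
    """Stand-in for the second pass that reconciles act/event pairs against the overall context."""
    return act, event   # a real implementation would re-prompt an LLM here

def code_dialogue(utterances: list[str]) -> list[tuple[str, str]]:
    codes, context = [], []
    for utt in utterances:
        act = predict_act(utt, context)          # separate prompt for the act
        event = predict_event(utt, context)      # separate prompt for the event
        act, event = consistency_check(act, event, context)
        codes.append((act, event))
        context.append(utt)                      # growing dialogue context
    return codes

print(code_dialogue(["Why does the reaction stop?", "Maybe the reagent ran out."]))
# [('ask', 'discussion'), ('give', 'discussion')]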