Gabriel Skantze
KTH Royal Institute of Technology | KTH · Department of Speech, Music and Hearing (TMH)

Professor

About

113 Publications
14,945 Reads
1,807 Citations
Additional affiliations
August 2008 - present
KTH Royal Institute of Technology
Position
  • Professor (Associate)
January 2008 - July 2008
Universität Potsdam
Position
  • PostDoc Position

Publications (113)
Conference Paper
Full-text available
We present a general model and conceptual framework for specifying architectures for incremental processing in dialogue systems, in particular with respect to the topology of the network of modules that make up the system, the way information flows through this network, how information increments are 'packaged', and how these increments are process...
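The incremental-processing framework described here builds on the idea of modules exchanging small information increments over buffers. As a rough illustration only (not the authors' reference implementation; all class and field names below are hypothetical), a minimal sketch of an incremental unit and a processing module might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class IncrementalUnit:
    iu_id: int
    payload: str                                      # e.g. a partial ASR hypothesis
    grounded_in: list = field(default_factory=list)   # IUs this one was derived from
    committed: bool = False                           # may still be revoked while False

class Module:
    """A processing module with a left (input) and a right (output) buffer."""
    def __init__(self) -> None:
        self.left_buffer: list = []
        self.right_buffer: list = []

    def process(self) -> None:
        # Consume newly arrived input IUs and produce output IUs incrementally,
        # keeping a grounded-in link back to the inputs they were built from.
        for iu in self.left_buffer:
            out = IncrementalUnit(iu.iu_id, iu.payload.upper(), grounded_in=[iu])
            self.right_buffer.append(out)
        self.left_buffer.clear()

asr = Module()
asr.left_buffer.append(IncrementalUnit(0, "number dictation"))
asr.process()
print(asr.right_buffer[0].payload)   # "NUMBER DICTATION"
```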
Conference Paper
Full-text available
This paper describes a fully incremental dialogue system that can engage in dialogues in a simple domain, number dictation. Because it uses incremental speech recognition and prosodic analysis, the system can give rapid feedback as the user is speaking, with a very short latency of around 200ms. Because it uses incremental speech synthesis an...
Conference Paper
Full-text available
In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constitu...
Article
Full-text available
This paper presents a model of incremental speech generation in practical conversational systems. The model allows a conversational system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. If these processes are time consuming and result in a response delay, the system can aut...
Article
Full-text available
In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover...
Preprint
The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of voice activity of the interlocutors. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need of labeled data. We highlight a theoretical weakness...
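To make the Voice Activity Projection objective concrete, the sketch below derives self-supervised targets from frame-level voice activity alone: each frame is labeled with the future activity of both speakers. The window length and the use of a simple mean (rather than the binned future-window representation of the paper) are assumptions for illustration:

```python
import numpy as np

def vap_targets(va: np.ndarray, horizon: int = 20) -> np.ndarray:
    """Build self-supervised Voice Activity Projection targets.

    va: (n_frames, 2) binary voice activity for the two speakers.
    Returns, for each frame, the fraction of the next `horizon` frames
    in which each speaker is active -- a simplified stand-in for the
    binned future-activity representation used in the paper.
    """
    n_frames, n_speakers = va.shape
    targets = np.zeros((n_frames, n_speakers))
    for t in range(n_frames):
        future = va[t + 1 : t + 1 + horizon]
        if len(future) > 0:
            targets[t] = future.mean(axis=0)
    return targets

# Example: speaker 0 talks first, then speaker 1 takes over the turn.
va = np.zeros((100, 2), dtype=int)
va[:50, 0] = 1
va[55:, 1] = 1
print(vap_targets(va)[48])   # near the turn shift, future activity favours speaker 1
```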
Article
Full-text available
Intelligent agents interacting with humans through conversation (such as a robot, embodied conversational agent, or chatbot) need to receive feedback from the human to make sure that its communicative acts have the intended consequences. At the same time, the human interacting with the agent will also seek feedback, in order to ensure that her comm...
Article
Full-text available
Different applications or contexts may require different settings for a conversational AI system, as it is clear that e.g., a child-oriented system would need a different interaction style than a warning system used in emergency situations. The current article focuses on the extent to which a system's usability may benefit from variation in the per...
Article
Full-text available
Feedback is an essential part of all communication, and agents communicating with humans must be able to both give and receive feedback in order to ensure mutual understanding. In this paper, we analyse multimodal feedback given by humans towards a robot that is presenting a piece of art in a shared environment, similar to a museum setting. The dat...
Preprint
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when ne...
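The core idea, a learned transformation applied on top of frozen text embeddings, can be sketched generically as a small residual adapter. This is not the CoLLIE architecture itself; the layer sizes and the 512-dimensional embedding (as in CLIP ViT-B/32) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Residual transformation over frozen text embeddings.

    A generic stand-in for the idea of adjusting language embeddings
    without retraining the underlying encoder.
    """
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Residual connection: by default the adapter leaves embeddings
        # (almost) unchanged and only shifts them where training requires it.
        return text_emb + self.mlp(text_emb)

adapter = EmbeddingAdapter()
frozen_text_emb = torch.randn(4, 512)   # would come from CLIP's text encoder
adapted = adapter(frozen_text_emb)
print(adapted.shape)                    # torch.Size([4, 512])
```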
Conference Paper
Full-text available
In this paper, we investigate the differences between human-directed speech and robot-directed speech during spontaneous human-human-robot interactions. The interactions under study are different from previous studies, in the sense that the robot has a more similar role as the human interlocutors, which leads to more spontaneous turn-taking. 20 con...
Article
The work presented here is a culmination of developments within the Swedish project COIN: Co-adaptive human-robot interactive systems, funded by the Swedish Foundation for Strategic Research (SSF), which addresses a unified framework for co-adaptive methodologies in human–robot co-existence. We investigate co-adaptation in the context of safe plann...
Poster
Full-text available
In everyday communication, the goal of speakers is to communicate their messages in an intelligible manner to their listeners. When they are aware of a speech perception difficulty on the part of the listener due to background noise, a hearing impairment, or a different native language, speakers will naturally and spontaneously modify their speech...
Article
Full-text available
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. C...
Preprint
Full-text available
Syntactic and pragmatic completeness is known to be important for turn-taking prediction, but so far machine learning models of turn-taking have used such linguistic information in a limited way. In this paper, we introduce TurnGPT, a transformer-based language model for predicting turn-shifts in spoken dialog. The model has been trained and evalua...
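TurnGPT frames turn-taking as language modeling: a causal LM trained with a dedicated turn-shift token assigns, at every position, a probability that the turn ends there. The sketch below only shows how such a probability would be read off the model's next-token distribution; the token id, the random logits, and the model are placeholders, not the actual TurnGPT setup:

```python
import torch

def turn_shift_probs(logits: torch.Tensor, ts_token_id: int) -> torch.Tensor:
    """Probability of a turn shift after each token.

    logits: (seq_len, vocab_size) next-token logits from a causal LM that
    has been trained with a dedicated turn-shift token. The token id and
    the model producing the logits are assumptions here.
    """
    probs = torch.softmax(logits, dim=-1)
    return probs[:, ts_token_id]

# Toy illustration with random "logits" and a hypothetical turn-shift token id:
vocab_size, seq_len, ts_id = 50258, 12, 50257
logits = torch.randn(seq_len, vocab_size)
p_shift = turn_shift_probs(logits, ts_id)
print(p_shift.shape)   # torch.Size([12]): one turn-shift probability per position
```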
Conference Paper
In this paper, we address the problem of how an interactive agent (such as a robot) can present information to an audience and adapt the presentation according to the feedback it receives. We extend a previous behaviour tree-based model to generate the presentation from a knowledge graph (Wikidata), which allows the agent to handle feedback increme...
Conference Paper
Full-text available
In human-human interactions, the situational context plays a large role in the degree of speakers’ accommodation. In this paper, we investigate whether the degree of accommodation in a human-robot computer game is affected by (a) the duration of the interaction and (b) the success of the players in the game. 30 teams of two players played two card...
Conference Paper
In dialogue, speakers continuously adapt their speech to accommodate the listener, based on the feedback they receive. In this paper, we explore the modelling of such behaviours in the context of a robot presenting a painting. A Behaviour Tree is used to organise the behaviour on different levels, and allow the robot to adapt its behaviour in real-...
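Behaviour trees organise such adaptive behaviour as composites (sequences, fallbacks) over condition and action leaves. The minimal sketch below illustrates the mechanism only; the node names and the toy feedback condition are hypothetical and do not reproduce the tree used in the paper:

```python
from typing import Callable, List

SUCCESS, FAILURE = "success", "failure"

class Node:
    def tick(self) -> str:
        raise NotImplementedError

class Leaf(Node):
    def __init__(self, fn: Callable[[], str]):
        self.fn = fn
    def tick(self) -> str:
        return self.fn()

class Sequence(Node):
    """Succeeds only if all children succeed, evaluated in order."""
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback(Node):
    """Tries children in order until one of them succeeds."""
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

def say(text: str) -> str:
    print(text)
    return SUCCESS

# Hypothetical feedback condition: does the listener look confused?
listener_confused = Leaf(lambda: SUCCESS)

tree = Fallback([
    Sequence([listener_confused, Leaf(lambda: say("Rephrase the last segment"))]),
    Leaf(lambda: say("Continue the presentation")),
])
tree.tick()   # prints "Rephrase the last segment"
```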
Preprint
Full-text available
Mixed reality, which seeks to better merge virtual objects and their interactions with the real environment, offers numerous potentials for the improved design of robots and our interactions with them. In this paper, we present our ongoing work towards the development of a mixed reality platform for designing social interactions with robots through...
Conference Paper
In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned by a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize...
Conference Paper
Full-text available
The adoption of conversational agents is growing at a rapid pace. Agents however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the eff...
Conference Paper
Full-text available
Mixed reality offers new potentials for social interaction experiences with virtual agents. In addition, it can be used to experiment with the design of physical robots. However, while previous studies have investigated comfortable social distances between humans and artificial agents in real and virtual environments, there is little data with rega...
Conference Paper
Humans use verbal and non-verbal cues to communicate their intent in collaborative tasks. In situated dialogue, speakers typically direct their interlocutor's attention to referent objects using multimodal cues, and references to such entities are resolved in a collaborative nature. In this study we designed a multiparty task where humans teach eac...
Preprint
Full-text available
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities...
Preprint
Full-text available
For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utteranc...
Conference Paper
Full-text available
In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a mul...
Conference Paper
Full-text available
We present a demonstration system for psychotherapy training that uses the Furhat social robot platform to implement virtual patients. The system runs an educational program with various modules, starting with training of basic psychotherapeutic skills and then moving on to tasks where these skills need to be integrated. Such training relies heavily...
Conference Paper
Full-text available
In this demo, we will showcase a platform we are currently developing for experimenting with situated interaction using mixed reality. The user will wear a Microsoft HoloLens and be able to interact with a virtual character presenting a poster. We argue that a poster presentation scenario is a good test bed for studying phenomena such as multi-part...
Conference Paper
Full-text available
In this paper, we investigate participation equality, in terms of speaking time, between users in multi-party human-robot conversations. We analyse a dataset where pairs of users (540 in total) interact with a conversational robot exhibited at a technical museum. The data encompass a wide range of different users in terms of age (adults/children) a...
Article
Full-text available
When humans interact and collaborate with each other, they coordinate their turn-taking behaviors using verbal and nonverbal signals, expressed in the face and voice. If robots of the future are supposed to engage in social interaction with humans, it is essential that they can generate and understand these behaviors. In this article, I give an ove...
Conference Paper
Full-text available
In this paper we present a dialogue system and response model that allows a robot to act as an active listener, encouraging users to tell the robot about their travel memories. The response model makes a combined decision about when to respond and what type of response to give, in order to elicit more elaborate descriptions from the user and avoid...
Article
This special issue includes research articles which apply spoken language processing to robots that interact with human users through speech, possibly combined with other modalities. Robots that can listen to human speech, understand it, interact according to the conveyed meaning, and respond represent major research and technological challenges. T...
Conference Paper
Full-text available
This paper addresses the problem of automatic detection of repeated turns in Spoken Dialogue Systems. Repetitions can be a symptom of problematic communication between users and systems. Such repetitions are often due to speech recognition errors, which in turn makes it hard to use speech recognition to detect repetitions. We present an approach to...
Conference Paper
Full-text available
In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features that are both automatically extractable and manually annotated were used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error a...
Conference Paper
In this demonstration we present a test-bed for collecting data and testing out models for multi-party, situated interaction between humans and robots. Two users are playing a collaborative card sorting game together with the robot head Furhat. The cards are shown on a touch table between the players, thus constituting a target for joint attention....
Conference Paper
Full-text available
In this demonstration we show how situated multi-party human-robot interaction can be modelled using the open source framework IrisTK. We will demonstrate the capabilities of IrisTK by showing an application where two users are playing a collaborative card sorting game together with the robot head Furhat, where the cards are shown on a touch table...
Conference Paper
Full-text available
In this paper we present a data-driven model for detecting opportunities and obligations for a robot to take turns in multi-party discussions about objects. The data used for the model was collected in a public setting, where the robot head Furhat played a collaborative card sorting game together with two users. The model makes a combined detection...
Conference Paper
In this paper, we present an experiment where two human subjects are given a team-building task to solve together with a robot. The setting requires that the speakers' attention is partly directed towards objects on the table between them, as well as to each other, in order to coordinate turn-taking. The symmetrical setup allows us to compare human...
Conference Paper
In this paper, we present a brief summary of the international workshop on Modeling Multiparty, Multimodal Interactions. The UM3I 2014 workshop is held in conjunction with the ICMI 2014 conference. The workshop will highlight recent developments and adopted methodologies in the analysis and modeling of multiparty and multimodal interactions, the de...
Article
Full-text available
In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progress...
Article
Full-text available
Traditional dialogue systems use a fixed silence threshold to detect the end of users’ turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and...
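The contrast between a fixed silence threshold and a cue-sensitive endpointing decision can be illustrated as follows; the specific thresholds and the binary prosody/syntax cues are illustrative assumptions, not the model proposed in the article:

```python
def end_of_turn_fixed(silence_ms: float, threshold_ms: float = 700) -> bool:
    """Traditional endpointing: take the turn after a fixed amount of silence."""
    return silence_ms >= threshold_ms

def end_of_turn_cue_based(silence_ms: float,
                          pitch_falling: bool,
                          syntactically_complete: bool) -> bool:
    """Illustrative cue-based decision (thresholds and cues are assumptions):
    respond quickly when prosody and syntax signal completion, wait longer
    when they do not."""
    required = 200 if (pitch_falling and syntactically_complete) else 1500
    return silence_ms >= required

print(end_of_turn_fixed(500))                    # False: still waiting out the silence
print(end_of_turn_cue_based(300, True, True))    # True: early, confident response
print(end_of_turn_cue_based(800, False, False))  # False: hold back despite the pause
```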
Conference Paper
Full-text available
In this paper, we present a data-driven chunking parser for automatic interpretation of spoken route directions into a route graph that is useful for robot navigation. Different sets of features and machine learning algorithms are explored. The results indicate that our approach is robust to speech recognition errors.
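A route graph in this sense is essentially a sequence of route segments, each pairing an action with a landmark. The toy structure below is an assumption for illustration and is simpler than the conceptual route graphs used in the work:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RouteSegment:
    action: str          # e.g. "turn_left", "continue"
    landmark: str        # e.g. "church", "second crossing"

@dataclass
class RouteGraph:
    segments: List[RouteSegment] = field(default_factory=list)

    def add(self, action: str, landmark: str) -> None:
        self.segments.append(RouteSegment(action, landmark))

route = RouteGraph()
route.add("continue", "main street")
route.add("turn_left", "church")
print([(s.action, s.landmark) for s in route.segments])
```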
Conference Paper
Full-text available
We present a technique for crowd-sourcing street-level geographic information using spoken natural language. In particular, we are interested in obtaining first-person-view information about what can be seen from different positions in the city. This information can then for example be used for pedestrian routing services. The approach has been tes...
Conference Paper
This paper describes a novel experimental setup exploiting state-of-the-art capture equipment to collect a multimodally rich game-solving collaborative multiparty dialogue corpus. The corpus is targeted and designed towards the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interac...
Conference Paper
In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonve...
Conference Paper
Furhat [1] is a robot head that deploys a back-projected animated face that is realistic and human-like in anatomy. Furhat relies on a state-of-the-art facial animation architecture allowing accurate synchronized lip movements with speech, and the control and generation of non-verbal gestures, eye movements and facial expressions. Furhat is built t...
Conference Paper
In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonv...
Conference Paper
This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents....
Conference Paper
Full-text available
We present a data collection setup for exploring turn-taking in three-party human-robot interaction involving objects competing for attention. The collected corpus comprises 78 minutes in four interactions. Using automated techniques to record head pose and speech patterns, we analyze head pose patterns in turn-transitions. We find that introductio...
Conference Paper
In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of complex and advanced dialogue toolkits designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. T...
Conference Paper
We present a data-driven model for detecting suitable response locations in the user’s speech. The model has been trained on human–machine dialogue data and implemented and tested in a spoken dialogue system that can perform the Map Task with users. To our knowledge, this is the first example of a dialogue system that uses automatically extracted s...
Conference Paper
The demonstrator presents a test-bed for collecting data on human–computer dialogue: a fully automated dialogue system that can perform Map Task with a user. In a first step, we have used the test-bed to collect human–computer Map Task dialogue data, and have trained various data-driven models on it for detecting feedback response locations in the...
Article
Full-text available
In this paper, we present Furhat — a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human–robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to...
Article
Full-text available
Whereas much of the state-of-the-art research in Human-Robot Interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public...
Conference Paper
We present a human evaluation of the usefulness of conceptual route graphs (CRGs) when it comes to route following using spoken route descriptions. We describe a method for data-driven semantic interpretation of route descriptions into CRGs. The comparable performances of human participants in sketching a route using the manually transcribed CRGs a...
Article
Full-text available
This paper investigates forms and functions of user feedback in a map task dialogue between a human and a robot, where the robot is the instruction-giver and the human is the instruction-follower. First, we investigate how user acknowledgements in task-oriented dialogue signal whether an activity is about to be initiated or has been completed. The...
Conference Paper
Accurate human perception of robots' gaze direction is crucial for the design of a natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment targeted at quantifying the effects of different gaze cues synthesized using the Furhat back-projected robot head, on the accuracy of...