Conference Paper

An Attentive Listening System with Android ERICA: Comparison of Autonomous and WOZ Interactions

... Furthermore, ERICA produces a variety of visual cues, encompassing facial expressions, synchronized lip movements during speech, blinking, and nodding gestures. ERICA showcases versatility by assuming distinctive roles, including attentive listening [5], conducting practice job interviews [6,7], and serving as a laboratory guide [8]. The primary objective of ERICA is to replicate human-like communication abilities across a broad range of scenarios. ...
... Attentive listening involves the system actively engaging in speech perception, attentively listening to the subject's spoken communication. Attentive listening with conversational robots offers notable benefits, particularly in offering companionship to elderly individuals [5] and patients in psychiatric daycare settings [9]. These systems demonstrate attentiveness by generating backchannels, asking follow-up questions, and offering empathetic responses. ...
... Conversation topics varied, with participants discussing subjects such as food, travel, or the challenges posed by the COVID-19 pandemic. The attentive listening system used in this work was developed by [5]. ...
Preprint
This study examined users' behavioral differences in a large corpus of Japanese human-robot interactions, comparing interactions between a tele-operated robot and an autonomous dialogue system. We analyzed user spoken behaviors in both attentive listening and job interview dialogue scenarios. Results revealed significant differences in metrics such as speech length, speaking rate, fillers, backchannels, disfluencies, and laughter between operator-controlled and autonomous conditions. Furthermore, we developed predictive models to distinguish between operator and autonomous system conditions. Our models demonstrated higher accuracy and precision compared to the baseline model, with several models also achieving a higher F1 score than the baseline.
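The condition-prediction setup described in this preprint can be illustrated with a few lines of scikit-learn. The feature layout, synthetic data, and model choice below are placeholder assumptions, not the authors' actual pipeline:

```python
# Minimal sketch: classify operator-controlled vs. autonomous dialogues
# from per-dialogue behavioral features. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Placeholder features: 100 dialogues x 6 behavioral measures, e.g.
# [speech length, speaking rate, fillers, backchannels, disfluencies, laughter]
X = rng.normal(size=(100, 6))
# Synthetic labels with some signal: 1 = operator-controlled, 0 = autonomous.
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
print("F1 per fold:", cross_val_score(clf, X, y, cv=5, scoring="f1").round(3))
```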
... This task has been found to be useful for elderly people living alone who desire social interaction. We have so far developed an attentive listening dialogue system using an autonomous android ERICA (Inoue et al., 2020) that is capable of generating listener responses such as backchannels (e.g., "Yeah"), repeats of focus words, elaborating questions, and assessments (e.g., "That is nice"). ...
... 2 Multi-party attentive listening system First, the basic attentive listening system (Inoue et al., 2020) used in this study is explained. As illustrated in Figure 2, the system generates listener responses such as backchannels, assessments, elaborating questions, repeats, and generic responses, with the speech enhancement and automatic speech recognition implemented through a 16-channel microphone array. ...
... In the previous one-on-one attentive listening system, the assessment responses such as "That is nice" had been generated on the basis of sentiment analysis (positive, negative, or neutral) using sentiment word dictionaries (Inoue et al., 2020). The assessment responses have been used to express empathy towards the speaker, which is an important role in ERICA. ...
... In this dialogue, the task is to attentively listen to a user's talk, and the system needs to utter the listener responses, such as backchannels and questions. Several attentive listening systems have been proposed so far [9,17,19]. In this instance, we employed an existing system [9] for this data collection. ...
... Several attentive listening systems have been proposed so far [9,17,19]. In this instance, we employed an existing system [9] for this data collection. The interface of the system was an android robot [10] whose appearance is similar to that of human beings. ...
... Thus, there were a total of 69 university students acting as users. After each dialogue concluded, the participants were asked to answer a 19-item questionnaire evaluation created in a previous study [9]. ...
Preprint
Full-text available
This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evaluate the robot's human-likeness based on observable user behaviors indirectly, thus enhancing objectivity and reproducibility. To begin, we created an annotated dataset of human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then conducted an analysis to determine the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method.
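A minimal sketch of the correlation step this preprint describes, using SciPy on placeholder data (the actual behaviors and score annotations come from the attentive listening corpus):

```python
# Relate a multimodal user-behavior measure (e.g. backchannel rate) to
# annotated human-likeness scores. Values below are illustrative only.
from scipy.stats import pearsonr, spearmanr

behavior = [0.12, 0.30, 0.25, 0.40, 0.18, 0.33]   # per-dialogue behavior rate
human_likeness = [3.0, 4.5, 4.0, 5.0, 2.5, 4.0]   # annotated scores

r, p = pearsonr(behavior, human_likeness)
rho, p_s = spearmanr(behavior, human_likeness)
print(f"Pearson r={r:.2f} (p={p:.3f}), Spearman rho={rho:.2f} (p={p_s:.3f})")
```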
... All of the artifact-shaped robots found in the papers are robotic wheelchairs [25,38,39,90], which are used for navigation. The papers using android robots all use ERICA as the robot [42,48,49,64]. One of the robots was not described sufficiently to classify it [34]. ...
... [81] also use an FSM for tracking the general state of the dialogue system, but make use of external services for questions and declarative utterances. [49] use logistic regression to select so-called experts, which can have different strategies, while [42] rely on a priority system in combination with additional triggers for a backchannel and backup question module. Backchannels are short responses that are integrated into the conversation while listening, to show engagement (e.g. ...
... This variety makes it challenging to select the most appropriate solution in cases where an existing framework is used. Due to the tight coupling of modules in a robot, the dialogue managers are also influenced by the problems of other modules, such as speech recognition errors [30,42,49,76]. Apart from ...
Preprint
Full-text available
As social robots see increasing deployment within the general public, improving the interaction with those robots is essential. Spoken language offers an intuitive interface for the human-robot interaction (HRI), with dialogue management (DM) being a key component in those interactive systems. Yet, to overcome current challenges and manage smooth, informative and engaging interaction a more structural approach to combining HRI and DM is needed. In this systematic review, we analyse the current use of DM in HRI and focus on the type of dialogue manager used, its capabilities, evaluation methods and the challenges specific to DM in HRI. We identify the challenges and current scientific frontier related to the DM approach, interaction domain, robot appearance, physical situatedness and multimodality.
... In this context, we have been developing an intelligent conversational android ERICA to be engaged in human-level dialogue [1]. At the moment, it can perform attentive listening with senior people for five to seven minutes, but subjective evaluations suggest that the quality of dialogue is still behind that of a human (Wizard of Oz) dialogue [2]. There are many remaining challenges ahead before the realization of truly human-level conversational robots. ...
... Attentive listening is also set up for senior people for maintaining communication skills and refreshing memories. We have developed an autonomous attentive listening system using the android ERICA, which can take on the majority of the listener role in this task [2]. Thus, it can be extended to the proposed system allowing multiple users to talk in parallel. ...
... The base attentive listening system [2] is briefly explained below. The system generates various listener responses such as backchannels, repeats, elaborating questions, assessments, and generic responses, as depicted in Figure 3. Backchannels are short responses such as 'Yeah' in English and 'Un' in Japanese. ...
Article
Full-text available
Many people are now engaged in remote conversations for a wide variety of scenes such as interviewing, counseling, and consulting, but there is a limited number of skilled experts. We propose a novel framework of parallel conversations with semi-autonomous avatars, where one operator collaborates with several remote robots or agents simultaneously. The autonomous dialogue system mostly manages the conversation, but switches to the human operator when necessary. This framework circumvents the requirement for autonomous systems to be completely perfect. Instead, we need to detect dialogue breakdown or disengagement. We present a prototype of this framework for attentive listening.
... An attentive listening system (Inoue et al. 2020) was developed by integrating the aforementioned modules. Attentive listening is useful for seniors who need to be heard and maintain their communication skills. ...
... The details of the evaluation are provided in (Inoue et al. 2020). The system can also be combined with sophisticated chatbots using large language models, such as GPTs, by generating a simple response on the fly before obtaining an elaborate response from the server. ...
Chapter
Full-text available
Speech technology has made significant advances with the introduction of deep learning and large datasets, enabling automatic speech recognition and synthesis at a practical level. Dialogue systems and conversational AI have also achieved dramatic advances based on the development of large language models. However, the application of these technologies to humanoid robots remains challenging because such robots must operate in real time and in the real world. This chapter reviews the current status and challenges of spoken dialogue technology for communicative robots and virtual agents. Additionally, we present a novel framework for the semi-autonomous cybernetic avatars investigated in this study.
... They function as feedback mechanisms, signaling attentiveness, understanding, and agreement without interrupting the speaker. Accurate prediction and generation of backchannels in spoken dialogue systems are essential for creating more natural and human-like interactions (Schroder et al., 2011;DeVault et al., 2014;Inoue et al., 2020b). Although some definitions of backchannels include longer and more linguistic tokens such as "I see," this work focuses on short tokens that are frequently and dynamically used by listeners. ...
... In this experiment, the android ERICA was employed, with a human operator remotely controlling it; the operator's speech was transmitted and played through ERICA's speaker system (Figure 2). The dialogue task focused on attentive listening, where human participants shared personal experiences, and ERICA actively engaged as a listener (Inoue et al., 2020b; Lala et al., 2017). This task was advantageous because it allowed for the collection of numerous backchannel responses by ERICA. ...
Preprint
Full-text available
In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
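The frame-wise timing-and-type formulation above can be sketched as a two-head sequence model. This PyTorch stub is an illustrative stand-in, not the fine-tuned VAP model from the paper:

```python
# Illustrative frame-wise predictor: one head scores backchannel onset
# per frame, a second scores its type. Shows only the input/output
# shape of the task described above.
import torch
import torch.nn as nn

class FramewiseBackchannel(nn.Module):
    def __init__(self, feat_dim=128, n_types=2):   # e.g. continuer vs. assessment
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.timing_head = nn.Linear(128, 1)       # P(onset) per frame
        self.type_head = nn.Linear(128, n_types)   # type logits per frame

    def forward(self, frames):                     # frames: (batch, T, feat_dim)
        h, _ = self.encoder(frames)
        return torch.sigmoid(self.timing_head(h)), self.type_head(h)

model = FramewiseBackchannel()
timing, types = model(torch.randn(1, 500, 128))    # 500 frames of features
print(timing.shape, types.shape)                   # (1, 500, 1) (1, 500, 2)
```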
... To enhance the naturalness of ERICA's behavior during the conversation, a random gazing model is also introduced. ERICA normally does speaker tracking using Kinect (Inoue et al., 2016, 2020), but since the participant in our case is not on-site, we model gazing behavior as a random uniform sampling of a gaze point near the webcam. The gaze point is randomly changed within a hollow cylinder from the center of the webcam with an outer radius of 0.3 m, an inner radius of 0.05 m, and a width of 0.2 m. ...
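The hollow-cylinder sampling in the excerpt above is concrete enough to sketch directly; the axis conventions in this snippet are assumptions:

```python
# Uniform sampling from the hollow cylinder described above (inner
# radius 0.05 m, outer radius 0.3 m, width 0.2 m), with z assumed to
# run along the camera axis.
import math
import random

def sample_gaze_point(r_in=0.05, r_out=0.3, width=0.2):
    r = math.sqrt(random.uniform(r_in**2, r_out**2))  # uniform over the annulus
    theta = random.uniform(0.0, 2.0 * math.pi)
    z = random.uniform(-width / 2.0, width / 2.0)     # along the camera axis
    return (r * math.cos(theta), r * math.sin(theta), z)

print(sample_gaze_point())   # (x, y, z) offset in meters from the webcam center
```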
... Empathetic dialogue systems are attracting more interest in the field of psychiatry as well (Vaidyam et al., 2019), especially those equipped with nonverbal features (DeVault et al., 2014; Rizzo et al., 2011). In addition, Inoue et al. (2016, 2020) utilized ERICA's nonverbal features to make her more empathetic in more generic situations. ...
Preprint
Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a controlled simulation experiment to assess the effects of the user interface, a web-based virtual agent called Nora vs. the android ERICA via a video call. The experimental results show that the android offers a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures.
... We implemented an attentive listening system with ERICA [6,26]. The aforementioned backchannel generation model is a core component in this system. ...
... We also manually evaluated each system response and found that about 60% of the system responses were acknowledged as appropriate, which means the other 40% of responses can be improved. The novelty of our experiment is to compare our system with a human listener implemented in a WOZ (Wizard-of-Oz) setting with a hidden human operator tele-operating ERICA [26]. From the subjective evaluation, our system achieved comparable scores against the WOZ setting in basic skills of attentive listening such as encouragement to talk, focused on the talk, and actively listening. ...
Preprint
Full-text available
Following the success of spoken dialogue systems (SDS) in smartphone assistants and smart speakers, a number of communicative robots are developed and commercialized. Compared with the conventional SDSs designed as a human-machine interface, interaction with robots is expected to be in a closer manner to talking to a human because of the anthropomorphism and physical presence. The goal or task of dialogue may not be information retrieval, but the conversation itself. In order to realize human-level "long and deep" conversation, we have developed an intelligent conversational android ERICA. We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating. To allow for spontaneous, incremental multiple utterances, a robust turn-taking model is implemented based on TRP (transition-relevance place) prediction, and a variety of backchannels are generated based on time frame-wise prediction instead of IPU-based prediction. We have realized an open-domain attentive listening system with partial repeats and elaborating questions on focus words as well as assessment responses. It has been evaluated with 40 senior people, engaged in conversation of 5-7 minutes without a conversation breakdown. It was also compared against the WOZ setting. We have also realized a job interview system with a set of base questions followed by dynamic generation of elaborating questions. It has also been evaluated with student subjects, showing promising results.
... Employing a Task Adaptive Pre-Training (TAPT) approach with a BERT-based model, our method demonstrated superior performance across all modules compared to other models, including a random baseline, the baseline BERT, and ChatGPT, in both textual-based dialogue and spoken dialogue settings. As a direction for future research, we aim to conduct user experiments using the conversational robot [27]. This will enable us to evaluate our model's efficacy in complex, real-time conversational settings, further validating the utility of our proposed framework. ...
Conference Paper
Full-text available
In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing the Japanese EmpatheticDialogues dataset, a text-based dialogue dataset consisting of 8 emotional categories from Plutchik's wheel of emotions, the Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both the random baseline and ChatGPT, in terms of F1-score, in all modules. Further validation of our model's efficacy is confirmed in its application to the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, by surpassing both the random baseline and ChatGPT. This consistent performance across both textual and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication.
... Backchannels also play an important role in cultivating a supportive environment that builds rapport and trust between individuals [5]. Therefore, for a robot or dialogue system to act like a human and increase user satisfaction, it must not only speak fluently but also select and use the appropriate backchannels [6,7]. ...
... We used a Japanese conversational speech corpus provided by the JST ERATO project, which consists of spoken dialogue via Wizard of Oz using android ERICA [25]. The conversation task was one-on-one attentive listening, in which the participant in the role of a speaker told his/her experiences and impressions to a listener. ...
... In call center applications, certain studies [1], [2] argue that autonomous agents are beneficial for easy and simple cases, while human intervention is desirable for more advanced and complex cases to solve problems swiftly. Other studies [3], [4] report that while the performance of basic functions, such as backchannel communication, by autonomous dialogue agents is comparable to that of human operators, the performance of advanced functions, such as showing empathy, is insufficient. Consequently, semiautonomous teleoperation systems, in which an operator takes over conversations from autonomous robots when needed, become essential. ...
Article
Full-text available
Recently, teleoperation systems have been developed enabling a single operator to engage with users across multiple locations simultaneously. However, under such systems, a potential challenge exists where the operator, upon switching locations, may need to join ongoing conversations without a complete understanding of their history. Consequently, a seamless transition and the development of high-quality conversations may be impeded. This study directs its attention to the utilization of multiple robots, aiming to create a semiautonomous teleoperation system. This system enables an operator to switch between twin robots at each location as needed, thereby facilitating the provision of higher-quality dialogue services simultaneously. As an initial phase, a field experiment was conducted to assess user satisfaction with recommendations made by the operator using twin robots. Results collected from 391 participants over 13 days revealed heightened user satisfaction when the operator intervened and provided recommendations through multiple robots compared with autonomous recommendations by the robots. These findings contribute to the formulation of a teleoperation system that allows a single operator to deliver multipoint conversational services.
... Creating chatbots to behave like real people is important in terms of believability (Traum et al., 2015; Higashinaka et al., 2018). Errors in general chatbots and chatbots that follow a rough persona (Li et al., 2016; Zhang et al., 2018; Zhou et al., 2020; Inoue et al., 2020; Song et al., 2020) have been studied, but those in chatbots that behave like real people have not been thoroughly investigated. ...
... Jin et al. [22] demonstrate an outbound agent that could recognize user interruptions and discontinuous expressions. Inoue et al. [21] built an android called ERICA with backchannel responses for attentive listening. ...
Preprint
Full-text available
In this paper, we present Duplex Conversation, a multi-turn, multimodal spoken dialogue system that enables telephone-based agents to interact with customers like a human. We use the concept of full-duplex in telecommunication to demonstrate what a human-like interactive experience should be and how to achieve smooth turn-taking through three subtasks: user state detection, backchannel selection, and barge-in detection. Besides, we propose semi-supervised learning with multimodal data augmentation to leverage unlabeled data to increase model generalization. Experimental results on three sub-tasks show that the proposed method achieves consistent improvements compared with baselines. We deploy the Duplex Conversation to Alibaba intelligent customer service and share lessons learned in production. Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
Article
As social robots see increasing deployment within the general public, improving the interaction with those robots is essential. Spoken language offers an intuitive interface for the human-robot interaction (HRI), with dialogue management (DM) being a key component in those interactive systems. Yet, to overcome current challenges and manage smooth, informative and engaging interaction a more structural approach to combining HRI and DM is needed. In this systematic review, we analyse the current use of DM in HRI and focus on the type of dialogue manager used, its capabilities, evaluation methods and the challenges specific to DM in HRI. We identify the challenges and current scientific frontier related to the DM approach, interaction domain, robot appearance, physical situatedness and multimodality.
Chapter
Previous studies on backchannel prediction models have suggested that replicating human backchannels can enhance the user's human-robot interaction experience. In this study, we propose a real-time non-verbal backchannel prediction model which utilizes both an acoustic feature and a temporal feature. Our goal is to improve the quality of the robot's backchannels and the user's experience. To conduct this research, we collected a human-human interview dataset. Using this dataset, we developed three distinct backchannel prediction models: a temporal, an acoustic, and a mixed (temporal & acoustic) model. Subsequently, we conducted a user study to compare the perception of robots implemented with the three models. The results demonstrated that the robot employing the mixed model was preferred by participants and exhibited a moderate frequency of backchannels. These results emphasize the advantages of incorporating acoustic and temporal features in developing backchannel prediction models to enhance the quality of human-robot interactions, specifically with regard to backchannel frequency and timing.
Chapter
To assess the conversational proficiency of language learners, it is essential to elicit samples that are representative of the learner's full linguistic ability. This is realized through the adjustment of oral interview questions to the learner's perceived proficiency level. An automatic system eliciting ratable samples must incrementally predict the approximate proficiency from a few turns of dialog and employ an adaptable question generation strategy according to this prediction. This study investigates the feasibility of such incremental adjustment of oral interview question difficulty during the interaction between a virtual agent and a learner. First, we create an interview scenario with questions designed for different levels of proficiency and collect interview data using a Wizard-of-Oz virtual agent. Next, we build an incremental scoring model and analyze its accuracy. Finally, we discuss the future direction of automated adaptive interview system design.
Article
Full-text available
Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding. Thus, implementing laughter in existing systems, such as in conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from user dialogue, we focus on “shared laughter,” where a user laughs using either solo or speech laughs (initial laugh), and the system laughs in turn (response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model using a human-robot speed dating dialogue corpus. For the first model, a recurrent neural network was applied, and the detection performance achieved an F1 score of 82.6%. The second model used the acoustic and prosodic features of the initial laugh and achieved a prediction accuracy above that of the random prediction. The third model selects the type of system’s response laugh as social or mirthful laugh based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved the impression of the dialogue system such as empathy perception compared to a naive baseline without laughter and a reactive system that always responded with only social laughs. We propose that our system can be used for situated robot interaction and also emphasize the need for integrating proper empathetic laughs into conversational robots and agents.
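The tripartite design in this abstract (detect the initial laugh, decide whether to share it, then choose the laugh type) reduces to a short decision cascade. The stand-in predictors below are trivial placeholders for the paper's trained models:

```python
# Control-flow sketch of the three-stage shared-laughter design. The
# three predictors are placeholders for the paper's trained models
# (an RNN laugh detector and acoustic/prosodic classifiers).
def respond_to_laughter(features, detect_initial_laugh,
                        predict_shared_laugh, select_laugh_type):
    if not detect_initial_laugh(features):
        return None                 # no initial (user) laugh detected
    if not predict_shared_laugh(features):
        return None                 # laugh detected, but the system stays quiet
    return select_laugh_type(features)  # "social" or "mirthful" response laugh

response = respond_to_laughter(
    {"f0_mean": 220.0, "energy": 0.8},
    detect_initial_laugh=lambda f: f["energy"] > 0.5,
    predict_shared_laugh=lambda f: f["f0_mean"] > 200.0,
    select_laugh_type=lambda f: "mirthful" if f["energy"] > 0.7 else "social",
)
print(response)  # -> "mirthful"
```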
Article
An attentive listening system for autonomous android ERICA is presented. Our goal is to realize a humanlike natural attentive listener for elderly people. The proposed system generates listener responses: backchannels, repeats, elaborating questions, assessments, and generic responses. The system incorporates speech processing using a microphone array and real-time dialogue processing including continuous backchannel prediction and turn-taking prediction. In this study, we conducted a dialogue experiment with elderly people. The system was compared with a WOZ system where a human operator played the listener role behind the robot. As a result, the system showed comparable scores in basic skills of attentive listening, such as easy to talk, seriously listening, focused on the talk, and actively listening. It was also found that there is still a gap between the system and the human (WOZ) for high-level attentive listening skills such as dialogue understanding, showing interest, and empathy towards the user.
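The response repertoire in this abstract suggests a simple selection policy. The sketch below is an assumed illustration of such prioritization (focus-word responses before sentiment-based assessments before generic fallbacks), not the paper's exact logic:

```python
# Assumed illustration of listener-response prioritization: prefer a
# content-grounded response when a focus word is available, then a
# sentiment-based assessment, then a generic fallback. Continuous
# backchannels would run in parallel with this selection.
def select_listener_response(focus_word, sentiment):
    if focus_word:
        return focus_word + "?"        # elaborating question on the focus word
    if sentiment == "positive":
        return "That is nice."         # assessment response
    if sentiment == "negative":
        return "That sounds hard."     # assessment response
    return "I see."                    # generic response

print(select_listener_response("Kyoto", "neutral"))   # -> "Kyoto?"
print(select_listener_response(None, "positive"))     # -> "That is nice."
```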
Article
In dialogue systems, conveying the understanding results of user utterances is important because it enables the users to feel understood by the system. However, it is not clear which types of understanding results should be conveyed to users; some utterances may be offensive, for example, and some may be too commonsensical. In this paper, we explore the effect of conveying the understanding results of user utterances in a chat-oriented dialogue system by an experiment with human subjects. We created system utterances conveying various understanding results and then investigated which types of results were favorable and unfavorable through manual evaluation. We found that understanding results referring to a user’s internal state were unfavorable, while those related to user’s positive attributes were likely to be favorable. In addition, understanding results related to objective facts about users or general facts unrelated to users were favorable. These findings can be used as a guideline for constructing a dialogue system that conveys appropriate understanding results.
Conference Paper
Full-text available
In this paper we present a dialogue system and response model that allows a robot to act as an active listener, encouraging users to tell the robot about their travel memories. The response model makes a combined decision about when to respond and what type of response to give, in order to elicit more elaborate descriptions from the user and avoid non-sequitur responses. The model was trained on human-robot dialogue data collected in a Wizard-of-Oz setting, and evaluated in a fully autonomous version of the same dialogue system. Compared to a baseline system, users perceived the dialogue system with the trained model to be a significantly better listener. The trained model also resulted in dialogues with significantly fewer mistakes, a larger proportion of user speech and fewer interruptions.
Conference Paper
Full-text available
The development of an android with convincingly lifelike appearance and behavior has been a long-standing goal in robotics, and recent years have seen great progress in many of the technologies needed to create such androids. However, it is necessary to actually integrate these technologies into a robot system in order to assess the progress that has been made towards this goal and to identify important areas for future work. To this end, we are developing ERICA, an autonomous android system capable of conversational interaction, featuring advanced sensing and speech synthesis technologies, and arguably the most humanlike android built to date. Although the project is ongoing, initial development of the basic android platform has been completed. In this paper we present an overview of the requirements and design of the platform, describe the development process of an interactive application, report on ERICA's first autonomous public demonstration, and discuss the main technical challenges that remain to be addressed in order to create humanlike, autonomous androids.
Article
Full-text available
In Human-Humanoid Interaction (HHI), empathy is the crucial key to overcoming the current limitations of social robots. In fact, a principal defining characteristic of human social behaviour is empathy. The present paper presents a robotic architecture for an android robot as a basis for natural empathic human-android interaction. We start from the hypothesis that robots, in order to become personal companions, need to know how to interact empathically with human beings. To validate our research, we used the proposed system with the minimalistic humanoid robot Telenoid. We conducted human-robot interaction tests with elderly people who had no prior interaction experience with robots. During the experiment, elderly persons engaged in conversation with the humanoid robot. Our goal is to overcome the loneliness of elderly people using this minimalistic humanoid robot, which is capable of exhibiting a dialogue similar to what usually happens in real life between human beings. The experimental results have shown a humanoid robotic system capable of natural and empathic interaction and conversation with a human user.
Conference Paper
Full-text available
We present SimSensei Kiosk, an implemented virtual human interviewer designed to create an engaging face-to-face interaction where the user feels comfortable talking and sharing information. SimSensei Kiosk is also designed to create interactional situations favorable to the automatic assessment of distress indicators, defined as verbal and nonverbal behaviors correlated with depression, anxiety or post-traumatic stress disorder (PTSD). In this paper, we summarize the design methodology, performed over the past two years, which is based on three main development cycles: (1) analysis of face-to-face human interactions to identify potential distress indicators, dialogue policies and virtual human gestures, (2) development and analysis of a Wizard-of-Oz prototype system where two human operators were deciding the spoken and gestural responses, and (3) development of a fully automatic virtual interviewer able to engage users in 15-25 minute interactions. We show the potential of our fully automatic virtual human interviewer in a user study, and situate its performance in relation to the Wizard-of-Oz prototype.
Conference Paper
Full-text available
We explored the potential of teleoperated android robots, which are embodied telecommunication media with humanlike appearances, and how they affect people in the real world when they are employed to express a telepresence and a sense of ‘being there’. In Denmark, our exploratory study focused on the social aspects of Telenoid, a teleoperated android, which might facilitate communication between senior citizens and Telenoid’s operator. After applying it to the elderly in their homes, we found that the elderly assumed positive attitudes toward Telenoid, and their positivity and strong attachment to its huggable minimalistic human design were cross-culturally shared in Denmark and Japan. Contrary to the negative reactions by non-users in media reports, our result suggests that teleoperated androids can be accepted by the elderly as a kind of universal design medium for social inclusion.
Conference Paper
Full-text available
During face-to-face interactions, listeners use backchannel feedback such as head nods as a signal to the speaker that the communication is working and that they should continue speaking. Predicting these backchannel opportunities is an important milestone for building engaging and natural virtual humans. In this paper we show how sequential probabilistic models (e.g., Hidden Markov Model or Conditional Random Fields) can automatically learn from a database of human-to-human interactions to predict listener backchannels using the speaker multimodal output features (e.g., prosody, spoken words and eye gaze). The main challenges addressed in this paper are automatic selection of the relevant features and optimal feature representation for probabilistic models. For prediction of visual backchannel cues (i.e., head nods), our prediction model shows a statistically significant improvement over a previously published approach based on hand-crafted rules.
Conference Paper
Full-text available
We manually designed rules for a backchannel (BC) prediction model based on pitch and pause information. In short, the model predicts a BC when there is a pause of a certain length that is preceded by a falling or rising pitch. This model was validated against the Dutch IFADV Corpus in a corpus-based evaluation method. The results showed that our model performs slightly better than another well-known rule-based BC prediction model that uses only pitch information. We observed that the length of a pause preceding a BC is one of the important features in this model, next to the duration of the pitch slope at the end of an utterance. Further, we discuss implications of a corpus-based approach to BC prediction evaluation.
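The rule in this abstract translates almost directly into code; the thresholds below are illustrative, not the tuned values validated on the IFADV corpus:

```python
# Direct encoding of the rule: a backchannel is predicted when a pause
# of at least min_pause_ms follows a sufficiently falling or rising
# pitch slope. Threshold values here are assumptions.
def predict_backchannel(pause_ms, pitch_slope_hz_per_s,
                        min_pause_ms=200, min_abs_slope=20.0):
    pitch_cue = abs(pitch_slope_hz_per_s) >= min_abs_slope  # falling or rising
    return pause_ms >= min_pause_ms and pitch_cue

print(predict_backchannel(pause_ms=350, pitch_slope_hz_per_s=-35.0))  # True
print(predict_backchannel(pause_ms=120, pitch_slope_hz_per_s=-35.0))  # False
```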
Conference Paper
Full-text available
Automatic extraction of human opinions from Web documents has been receiving increasing interest. To automate the process of opinion extraction, having a collection of evaluative expressions such as "the seats are comfortable" would be useful. However, it can be costly to manually create an exhaustive list of such expressions for many domains, because they tend to be domain-dependent. Motivated by this, we have been exploring ways to accelerate the process of collecting evaluative expressions by applying a text mining technique. This paper proposes a semi-automatic method that uses particular cooccurrence patterns of evaluated subjects, focused attributes and value expressions.
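A toy version of the co-occurrence idea, with a single hypothetical attribute-value pattern standing in for the paper's mined patterns:

```python
# Toy pattern-based harvest of (subject, attribute, value) candidates
# for evaluative expressions. The single regex is a stand-in for the
# co-occurrence patterns mined in the paper.
import re

PATTERN = re.compile(r"the (\w+) (?:is|are) (\w+)")
reviews = {"car-X": ["the seats are comfortable", "the engine is noisy"]}

for subject, texts in reviews.items():
    for text in texts:
        for attribute, value in PATTERN.findall(text):
            print(subject, attribute, value)   # e.g. car-X seats comfortable
```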
Conference Paper
Turn-taking in human-robot interaction is a crucial part of spoken dialogue systems, but current models do not allow for human-like turn-taking speed seen in natural conversation. In this work we propose combining two independent prediction models. A continuous model predicts the upcoming end of the turn in order to generate gaze aversion and fillers as turn-taking cues. This prediction is done while the user is speaking, so turn-taking can be done with little silence between turns, or even overlap. Once a speech recognition result has been received at a later time, a second model uses the lexical information to decide if or when the turn should actually be taken. We constructed the continuous model using the speaker’s prosodic features as inputs and evaluated its online performance. We then conducted a subjective experiment in which we implemented our model in an android robot and asked participants to compare it to one without turn-taking cues, which produces a response when a speech recognition result is received. We found that using both gaze aversion and a filler was preferred when the continuous model correctly predicted the upcoming end of turn, while using only gaze aversion was better if the prediction was wrong.
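The two-model combination can be sketched as two callbacks, one on the continuous prosodic stream and one on the delayed ASR result; the threshold and the lexical rule here are assumptions:

```python
# Two callbacks standing in for the two models: the continuous model
# fires early turn-taking cues from prosody while the user still speaks;
# the lexical model commits once the (slower) ASR hypothesis arrives.
def on_audio_frame(prosodic_eot_prob, cue_threshold=0.7):
    if prosodic_eot_prob > cue_threshold:
        return ["avert_gaze", "utter_filler"]   # pre-commit turn-taking cues
    return []

def on_asr_result(text, lexical_take_turn):
    return "take_turn" if lexical_take_turn(text) else "keep_listening"

print(on_audio_frame(0.85))                     # -> ['avert_gaze', 'utter_filler']
print(on_asr_result("and that was my trip",
                    lexical_take_turn=lambda t: t.endswith("trip")))  # take_turn
```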
Conference Paper
The task of identifying when to take a conversational turn is an important function of spoken dialogue systems. The turn-taking system should also ideally be able to handle many types of dialogue, from structured conversation to spontaneous and unstructured discourse. Our goal is to determine how much a generalized model trained on many types of dialogue scenarios would improve on a model trained only for a specific scenario. To achieve this goal we created a large corpus of Wizard-of-Oz conversation data which consisted of several different types of dialogue sessions, and then compared a generalized model with scenario-specific models. For our evaluation we go further than simply reporting conventional metrics, which we show are not informative enough to evaluate turn-taking in a real-time system. Instead, we process results using a performance curve of latency and false cut-in rate, and further improve our model's real-time performance using a finite-state turn-taking machine. Our results show that the generalized model greatly outperformed the individual model for attentive listening scenarios but was worse in job interview scenarios. This implies that a model based on a large corpus is better suited to conversation which is more user-initiated and unstructured. We also propose that our method of evaluation leads to more informative performance metrics in a real-time system.
Conference Paper
We propose Intimate Touch Conversation (ITC) as a new remote communication paradigm in which an individual, who is holding a telepresence humanoid, engages in a conversation-over-distance with a remote partner that is teleoperating the humanoid. We investigate the effect of this new communication paradigm on interpersonal closeness in comparison with in-person and video-chat. Our results suggest that ITC significantly enhances the feeling of interpersonal closeness, as opposed to video-chat and in-person. In addition, they show that ITC allows elderly people to find their conversation more interesting. These results imply that feeling of intimate touch that is evoked by the presence of teleoperated android enables elderly users to establish a closer relationship with their conversational partners over distance, thereby reducing their feeling of loneliness. Our findings benefit researchers and engineers in elderly care facilities in search of effective means of establishing a social relation with their elderly users to reduce their feeling of social isolation and loneliness.
Chapter
We present a dialogue system for a conversational robot, Erica. Our goal is for Erica to engage in more human-like conversation, rather than being a simple question-answering robot. Our dialogue manager integrates question-answering with a statement response component which generates dialogue by asking about focused words detected in the user’s utterance, and a proactive initiator which generates dialogue based on events detected by Erica. We evaluate the statement response component and find that it produces coherent responses to a majority of user utterances taken from a human-machine dialogue corpus. An initial study with real users also shows that it reduces the number of fallback utterances by half. Our system is beneficial for producing mixed-initiative conversation.
Article
Back-channel feedback, responses such as uh-uh from a listener, is a pervasive feature of conversation. It has long been thought that the production of back-channel feedback depends to a large extent on the actions of the other conversation partner, not just on the volition of the one who produces them. In particular, prosodic cues from the speaker have long been thought to play a role, but have so far eluded identification. We have earlier suggested that an important prosodic cue involved, in both English and Japanese, is a region of low pitch late in an utterance (Ward, 1996). This paper discusses issues in the definition of back-channel feedback, presents evidence for our claim, surveys other factors which elicit or inhibit back-channel responses, and mentions a few related phenomena and theoretical issues.
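Ward's low-pitch cue can be approximated with a trailing-frame check; the threshold and durations below are illustrative rather than the values reported in the paper:

```python
# Flag a backchannel opportunity when pitch stays in the speaker's low
# range for a minimum duration at the end of the recent track. The low
# threshold would come from the speaker's pitch history (e.g. a low
# percentile); the numbers here are assumptions.
def low_pitch_cue(recent_f0_hz, low_threshold_hz, frame_ms=10, min_low_ms=110):
    trailing = 0
    for f0 in reversed(recent_f0_hz):
        if 0 < f0 < low_threshold_hz:   # voiced and in the low range
            trailing += 1
        else:
            break
    return trailing * frame_ms >= min_low_ms

track = [180, 176, 170, 150, 120, 110, 105, 102, 100, 99, 98, 97, 96, 95, 94, 93]
print(low_pitch_cue(track, low_threshold_hz=112))   # True: >= 110 ms of low pitch
```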
Conference Paper
During face-to-face conversation, people naturally integrate speech, gestures and higher level language interpretations to predict the right time to start talking or to give backchannel feedback. In this paper we introduce a new model called Latent Mixture of Discriminative Experts which addresses some of the key issues with multimodal language processing: (1) temporal synchrony/asynchrony between modalities, (2) micro dynamics and (3) integration of different levels of interpretation. We present an empirical evaluation on listener nonverbal feedback prediction (e.g., head nod), based on observable behaviors of the speaker. We confirm the importance of combining four types of multimodal features: lexical, syntactic structure, eye gaze, and prosody. We show that our Latent Mixture of Discriminative Experts model outperforms previous approaches based on Conditional Random Fields (CRFs) and Latent-Dynamic CRFs.
Conference Paper
This paper introduces the Finite-State Turn-Taking Machine (FSTTM), a new model to control the turn-taking behavior of conversational agents. Based on a non-deterministic finite-state machine, the FSTTM uses a cost matrix and decision-theoretic principles to select a turn-taking action at any time. We show how the model can be applied to the problem of end-of-turn detection. Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches.
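The cost-matrix selection at the heart of the FSTTM can be illustrated in a few lines; the states, actions, and cost values below are simplified assumptions:

```python
# Decision-theoretic sketch in the spirit of the FSTTM: pick the
# turn-taking action with the lowest expected cost given the estimated
# probability that the user has finished speaking.
def select_action(p_user_finished, costs):
    # costs[action] = (cost if user finished, cost if user still speaking)
    def expected(action):
        c_done, c_not_done = costs[action]
        return p_user_finished * c_done + (1 - p_user_finished) * c_not_done
    return min(costs, key=expected)

costs = {
    "take_turn": (0.0, 10.0),   # cutting the user off is expensive
    "wait":      (1.0, 0.0),    # waiting costs latency if the user is done
}
print(select_action(0.95, costs))   # -> "take_turn"
print(select_action(0.30, costs))   # -> "wait"
```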
A job interview dialogue system with autonomous android ERICA
  • Koji Inoue
  • Kohei Hara
  • Divesh Lala
  • Shizuka Nakamura
  • Katsuya Takanashi
  • Tatsuya Kawahara
Koji Inoue, Kohei Hara, Divesh Lala, Shizuka Nakamura, Katsuya Takanashi, and Tatsuya Kawahara. 2019. A job interview dialogue system with autonomous android ERICA. In International Workshop on Spoken Dialog System Technology (IWSDS).
Spoken dialogue system for a human-like conversational robot ERICA
  • Tatsuya Kawahara
Tatsuya Kawahara. 2019. Spoken dialogue system for a human-like conversational robot ERICA. In International Workshop on Spoken Dialog System Technology (IWSDS).
Prediction and generation of backchannel form for attentive listening systems
  • Tatsuya Kawahara
  • Takashi Yamaguchi
  • Koji Inoue
  • Katsuya Takanashi
  • Nigel G Ward
Tatsuya Kawahara, Takashi Yamaguchi, Koji Inoue, Katsuya Takanashi, and Nigel G. Ward. 2016. Prediction and generation of backchannel form for attentive listening systems. In INTERSPEECH, pages 2890-2894.
Detection of social signals for recognizing engagement in human-robot interaction
  • Divesh Lala
  • Koji Inoue
  • Pierrick Milhorat
  • Tatsuya Kawahara
Divesh Lala, Koji Inoue, Pierrick Milhorat, and Tatsuya Kawahara. 2017a. Detection of social signals for recognizing engagement in human-robot interaction. In AAAI Fall Symposium on Natural Communication for Human-Robot Collaboration.
Building autonomous sensitive artificial listeners
  • Marc Schröder
  • Elisabetta Bevacqua
  • Roddy Cowie
  • Florian Eyben
  • Hatice Gunes
  • Dirk Heylen
  • Mark ter Maat
  • Gary McKeown
  • Sathish Pammi
  • Maja Pantic
  • Catherine Pelachaud
  • Björn Schuller
  • Etienne de Sevin
  • Michel Valstar
  • Martin Wöllmer
Marc Schröder, Elisabetta Bevacqua, Roddy Cowie, Florian Eyben, Hatice Gunes, Dirk Heylen, Mark ter Maat, Gary McKeown, Sathish Pammi, Maja Pantic, Catherine Pelachaud, Björn Schuller, Etienne de Sevin, Michel Valstar, and Martin Wöllmer. 2012. Building autonomous sensitive artificial listeners. IEEE Transactions on Affective Computing, 3(2):165-183.