Book

Dialogues with Social Robots: Enablements, Analyses, and Evaluation

Editors:
  • Kristiina Jokinen (AIST Tokyo Waterfront)
  • Graham Wilcock

Abstract

This book explores novel aspects of social robotics, spoken dialogue systems, human-robot interaction, spoken language understanding, multimodal communication, and system evaluation. It offers a variety of perspectives on and solutions to the most important questions about advanced techniques for social robots and chat systems. Chapters by leading researchers address key research and development topics in the field of spoken dialogue systems, focusing in particular on three special themes: dialogue state tracking, evaluation of human-robot dialogue in social robotics, and socio-cognitive language processing. The book offers a valuable resource for researchers and practitioners in both academia and industry whose work involves advanced interaction technology and who are seeking an up-to-date overview of the key topics. It also provides supplementary educational material for courses on state-of-the-art dialogue system technologies, social robotics, and related research fields.

Chapters (40)

The DigiSami project operates in the general context of revitalisation of endangered languages, and focuses on the digital visibility and viability of the North Sami language in particular. The goal is to develop technological tools and resources that can be used for speech and language processing and for experimenting with interactive applications. Here we propose an interactive talking robot application as a means to reach these goals, and present preliminary analyses of a spoken North Sami conversational corpus as a starting point for supporting interaction studies. These are first steps in the development of SamiTalk, a Sami-speaking robot application which will allow North Sami speakers to interact with digital resources using their own language. The on-going work addresses the challenges of the digital revolution by showing that multilingual technology can be applied to small languages with promising results.
The article describes a comparative study of text preprocessing techniques for natural language call routing. Seven different unsupervised and supervised term weighting methods were considered. Four different dimensionality reduction methods were applied: stop-word filtering with stemming, feature selection based on term weights, feature transformation based on term clustering, and a novel feature transformation method based on terms belonging to classes. As classification algorithms we used k-NN and the SVM-based algorithm Fast Large Margin. The numerical experiments showed that the most effective term weighting method is Term Relevance Ratio (TRR). Feature transformation based on term clustering is able to significantly decrease dimensionality without significantly changing the classification effectiveness, unlike the other dimensionality reduction methods. The novel feature transformation method reduces the dimensionality radically: the number of features is equal to the number of classes.
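A minimal sketch of such a class-based feature transformation (not taken from the chapter; the per-class term weights, e.g. from a supervised scheme such as TRR, are assumed to be computed elsewhere):

```python
# Hedged sketch: class-based feature transformation for call routing.
# Assumption: a per-class weight vector W[t] for each term t is already available;
# each utterance is then mapped to one feature per routing class.
import numpy as np

def class_features(doc_tokens, term_class_weights, n_classes):
    """Map a tokenised utterance to an n_classes-dimensional vector by
    summing the per-class weights of the terms it contains."""
    x = np.zeros(n_classes)
    for t in doc_tokens:
        if t in term_class_weights:        # unknown terms contribute nothing
            x += term_class_weights[t]     # vector of length n_classes
    return x

# toy usage with two routing classes
W = {"refund": np.array([0.9, 0.1]),
     "invoice": np.array([0.7, 0.2]),
     "password": np.array([0.05, 0.95])}
print(class_features(["refund", "password"], W, 2))   # -> [0.95, 1.05]
```

Each utterance is thus represented by exactly as many features as there are routing classes.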
User satisfaction is often considered as the objective that should be achieved by spoken dialogue systems. This is why the reward function of Spoken Dialogue Systems (SDS) trained by Reinforcement Learning (RL) is often designed to reflect user satisfaction. To do so, the state space representation should be based on features capturing user satisfaction characteristics, such as the mean speech recognition confidence score. On the other hand, for deployment in industrial systems there is a need for state representations that are understandable by system engineers. In this article, we propose to represent the state space using a Genetic Sparse Distributed Memory. This is a state aggregation method computing state prototypes which are selected so as to lead to the best linear representation of the value function in RL. To do so, previous work on Genetic Sparse Distributed Memory for classification is adapted to the Reinforcement Learning task and a new way of building the prototypes is proposed. The approach is tested on a corpus of dialogues collected with an appointment scheduling system. The results are compared to a grid-based linear parametrisation. It is shown that learning is accelerated and made more memory efficient. It is also shown that the framework is scalable in that it is possible to include many dialogue features in the representation, interpret the resulting policy and identify the most important dialogue features.
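A hedged sketch of prototype-based state aggregation with a linear value function, in the spirit of a sparse distributed memory; the genetic selection of prototypes is not shown, and the prototypes and feature names below are placeholders:

```python
# Hedged sketch: a state activates its nearby prototypes, and Q(s, a) is linear
# in those activations. Prototype selection by a genetic algorithm is omitted.
import numpy as np

def activations(state, prototypes, radius=1.0):
    """Binary activation of each prototype whose distance to the state is
    within the given radius (the sparse feature vector phi(s))."""
    d = np.linalg.norm(prototypes - state, axis=1)
    return (d <= radius).astype(float)

def q_value(state, action, prototypes, theta):
    """Linear value estimate Q(s, a) = theta[a] . phi(s)."""
    return theta[action] @ activations(state, prototypes)

# toy usage: 3 prototypes over 2 dialogue features (e.g. mean ASR confidence,
# dialogue length) and 2 actions
prototypes = np.array([[0.2, 0.1], [0.5, 0.5], [0.9, 0.8]])
theta = np.zeros((2, len(prototypes)))
print(q_value(np.array([0.45, 0.55]), action=0, prototypes=prototypes, theta=theta))
```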
This chapter introduces a simulator for incremental human-machine dialogue in order to generate artificial dialogue datasets that can be used to train and test data-driven methods. We review the various simulator components in detail, including an unstable speech recognizer, and their differences with non-incremental approaches. Then, as an illustration of its capacities, an incremental strategy based on hand-crafted rules is implemented and compared to several non-incremental baselines. Their performances in terms of dialogue efficiency are presented under different noise conditions and prove that the simulator is able to handle several configurations which are representative of real usages.
While example-based dialog is a popular option for the construction of dialog systems, creating example bases for a specific task or domain requires significant human effort. To reduce this human effort, in this paper we propose an active learning framework to construct example-based dialog systems efficiently. Specifically, we propose two uncertainty sampling strategies for selecting inputs to present to human annotators, who create system responses for the selected inputs. We compare the performance of these proposed strategies with a random selection strategy in a simulation-based evaluation on six different domains. The evaluation results show that the proposed strategies are good alternatives to random selection in domains where the complexity of system utterances is low.
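A hedged sketch of one possible uncertainty sampling strategy for an example-based system: hand to the annotators the user inputs whose best match in the current example base is weakest. The similarity measure and selection size are illustrative assumptions, not the paper's exact strategies:

```python
# Hedged sketch of uncertainty sampling for an example-based dialogue system.
def select_for_annotation(unlabeled_inputs, example_base, similarity, k=10):
    """Return the k inputs with the lowest maximum similarity to any example."""
    def confidence(utterance):
        return max((similarity(utterance, ex) for ex in example_base), default=0.0)
    return sorted(unlabeled_inputs, key=confidence)[:k]

# usage: a trivial token-overlap similarity as a stand-in
def overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

examples = ["what time do you open", "do you have vegetarian dishes"]
pool = ["when do you close", "can I bring my dog", "is there parking nearby"]
print(select_for_annotation(pool, examples, overlap, k=2))
```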
We have been developing a method for a dialogue system to acquire information (e.g., cuisines of unknown restaurants) through dialogue by asking the users questions. It is important that the questions are concise and concrete to avoid annoying users. Our method selects the most appropriate question on the basis of expected utility calculated for four types of question: Yes/No, alternative, 3-choice, and Wh- questions. We define utility values for the four types and also derive the probability representing how likely each question is to contain a correct cuisine. The expected utility is then calculated as the sum of their products. We empirically compare several ways to integrate two previously proposed basic confidence measures (CMs) when deriving the probability for each question. We also examine the appropriateness of the utility values through questionnaires administered to 15 participants.
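A hedged sketch of expected-utility question selection under a simplified model: each question type has a utility when it succeeds (contains the correct cuisine) and a smaller utility when it fails, and the type with the highest expected utility is asked. The numbers and the failure term are illustrative, not the chapter's values:

```python
# Hedged sketch of expected-utility question selection.
def expected_utility(p_correct, u_success, u_failure=0.0):
    return p_correct * u_success + (1.0 - p_correct) * u_failure

def select_question(candidates):
    """candidates: {question_type: (p_correct, u_success)}"""
    return max(candidates, key=lambda q: expected_utility(*candidates[q]))

# usage: probabilities derived from confidence measures over cuisine hypotheses
candidates = {
    "yes/no":      (0.55, 1.0),   # "Is it an Italian restaurant?"
    "alternative": (0.70, 0.8),   # "Is it Italian or French?"
    "3-choice":    (0.80, 0.6),
    "wh":          (1.00, 0.3),   # "What cuisine does it serve?" (always answerable, less concise)
}
print(select_question(candidates))
```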
Numerous toolkits are available for developing speech-based dialogue systems. We survey a range of currently available toolkits, highlighting the different facilities provided by each. Most of these toolkits include not only a method for representing states and actions, but also a mechanism for reasoning about and selecting the actions, often combined with a technical framework designed to simplify the task of creating end-to-end systems. This near-universal tight coupling of representation, reasoning, and implementation in a single toolkit makes it difficult both to compare different approaches to dialogue system design, as well as to analyse the properties of individual techniques. We contrast this situation with the state of the art in a related research area—automated planning—where a set of common representations have been defined and are widely used to enable direct comparison of different reasoning approaches. We argue that adopting a similar separation would greatly benefit the dialogue research community.
This article presents SimpleDS, a simple and publicly available dialogue system trained with deep reinforcement learning. In contrast to previous reinforcement learning dialogue systems, this system avoids manual feature engineering by performing action selection directly from raw text of the last system and (noisy) user responses. Our initial results, in the restaurant domain, report that it is indeed possible to induce reasonable behaviours with such an approach that aims for higher levels of automation in dialogue control for intelligent interactive systems and robots.
It is difficult to determine the cause of a breakdown in a dialogue when using a chat-oriented dialogue system due to a wide variety of possible causes. To address this problem, we analyzed a chat dialogue corpus and formulated a taxonomy of the errors that could lead to dialogue breakdowns. The experimental results demonstrated the effectiveness of the taxonomy to some degree. We also developed a breakdown detector that comprises combinations of classifiers for different causes of errors based on the taxonomy.
Mixed-initiative assistants are systems that support humans' decision-making and problem-solving capabilities in a collaborative manner. Such systems have to integrate various artificial intelligence capabilities, such as knowledge representation, problem solving and planning, learning, discourse and dialog, and human-computer interaction. These systems aim at solving a given problem autonomously for the user, yet involve the user in the planning process for collaborative decision-making, e.g. to respect user preferences. However, how the user is involved in the planning can be framed in various ways, using different involvement strategies that vary, e.g., in their degree of user freedom. Hence, we present results of a study examining the effects of different user involvement strategies on the user experience in a mixed-initiative system.
With the development of Interactive Voice Response (IVR) systems, people can not only operate computer systems through task-oriented conversation but also enjoy non-task-oriented conversation with the computer. When an IVR system generates a response, it usually refers only to the verbal content of the user’s utterance. However, when a person gloomily says “I’m fine,” people will respond not by saying “That’s wonderful” but “Really?” or “Are you OK?” because we consider both verbal and non-verbal information such as tone of voice, facial expressions, gestures, and so on. In this article, we propose an intelligent IVR system that considers not only verbal but also non-verbal information. To estimate a speaker’s emotion (positive, negative, or neutral), 384 acoustic features extracted from the speaker’s utterance are used to train a machine learning classifier (SVM). Artificial Intelligence Markup Language (AIML)-based response generation rules are extended to take the speaker’s emotion into account. In our experiment, subjects felt that the proposed dialog system was more likable and enjoyable and did not give machine-like reactions.
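A hedged sketch of the overall idea: classify the speaker's emotion from acoustic features with an SVM and let the response rule depend on both the recognised text and the predicted emotion. The feature extraction (the 384-dimensional set) and the AIML rule base are assumed to exist elsewhere; all names below are illustrative:

```python
# Hedged sketch: emotion-aware response selection on top of an SVM classifier.
from sklearn.svm import SVC

def train_emotion_classifier(X_train, y_train):
    """X_train: (n_samples, 384) acoustic features; y_train: pos/neg/neutral labels."""
    clf = SVC(kernel="linear")
    clf.fit(X_train, y_train)
    return clf

def respond(text, acoustic_features, clf, rules):
    """rules: {(pattern, emotion): response}; falls back to a text-only rule."""
    emotion = clf.predict([acoustic_features])[0]
    return rules.get((text, emotion)) or rules.get((text, None), "I see.")

# usage: a gloomy "I'm fine" should not trigger the cheerful default
rules = {("i'm fine", "negative"): "Really? Are you OK?",
         ("i'm fine", None): "That's wonderful!"}
```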
While approaches to automatic recognition of human emotion from speech have already achieved reasonable results, considerable room for improvement remains. In our research, we select the most essential features by applying a self-adaptive multi-objective genetic algorithm. The proposed approach is evaluated using data from different languages (English and German) with two different feature sets consisting of 37 and 384 dimensions, respectively. The developed technique increases emotion recognition performance by up to 49.8 % relative improvement in accuracy. Furthermore, in order to identify salient features across speech data from different languages, we analysed the selection counts of the features to generate a feature ranking. Based on this, a feature set for speech-based emotion recognition comprising the most salient features has been created. By applying this feature set, we achieve a relative improvement of up to 37.3 % without the need for time-consuming feature selection using a genetic algorithm.
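A hedged sketch of the ranking step only: count how often each acoustic feature was selected by the genetic algorithm across runs or corpora and keep the most frequently selected ones as a reusable feature set. The GA itself is not shown; the selection masks are assumed to come from earlier runs:

```python
# Hedged sketch: rank features by how often the GA selected them.
import numpy as np

def rank_features(selection_masks):
    """selection_masks: list of boolean arrays (one per GA run), each of
    length n_features. Returns feature indices sorted by selection count."""
    counts = np.sum(selection_masks, axis=0)
    return np.argsort(counts)[::-1], counts

def salient_feature_set(selection_masks, top_k):
    order, _ = rank_features(selection_masks)
    return set(order[:top_k].tolist())

# usage with three toy runs over 6 features
masks = [np.array([1, 0, 1, 1, 0, 0]),
         np.array([1, 1, 1, 0, 0, 0]),
         np.array([1, 0, 1, 1, 0, 1])]
print(salient_feature_set(masks, top_k=3))
```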
A frequent difficulty faced by developers of dialog systems is the absence of a corpus of conversations with which to model the dialog statistically. Even when such a corpus is available, neither an agenda-based nor a statistically-based dialog control logic is an option if the domain knowledge is broad. This article presents a module that automatically generates system-turn utterances to guide the user through the dialog. These system turns are not established beforehand, and vary with each dialog. In particular, the task defined in this paper is the automation of a call-routing service. The proposed module is used when the user has not given enough information to route the call with high confidence. Using the generated system turns, the information obtained is refined over the course of the dialog. The article focuses on the development and operation of this module, which is valid for both agenda-based and statistical approaches and is applicable to both types of corpora.
We develop a question-answering system for questions that ask about a conversational agent’s personality, based on large-scale question-answer pairs created by hand. In casual dialogues, the speaker sometimes asks his conversation partner questions about favorites or experiences. Since this behavior also appears in conversational dialogues with a dialogue system, systems must be developed to respond to such questions. However, the effectiveness of personality-question answering for conversational agents has not been investigated. Our user-machine chat experiments show that our question-answering system, which estimates appropriate answers with 60.7 % accuracy for the personality questions in our conversation corpus, significantly improves users’ subjective evaluations.
The involvement of affect information in a spoken dialogue system can increase user-friendliness and provide a more natural interaction experience. This can be achieved through speech emotion recognition, where the features are usually dominated by spectral amplitude information while the phase spectrum is ignored. In this chapter, we propose to use phase-based features to build up such an emotion recognition system. To exploit these features, we employ Fisher kernels. This technique encodes the phase-based features by their deviation from a generative Gaussian mixture model. The resulting representation is used to train a classification model with a linear kernel classifier. Experimental results on the GeWEC database, including ‘normal’ and whispered phonation, demonstrate the effectiveness of our method.
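A hedged, simplified sketch of a Fisher-style encoding: fit a GMM to frame-level phase features, encode an utterance by the posterior-weighted deviations of its frames from the component means, and classify with a linear kernel. The full Fisher kernel normalisation is omitted and all dimensions are illustrative:

```python
# Hedged sketch: simplified Fisher-vector encoding with a linear classifier.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_encode(frames, gmm):
    """frames: (n_frames, dim) phase-based features for one utterance."""
    post = gmm.predict_proba(frames)                      # (n_frames, K)
    enc = []
    for k in range(gmm.n_components):
        diff = (frames - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        enc.append((post[:, [k]] * diff).mean(axis=0))    # deviation from mean k
    return np.concatenate(enc)

def train(utterances, labels, n_components=8):
    gmm = GaussianMixture(n_components, covariance_type="diag")
    gmm.fit(np.vstack(utterances))                        # background model
    X = np.array([fisher_encode(u, gmm) for u in utterances])
    clf = LinearSVC().fit(X, labels)
    return gmm, clf
```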
In modern software development, localisation is a straightforward process, assuming internationalisation has been considered during development. The localisation of spoken dialogue systems is less mature, possibly because they differ from common software in that interaction with them is situated and uses multiple modalities. We claim that it is possible to apply software internationalisation practices to spoken dialogue systems and that this helps the rapid localisation of such systems to new languages. We internationalised and localised the WikiTalk spoken dialogue system. During the process we identified needs relevant to spoken dialogue systems that will benefit from further research and engineering efforts.
vAssist (Voice Controlled Assistive Care and Communication Services for the Home) is a European project for which several research institutes and companies have been working on the development of adapted spoken interfaces to support home care and communication services. This paper describes the spoken dialog system that has been built. Its natural language understanding module includes a novel reference resolver and it introduces a new hierarchical paradigm to model dialog tasks. The user-centered approach applied to the whole development process led to the setup of several experiment sessions with real users. Multilingual experiments carried out in Austria, France and Spain are described along with their analyses and results in terms of both system performance and user experience. An additional experimental comparison of the RavenClaw and Disco-LFF dialog managers built into the vAssist spoken dialog system highlighted similar performance and user acceptance.
We present an open-source web-based multimodal dialog framework, “Multimodal HALEF”, that integrates video conferencing and telephony abilities into the existing HALEF cloud-based dialog framework via the FreeSWITCH video telephony server. Due to its distributed and cloud-based architecture, Multimodal HALEF allows researchers to collect video and speech data from participants interacting with the dialog system outside of traditional lab settings, thereby largely reducing the cost and labor incurred in the traditional audio-visual data collection process. The framework is equipped with a set of tools, including a web-based user survey template, a portal for speech transcription, annotation and rating, a web-based visual processing server that performs head tracking, and a database that logs full-call audio and video recordings as well as other call-specific information. We present observations from an initial data collection based on a job interview application. Finally, we report on some future plans for development of the framework.
This article describes the data collection system and methods adopted in the METALOGUE (Multiperspective Multimodal Dialogue System with Metacognitive Abilities) project. The ultimate goal of the METALOGUE project is to develop a multimodal dialogue system with abilities to deliver instructional advice by interacting with humans in a natural way. The data we are collecting will facilitate the development of a dialogue system which will exploit metacognitive reasoning in order to deliver feedback on the user’s performance in debates and negotiations. The initial data collection scenario consists of debates where two students are exchanging views and arguments on a social issue, such as a proposed ban on smoking in public areas, and delivering their presentations in front of an audience. Approximately 3 hours of data has been recorded to date, and all recorded streams have been precisely synchronized and pre-processed for statistical learning. The data consists of audio, video and 3-dimensional skeletal movement information of the participants. This data will be used in the development of cognitive dialogue and discourse models to underpin educational interventions in public speaking training.
People usually interact with intelligent agents (IAs) when they have certain goals to be accomplished. Sometimes these goals are complex and may require interacting with multiple applications, which may focus on different domains. Current IAs may be of limited use in such cases, and the user needs to directly manage the task at hand. An ideal personal agent would be able to learn, over time, these tasks spanning different resources. In this article, we address the problem of cross-domain task assistance in the context of spoken dialog systems, and describe our approach to discovering such tasks and to how IAs learn to talk to users about the task being carried out. Specifically, we investigate how to learn user activity patterns in a smartphone environment that span multiple apps and how to incorporate users’ descriptions of their high-level intents into human-agent interaction.
This article describes a text-based, open-domain dialog system developed especially for social, smalltalk-like conversations. While much current research focuses on goal-oriented dialog, in human-to-human communication many dialogs do not have a predefined goal. In order to achieve similar communication with a computer, we propose a framework which is easily extensible by combining different response patterns. The individual components are trained on web-crawled data. Using a data-driven approach, we are able to generate a large variety of answers to diverse user inputs.
Recent years have seen significant market penetration for voice-based personal assistants such as Apple’s Siri. However, despite this success, user take-up is frustratingly low. This article argues that there is a habitability gap caused by the inevitable mismatch between the capabilities and expectations of human users and the features and benefits provided by contemporary technology. Suggestions are made as to how such problems might be mitigated, but a more worrisome question emerges: “is spoken language all-or-nothing”? The answer, based on contemporary views on the special nature of (spoken) language, is that there may indeed be a fundamental limit to the interaction that can take place between mismatched interlocutors (such as humans and machines). However, it is concluded that interactions between native and non-native speakers, or between adults and children, or even between humans and dogs, might provide critical inspiration for the design of future speech-based human-machine interaction.
This article addresses the issue of evaluating Human-Robot spoken interactions in a social context by considering the engagement of the human participant. We regard Communication Accommodation Theory (CAT) as a promising paradigm for considering engagement, because of its study of the macro- and micro-contexts influencing the behaviour of dialogue participants (DPs), and because of its effort to depict the accommodation process underlying the behaviour of DPs. We draw links between the accommodation process described in this theory and human engagement that could be fruitfully used to evaluate Human-Robot social interactions. Our goal is to combine a model of dialogue activities, which provides a convenient local interpretation context for assessing human contributions (involving verbal and nonverbal channels), with CAT in order to assess Human-Robot social interaction.
In the past 10 years, very few published studies have included some kind of extrinsic evaluation of an NLG component in an end-to-end system, be it for phone- or mobile-based dialogues or social robotic interaction. This may be attributed to the fact that these types of evaluations are very costly to set up and run for a single component. The question therefore arises whether there is anything to be gained over and above intrinsic quality measures obtained in off-line experiments. In this article, we describe a case study of evaluating two variants of an NLG surface realiser and show that there are significant differences in both extrinsic and intrinsic measures. These differences can be used to inform further iterations of component and system development.
It is becoming increasingly clear that social and interactive skills are necessary requirements in many application areas where robots interact, communicate and collaborate with humans or other connected devices. The social aspects of human-computer interaction and the connection between humans and robots have recently received considerable attention in the fields of artificial intelligence, cognitive science, healthcare, companion technologies, and industrial/commercial robotics. This article addresses some dimensions of near-term future HRI, with a focus on engagement detection for conversational efficiency. We present some findings from HRI research conducted at the Speech Communication Lab at Trinity College Dublin, report our experiences with a publicly exhibited conversational robot, and discuss some future research trends.
Service robots such as vacuum-cleaning robots have already entered our homes. But in the future there will also be robots in public spaces. These robots may often reach their system limitations while performing their day-to-day work. To address this issue we suggest asking passersby for help. We enhanced an iRobot Roomba vacuum cleaning robot to set up a low-budget Wizard-of-Oz (WOZ) evaluation platform designed to investigate human-robot interaction (HRI). Furthermore, we suggest how HRI can be investigated in public spaces with a robot in need. An early evaluation shows that our prototype is a promising approach to explore how robots can cope with their limitations by asking somebody for help.
Service Robotics is finding solutions to enable effective interaction with users. Among the several issues, the need to adapt robots to the way humans usually communicate is becoming a key and challenging task. In this context the design of robots that understand and reply in Natural Language plays a central role, especially when interactions involve untrained users. This is even more pronounced in the framework of Symbiotic Autonomy, where an interaction is always required for the robot to accomplish a given task. In this article, we propose a framework to model dialogues with robotic platforms, enabling effective and natural dialogic interactions. The framework relies on well-known theories as well as on perceptually informed spoken language understanding processors, giving rise to interactions that are tightly bound to the operating scenario.
We describe our work towards developing SamiTalk, a robot application for the North Sami language. With SamiTalk, users will hold spoken dialogues with a humanoid robot that speaks and recognizes North Sami. The robot will access information from the Sami Wikipedia, talk about requested topics using the Wikipedia texts, and make smooth topic shifts to related topics using the Wikipedia links. SamiTalk will be based on the existing WikiTalk system for Wikipedia-based spoken dialogues, with newly developed speech components for North Sami.
We propose a novel utterance selection method for chat-oriented dialogue systems. Many chat-oriented dialogue systems have huge databases of candidate utterances for utterance generation. However, many of these systems have a critical issue in that they select utterances that are inappropriate to the past conversation due to a limitation in contextual understanding. We solve this problem with our proposed method, which uses a discourse relation to the last utterance when selecting an utterance from candidate utterances. We aim to improve the performance of system utterance selection by preferentially selecting an utterance that has a discourse relation to the last utterance. Experimental results with human subjects showed that our proposed method was more effective than previous utterance selection methods.
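A hedged sketch of the selection rule: candidates that hold a discourse relation to the last utterance are preferred over candidates that are merely relevant. The base relevance scorer and the discourse-relation detector are assumed to be existing components; the bonus weight is illustrative:

```python
# Hedged sketch: prefer candidates with a discourse relation to the last utterance.
def select_utterance(last_utterance, candidates, base_score,
                     has_discourse_relation, relation_bonus=0.5):
    """candidates: list of candidate system utterances.
    base_score(cand) -> float, has_discourse_relation(last, cand) -> bool."""
    def score(cand):
        s = base_score(cand)
        if has_discourse_relation(last_utterance, cand):
            s += relation_bonus      # preferentially select related candidates
        return s
    return max(candidates, key=score)
```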
Many different approaches for estimating the Interaction Quality (IQ) of Spoken Dialogue Systems have been investigated. While dialogues clearly have a sequential nature, statistical classification approaches designed for sequential problems do not seem to work better on automatic IQ estimation than static approaches, i.e., approaches regarding each turn as being independent of the corresponding dialogue. Hence, we analyse this effect by investigating the subset of temporal features used as input for statistical classification of IQ. We extend the set of temporal features to contain both the system and the user view. We determine the contribution of each feature sub-group, showing that temporal features contribute most to the classification performance. Furthermore, for the feature sub-group modeling the temporal effects with a window, we modify the window size, increasing the overall performance significantly by +15.69 % and achieving an Unweighted Average Recall of 0.562.
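A hedged sketch of window-based temporal features: for each turn, aggregate selected turn-level parameters over the last `window` turns. The parameter names are examples, not the exact feature set used in the chapter:

```python
# Hedged sketch: windowed aggregation of turn-level parameters for IQ estimation.
import numpy as np

def windowed_features(turns, window=5, keys=("asr_confidence", "asr_rejected")):
    """turns: list of dicts with per-turn parameters. Returns one feature
    vector per turn containing the windowed means of the chosen parameters."""
    feats = []
    for i in range(len(turns)):
        recent = turns[max(0, i - window + 1): i + 1]
        feats.append([np.mean([t[k] for t in recent]) for k in keys])
    return np.array(feats)

# usage: enlarging `window` changes how much dialogue history each turn sees
dialogue = [{"asr_confidence": 0.9, "asr_rejected": 0},
            {"asr_confidence": 0.4, "asr_rejected": 1},
            {"asr_confidence": 0.7, "asr_rejected": 0}]
print(windowed_features(dialogue, window=2))
```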
Getting a good estimation of the Interaction Quality (IQ) of a spoken dialogue helps to increase user satisfaction, as the dialogue strategy may be adapted accordingly. Therefore, some research has already been conducted on automatically estimating the Interaction Quality. This article adds to this by describing how Recurrent Neural Networks (RNNs) may be used to estimate the Interaction Quality for each dialogue turn and by evaluating their performance on this task. We show that RNNs may outperform non-recurrent neural networks.
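A hedged sketch of a recurrent per-turn IQ estimator: a GRU reads the sequence of turn-level feature vectors and predicts an IQ class at every turn. The framework choice, dimensions, and the absence of a training loop are illustrative assumptions, not details from the article:

```python
# Hedged sketch: a GRU that outputs an IQ prediction for every dialogue turn.
import torch
import torch.nn as nn

class IQEstimator(nn.Module):
    def __init__(self, n_features, hidden=64, n_classes=5):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, turns):              # turns: (batch, n_turns, n_features)
        h, _ = self.rnn(turns)             # hidden state at every turn
        return self.out(h)                 # per-turn class scores

# usage: one dialogue of 12 turns with 20 features per turn
model = IQEstimator(n_features=20)
scores = model(torch.randn(1, 12, 20))    # (1, 12, 5); argmax gives the IQ class per turn
```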
Chat functionality is currently regarded as an important factor in spoken dialogue systems. In this article, we explore the architecture of a chat-oriented dialogue system that can continue a long conversation with users and can be used for a long time. To achieve this goal, we propose a method which combines various types of response generation modules, such as a statistical model-based module, a rule-based module and a topic transition-oriented module. The core of this architecture is a method for selecting the most appropriate response based on a breakdown index and a development index. The experimental results show that the weighted sum of these indexes can be used for evaluating system utterances from the viewpoint of continuous dialogue and long-term usage.
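A hedged sketch of the selection step described above: each response-generation module proposes a candidate, and the candidate with the best weighted combination of a (low) breakdown index and a (high) development index is chosen. The two index estimators and the weights are placeholders for the article's learned components:

```python
# Hedged sketch: weighted-sum response selection across generation modules.
def select_response(candidates, breakdown_index, development_index,
                    w_breakdown=0.6, w_development=0.4):
    """candidates: list of (module_name, utterance) pairs."""
    def score(item):
        _, utt = item
        return (-w_breakdown * breakdown_index(utt)
                + w_development * development_index(utt))
    return max(candidates, key=score)

# usage: candidates from a statistical, a rule-based and a topic-transition module
candidates = [("statistical", "That sounds fun!"),
              ("rule", "I like movies too."),
              ("topic", "Speaking of movies, have you seen anything lately?")]
```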
This article presents the design of a generic negotiation dialogue game between two or more players. The goal is to reach an agreement, each player having his own preferences over a shared set of options. Several simulated users have been implemented. An MDP policy has been optimised individually with Fitted Q-Iteration for several user instances. Then, the learnt policies have been cross-evaluated with other users. Results show a strong disparity in inter-user performance. This illustrates the importance of user adaptation in negotiation-based dialogue systems.
This research tried to estimate the user’s willingness to talk about the topic provided by the dialog system. Dialog management based on the user’s willingness is assumed to improve the satisfaction the user gets from the dialog with the system. We collected interview dialogs between humans to analyze features for the estimation, and a statistical test found significant differences in the statistics of F0 and power of the speech, and in the degree of facial movements. We conducted discrimination experiments using multimodal features with an SVM, and obtained the best result when we used the audio-visual information. We obtained a discrimination ratio of 80.4 % under the leave-one-out condition and 77.1 % under the subject-open condition.
Automatic speech recognition (ASR) is not only becoming increasingly accurate, but also increasingly adapted for producing timely, incremental output. However, overall accuracy and timeliness alone are insufficient when it comes to interactive dialogue systems, which require stability in the output and responsivity to the utterance as it is unfolding. Furthermore, for a dialogue system to deal with phenomena such as disfluencies and to achieve deep understanding of user utterances, these phenomena should be preserved or marked up for use by downstream components, such as language understanding, rather than be filtered out. Similarly, word timing can be informative for analyzing deictic expressions in a situated environment and should be available for analysis. Here we investigate the overall accuracy and incremental performance of three widely used systems and discuss their suitability for the aforementioned perspectives. From the differing performance along these measures we provide a picture of the requirements for incremental ASR in dialogue systems and describe freely available tools for using and evaluating incremental ASR.
Dialog state tracking is one of the key sub-tasks of dialog management, which defines the representation of dialog states and updates them at each moment of a given ongoing conversation. To provide a common test bed for this task, three dialog state tracking challenges have been completed. In this fourth challenge, we focused on dialog state tracking on human-human dialogs. The challenge received a total of 24 entries from 7 research groups. Most of the submitted entries outperformed the baseline tracker based on string matching with ontology contents. Moreover, further significant improvements in tracking performance were achieved by combining the results from multiple trackers. In addition to the main task, we also conducted pilot track evaluations for other core components in developing modular dialog systems using the same dataset.
The main task of the fourth Dialog State Tracking Challenge (DSTC4) is to track the dialog state by filling in various slots, each of which represents a major subject discussed in the dialog. In this article we focus on the ‘INFO’ slot that tracks the general information provided in a sub-dialog segment, and propose an approach for this slot-filling using convolutional neural networks (CNNs). Our CNN model is adapted to multi-topic dialog by including a convolutional layer with general and topic-specific filters. The evaluation on DSTC4 common test data shows that our approach outperforms all other submitted entries in terms of overall accuracy of the ‘INFO’ slot.
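A hedged sketch, in the spirit of the described model, of a text CNN that combines general filters shared across topics with topic-specific filters used only for utterances of that topic. All dimensions and the label space are illustrative assumptions:

```python
# Hedged sketch: CNN with general and topic-specific convolutional filters.
import torch
import torch.nn as nn

class TopicCNN(nn.Module):
    def __init__(self, emb_dim=100, n_general=64, n_topic=32, n_topics=5,
                 kernel=3, n_labels=20):
        super().__init__()
        self.general = nn.Conv1d(emb_dim, n_general, kernel, padding=1)
        self.topic = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_topic, kernel, padding=1) for _ in range(n_topics)])
        self.out = nn.Linear(n_general + n_topic, n_labels)

    def forward(self, emb, topic_id):       # emb: (batch, emb_dim, seq_len)
        g = torch.relu(self.general(emb)).max(dim=2).values
        t = torch.relu(self.topic[topic_id](emb)).max(dim=2).values
        return self.out(torch.cat([g, t], dim=1))   # multi-label 'INFO' scores

# usage: a 15-word utterance (as word embeddings) in topic 2
model = TopicCNN()
logits = model(torch.randn(1, 100, 15), topic_id=2)
```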
This article presents our approach for the Dialog State Tracking Challenge 4, which focuses on a dialog state tracking task on human-human dialogs. The system works in a turn-taking manner. A probabilistic enhanced frame structure is maintained to represent the dialog state during the conversation. The utterance of each turn is processed by discriminative classification models to generate a semantic structure similar to the dialog state. Then a rule-based strategy is used to update the dialog state based on the understanding results of the current utterance. We also introduce a slot-based score averaging method to build an ensemble of four trackers. The DSTC4 results indicate that despite the simple feature set, the proposed method is competitive and outperforms the baseline on all evaluation metrics.
The Dialog State Tracking Challenge 4 (DSTC 4) differentiates itself from the previous three editions as follows: the number of slot-value pairs present in the ontology is much larger, no spoken language understanding output is given, and utterances are labeled at the subdialog level. This article describes a novel dialog state tracking method designed to work robustly under these conditions, using elaborate string matching, coreference resolution tailored for dialogs and a few other improvements. The method can correctly identify many values that are not explicitly present in the utterance. On the final evaluation, our method came in first among 7 competing teams and 24 entries. The F1-score achieved by our method was 9 and 7 percentage points higher than that of the runner-up for the utterance-level evaluation and for the subdialog-level evaluation, respectively.
Citations

... Embodied conversational agents, intelligent virtual agents, and social robotics developed for various applications are examples of extensive research combining the virtual world, AI, and communication technologies to create intelligent social agents that could act as companions, counselors, or instructors in real-world applications. Overviews of the research topics, technologies, as well as opportunities and challenges for building applications, are provided by Jokinen and Wilcock (2017) [41] and Wan et al. (2020) [37]. ...
Article
One of the central social challenges of the 21st century is society’s aging. AI provides numerous possibilities for meeting this challenge. In this context, the concept of digital twins, based on Cyber-Physical Systems, offers an exciting prospect. The e-VITA project, in which a virtual coaching system for elderly people is being created, allows this concept to be assessed as a model for development. This white paper collects and presents relevant findings from research areas around digital twin technologies. Furthermore, we address ethical issues. This paper shows that the concept of digital twins can be usefully applied to older adults. However, it also shows that the required technologies must be further developed and that ethical issues must be discussed in an appropriate framework. Finally, the paper explains how the e-VITA project could pave the way towards developing a Digital Twin for Ageing.
... Also, to rival existing state-of-the-art dialogue systems for social robots [Jokinen and Wilcock, 2017] is not our goal. Hence we neither present a sophisticated model such as [Jokinen, 2018] nor compare ourselves to any such model. ...
... A significant body of work has been devoted to address the above challenges, focusing on the identification of the intended dialogue topic (domain selection problem) (Jokinen & Wilcock, 2017). To this end, several data-driven approaches have been developed to understand the discussion context and to trigger the necessary feedback to the users (Hiraoka, Neubig, Yoshino, Toda, & Nakamura, 2017;Lee, Jung, Kim, & Lee, 2009;Planells, Hurtado, Segarra, & Sanchis, 2013;Ryu et al., 2012;Wang et al., 2014). ...
Article
Dialogue-based systems often consist of several components, such as communication analysis, dialogue management, domain reasoning, and language generation. In this paper, we present Converness, an ontology-driven, rule-based framework to facilitate domain reasoning for conversational awareness in multimodal dialogue-based agents. Converness uses Web Ontology Language 2 (OWL 2) ontologies to capture and combine the conversational modalities of the domain, for example, deictic gestures and spoken utterances, fuelling conversational topic understanding and interpretation using description logics and rules. At the same time, defeasible rules are used to couple domain and user-centred knowledge to further assist the interaction with end users, facilitating advanced conflict resolution and personalised context disambiguation. We illustrate the capabilities of the framework through its integration into a multimodal dialogue-based agent that serves as an intelligent interface between users (elderly, caregivers, and health experts) and an ambient assistive living platform in real home settings.
Chapter
The study of gesture (the movements people make with their hands when talking) has grown into a well-established field and research is still being pushed into exciting new directions. Bringing together a team of leading scholars, this Handbook provides a comprehensive overview of gesture studies, combining historical overviews as well as current, concise snapshots of state-of-the-art, multidisciplinary research. Organised into five thematic parts, it considers the roles of both psychological and interactional processes in gesture use, and considers the status of gesture in relation to language. Attention is given to different theoretical and methodological frameworks for studying gesture, including semiotic, linguistic, cognitive, developmental, and phenomenological theories and observational, experimental, corpus linguistic, ethnographic, and computational methods. It also contains practical guidelines for gesture analysis along with surveys of empirical research. Wide-ranging yet accessible, it is essential reading for academic researchers and students in linguistics and cognitive sciences.
Article
This article explores the integration of digital games, specifically Minecraft, within Sámi educational contexts. The qualitative case study was based on a development project in Sámi teacher education, exploring key aspects highlighted by pre-service teachers when using Minecraft during their practice periods with primary school children. Given the significant role teachers play in instructional organisation, this article aims to identify specific areas where pre-service teachers may benefit from additional support and training to enhance their preparedness for the classroom. Incorporating Sámi educational frameworks and digital competencies into Sámi teacher education, we utilised the digital competence of future teachers (DCFT) model to guide data collection and analysis. This involved distributing anonymous online questionnaires to pre-service teachers (n = 17). Our findings indicate the transformative potential of digital games in Sámi education, particularly in the use of Sámi as a gaming language and Sámi cultural game content. The article emphasises the relevance of digital technologies in preserving and revitalising Indigenous languages and cultures to better understand how to leverage these tools effectively in culturally relevant ways. By utilising contemporary digital tools within an Indigenous education, educators can enhance cultural continuity and empower Indigenous communities in the digital age.
Chapter
While Artificial Intelligence and robotics are often thought of together as a matter of course in public discourse, historically two separate disciplines have developed: Artificial Intelligence is concerned with the formalisation and algorithmisation of reasoning and problem solving, while robotics deals with the machine perception of the environment and the autonomous execution of actions. In this chapter, social robotics is understood as a link between the two disciplines. Social robots are robots that orient themselves to the values and norms of social interaction. For this to become possible, methods from robotics must be extended by a social dimension. One possibility is to use techniques from Artificial Intelligence to explicitly model social values and norms. The contribution addresses the so-called value-alignment problem in Artificial Intelligence in general and demonstrates, using social robot navigation as a specific example, how techniques from Artificial Intelligence can be used for social robotics so that robots align their actions with human values and norms.
Article
The first part of a two-part series of papers provides a survey on recent advances in Deep Reinforcement Learning (DRL) applications for solving partially observable Markov decision process (POMDP) problems. Reinforcement Learning (RL) is an approach to simulating the human’s natural learning process, whose key idea is to let the agent learn by interacting with the stochastic environment. The fact that the agent has limited access to the information of the environment enables AI to be applied efficiently in most fields that require self-learning. Although efficient algorithms are being widely used, it seems essential to have an organized investigation, so that we can make good comparisons and choose the best structures or algorithms when applying DRL in various applications. In this overview, we introduce Markov Decision Process (MDP) problems and Reinforcement Learning, and applications of DRL for solving POMDP problems in games, robotics, and natural language processing. A follow-up paper will cover applications in transportation, communications and networking, and industries.
Article
Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification regions: lengths and overlaps), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling) and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase the state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross validation, paired t-test, rank, and Pearson correlations. The results revealed several settings in a 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms carrying vocal tract and excitation information also score well. It was found that even basic processing like pre-emphasis, segmentation, magnitude modifications, etc., can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
Article
Multi-domain spoken dialogue is a challenging field where the objective of most proposed approaches is to mimic human–human dialogue. This paper proposes to tackle the domain selection problem in the context of multi-domain spoken dialogue as a set theory problem. First, we built each dialogue domain as an ontology following an architecture with some rules to respect. Second, each ontology is considered as a set and its concepts are the elements. Third, an ontology-based classifier is used to map the user sentence into a set of ontology concepts and to generate an intersection between these concepts. Finally, a new turn analysis and domain selection algorithm is proposed to infer the intended domain from the user sentence using the intersection set and three techniques, namely Domain Rewards, Dominant Concept, and Current Domain. To evaluate the proposed approach, a corpus of 120 simulated dialogues was built to cover four application domains. In our experiment, the assessment of the system is performed by considering all possibilities of a natural verbal interaction where a change of semantic context occurs during the dialogue. The obtained results show that the system accuracy reaches a satisfactory performance of 83.13%, while the average number of turns per dialogue is 6.79.