Article

Forming Real-World Human-Robot Cooperation for Tasks With General Goal


Abstract

In human-robot cooperation, the robot cooperates with a human to accomplish a task together. Existing approaches assume the human has a specific goal during the cooperation, which the robot infers and acts toward. However, in real-world environments, a human usually only has a general goal (e.g., a general direction or area in motion planning) at the beginning of the cooperation, which needs to be refined into a specific goal (i.e., an exact position) during the cooperation. This specification process is interactive and dynamic, and depends on the environment and the partner's behavior. A robot that does not consider the goal specification process may frustrate the human partner, prolong the time needed to reach agreement, and compromise team performance. This work presents the Evolutionary Value Learning approach, which models the dynamics of the goal specification process with State-based Multivariate Bayesian Inference and goal-specificity-related features. This model enables the robot to actively enhance the human's goal specification process and to find a cooperative policy in a Deep Reinforcement Learning manner. Our method outperforms existing methods with a faster goal specification process and better team performance in a dynamic ball balancing task with real human subjects.
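The abstract does not spell out the inference details; as a rough, hypothetical illustration of how a general goal can be tracked as it narrows to a specific one (the variable names and Gaussian likelihood below are assumptions, not the authors' model), a robot could maintain a belief over candidate specific goals and measure its specificity by entropy:

```python
import numpy as np

def update_goal_belief(belief, candidate_goals, human_input, sigma=0.5):
    """One Bayesian update of the belief over candidate specific goals.

    belief:          (K,) prior probabilities over K candidate goals
    candidate_goals: (K, D) candidate goal positions
    human_input:     (D,) observed human command (e.g., pushed-toward position)
    sigma:           assumed noise scale of the human input around the true goal
    """
    # Gaussian likelihood of the observed input under each candidate goal
    sq_dist = np.sum((candidate_goals - human_input) ** 2, axis=1)
    likelihood = np.exp(-0.5 * sq_dist / sigma ** 2)
    posterior = belief * likelihood
    return posterior / posterior.sum()

def goal_specificity(belief):
    """Low belief entropy means the general goal has narrowed to a specific one."""
    entropy = -np.sum(belief * np.log(belief + 1e-12))
    return 1.0 - entropy / np.log(len(belief))
```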


... For evaluating robot performance, established quantitative measures remain effective, such as robot execution time [257], trajectory length [258], and movement precision [259]. Nevertheless, human-centric needs and evaluation of the whole HRC system call for different considerations. ...
Article
Full-text available
Human-Robot Collaboration (HRC) plays a pivotal role in smart manufacturing, given its strict requirements of human-centricity, sustainability, and resilience. However, existing HRC development mainly follows either a human-dominant or a robot-dominant manner, where human and robotic agents reactively perform operations by following pre-defined instructions, and is thus far from an efficient integration of robotic automation and human cognition. Such rigid human-robot relations are not qualified for complex manufacturing tasks and cannot ease the physical and psychological load of human operators. In response to these realistic needs, this paper presents our arguments on the trend, concept, systematic architecture, and enabling technologies of Proactive HRC, serving as a prospective vision and research topic for future work in the human-centric smart manufacturing era. The human-robot symbiotic relation is evolving with a 5C intelligence, from Connection, Coordination, Cyber, and Cognition to Coevolution, finally embracing mutual-cognitive, predictable, and self-organising intelligent capabilities, i.e., Proactive HRC. With proactive robot control, multiple human and robotic agents collaboratively operate on manufacturing tasks, considering each other's operation needs, desired resources, and qualified complementary capabilities. This paper also highlights current challenges and future research directions, which deserve more research effort for real-world applications of Proactive HRC. It is hoped that this work can attract more open discussion and provide useful insights to both academic and industrial practitioners in their exploration of human-robot flexible production.
Article
Augmented intelligence (AuI) is a concept that combines human intelligence (HI) and artificial intelligence (AI) to leverage their respective strengths. While AI typically aims to replace humans, AuI integrates humans into machines, recognizing their irreplaceable role. Meanwhile, human-in-the-loop reinforcement learning (HITL-RL) is a semisupervised algorithm that integrates humans into the traditional reinforcement learning (RL) algorithm, enabling autonomous agents to gather inputs from both humans and environments, learn, and select optimal actions across various environments. Both AuI and HITL-RL are still in their infancy. Based on AuI, we propose and investigate three separate concept designs for HITL-RL: HI-AI, AI-HI, and parallel-HI-and-AI approaches, each differing in the order of HI and AI involvement in decision making. The literature on AuI and HITL-RL offers insights into integrating HI into existing concept designs. A preliminary study in an Atari game offers insights for future research directions. Simulation results show that human involvement maintains RL convergence and improves system stability, while achieving approximately similar average scores to traditional Q-learning in the game. Future research directions are proposed to encourage further investigation in this area.
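As a hedged illustration of human-in-the-loop action selection in tabular Q-learning (a simplified sketch, not the paper's concept designs in full), a human suggestion can simply take precedence over the agent's own epsilon-greedy choice:

```python
import random
from collections import defaultdict

class HITLQLearner:
    """Tabular Q-learning where a human may override the agent's action choice."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state, human_action=None):
        # Human input, when present, takes precedence over the agent's choice.
        if human_action is not None:
            return human_action
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[state][a])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```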
Article
Full-text available
In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and present three application scenarios: (1) the high-level decision-making problem in highway driving, (2) inferring user preferences in a social network (Twitter), and (3) the management of water release in Lake Como. For each of these scenarios, we provide formalization, experiments, and a discussion to interpret the obtained results.
Article
Full-text available
This work aims to improve the efficiency of deep reinforcement learning (DRL)-based methods for robotic trajectory planning in unstructured working environments with obstacles. Different from the traditional sparse reward function, this paper presents two brand-new dense reward functions. First, an azimuth reward function is proposed to accelerate the learning process locally and yield a more reasonable trajectory by modeling the position and orientation constraints, which dramatically reduces the blindness of exploration. To further improve efficiency, a reward function at the subtask level is proposed to provide global guidance for the agent in DRL. The subtask-level reward function is designed under the assumption that the task can be divided into several subtasks, which greatly reduces invalid exploration. Extensive experiments show that the proposed reward functions are able to improve the convergence rate by up to three times with state-of-the-art DRL methods: the convergence mean improves by 2.25%-13.22%, and the standard deviation decreases by 10.8%-74.5%.
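The exact reward definitions are not reproduced in this summary; a minimal sketch of a dense reward that combines distance to the goal with heading alignment (hypothetical weights, not the paper's azimuth formulation) could look like:

```python
import numpy as np

def dense_reward(ee_pos, ee_dir, goal_pos, w_pos=1.0, w_azimuth=0.5):
    """Dense reward: penalize distance to the goal and reward alignment of the
    end-effector's heading with the direction toward the goal."""
    to_goal = goal_pos - ee_pos
    dist = np.linalg.norm(to_goal)
    if dist < 1e-9:
        return 0.0
    # Cosine of the angle between current heading and goal direction, in [-1, 1]
    alignment = np.dot(ee_dir / np.linalg.norm(ee_dir), to_goal / dist)
    return -w_pos * dist + w_azimuth * alignment
```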
Article
Full-text available
Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem which enables an efficient approximation algorithm for determining the set of maximally-informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.
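The set-cover reduction admits the usual greedy approximation; a hedged sketch, assuming each demonstration is summarized by the set of reward-equivalence constraints it resolves (a hypothetical representation, not the paper's exact formulation):

```python
def greedy_demo_selection(demo_constraints):
    """Greedy set cover: pick demonstrations until their constraint sets
    jointly cover every constraint needed to pin down the reward class.

    demo_constraints: dict mapping demo id -> set of covered constraints
    """
    universe = set().union(*demo_constraints.values())
    covered, selected = set(), []
    while covered != universe:
        # Pick the demo covering the most still-uncovered constraints
        best = max(demo_constraints, key=lambda d: len(demo_constraints[d] - covered))
        gain = demo_constraints[best] - covered
        if not gain:
            break  # remaining constraints cannot be covered
        selected.append(best)
        covered |= gain
    return selected
```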
Article
Full-text available
Machine learning (ML) is the fastest growing field in computer science, and health informatics is among its greatest challenges. The goal of ML is to develop algorithms which can learn and improve over time and can be used for predictions. Most ML researchers concentrate on automatic machine learning (aML), where great advances have been made, for example, in speech recognition, recommender systems, or autonomous vehicles. Automatic approaches greatly benefit from big data with many training sets. However, in the health domain we are sometimes confronted with a small number of data sets or rare events, where aML approaches suffer from insufficient training samples. Here interactive machine learning (iML) may be of help, having its roots in reinforcement learning, preference learning, and active learning. The term iML is not yet in common use, so we define it as "algorithms that can interact with agents and can optimize their learning behavior through these interactions, where the agents can also be human." This "human-in-the-loop" can be beneficial in solving computationally hard problems, e.g., subspace clustering, protein folding, or k-anonymization of health data, where human expertise can help to reduce an exponential search space through heuristic selection of samples. Therefore, what would otherwise be an NP-hard problem reduces greatly in complexity through the input and assistance of a human agent involved in the learning phase.
Article
Full-text available
Background: Scene supervision is a major tool for making medical robots safer and more intuitive. This paper shows an approach to efficiently use 3D cameras within the surgical operating room to enable safe human-robot interaction and action perception. Additionally, the presented approach aims to make 3D-camera-based scene supervision more reliable and accurate. Methods: A camera system composed of multiple Kinect and time-of-flight cameras has been designed, implemented, and calibrated. Calibration and object detection as well as people-tracking methods have been designed and evaluated. Results: The camera system shows a good registration accuracy of 0.05 m. The tracking of humans is reliable and accurate and has been evaluated in an experimental setup using operating clothing. The robot detection shows an error of around 0.04 m. Conclusions: The robustness and accuracy of the approach allow for integration into the modern operating room. The data output can be used directly for situation and workflow detection as well as collision avoidance.
Conference Paper
Full-text available
In this paper we propose a human-in-the-loop approach for teaching robots how to solve part assembly tasks. In the proposed setup the human tutor controls the robot through a haptic interface and a hand-held impedance control interface. The impedance control interface is based on a linear spring-return potentiometer that maps the button position to the robot arm stiffness. This setup allows the tutor to modulate the robot compliance based on the given task requirements. The demonstrated motion and stiffness trajectories are encoded using Dynamical Movement Primitives and learnt using Locally Weighted Regression. To validate the proposed approach we performed experiments using a KUKA Lightweight Robot and a HapticMaster robot. The task of the experiment was to teach the robot how to perform an assembly task involving sliding a bolt fitting inside a groove in order to mount two parts together. Different stiffness levels were required at different stages of task execution to accommodate the interaction of the robot with the environment and possible human-robot cooperation.
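Dynamical Movement Primitives fitted with locally weighted regression are standard tools; a compressed one-dimensional sketch of fitting the forcing term from a single demonstration (simplified gains and basis placement, not the authors' implementation) follows:

```python
import numpy as np

def fit_dmp_forcing(y, dt, n_basis=20, alpha_z=25.0, beta_z=6.25, alpha_x=8.0):
    """Fit the forcing term of a 1-D discrete DMP with locally weighted regression.

    y: demonstrated trajectory samples (T,)
    Returns the Gaussian basis centers, widths, and weights of the forcing term.
    """
    yd = np.gradient(y, dt)
    ydd = np.gradient(yd, dt)
    y0, g = y[0], y[-1]

    # Canonical-system phase decaying from 1 toward 0 over the demonstration
    t = np.arange(len(y)) * dt
    x = np.exp(-alpha_x * t / t[-1])

    # Target forcing term from the transformation-system equation
    f_target = ydd - alpha_z * (beta_z * (g - y) - yd)

    # Gaussian basis functions placed in phase space
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))
    widths = 1.0 / (np.diff(centers, append=centers[-1]) ** 2 + 1e-6)
    psi = np.exp(-widths * (x[:, None] - centers[None, :]) ** 2)

    # Locally weighted regression: one scalar weight per basis function
    s = x * (g - y0 + 1e-9)
    weights = (psi * s[:, None] * f_target[:, None]).sum(0) / (psi * s[:, None] ** 2).sum(0)
    return centers, widths, weights
```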
Article
Full-text available
As robots venture into new application domains as autonomous vehicles on the road or as domestic helpers at home, they must recognize human intentions and behaviors in order to operate effectively. This paper investigates a new class of motion planning problems with uncertainty in human intention. We propose a method for constructing a practical model by assuming a finite set of unknown intentions. We first construct a motion model for each intention in the set and then combine these models together into a single Mixed Observability Markov Decision Process (MOMDP), which is a structured variant of the more common Partially Observable Markov Decision Process (POMDP). By leveraging the latest advances in POMDP/MOMDP approximation algorithms, we can construct and solve moderately complex models for interesting robotic tasks. Experiments in simulation and with an autonomous vehicle show that the proposed method outperforms common alternatives because of its ability to recognize intentions and use the information effectively for decision making.
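With a finite set of hypothesized intentions, the hidden part of the MOMDP state can be tracked with a simple Bayes filter; a hedged sketch assuming per-intention motion likelihood models (hypothetical interface, not the paper's solver):

```python
import numpy as np

def update_intention_belief(belief, observed_motion, motion_models):
    """Bayes filter over a finite set of unknown human intentions.

    belief:          (K,) prior probability of each intention
    observed_motion: the human motion observed at this step
    motion_models:   list of K callables, each returning the likelihood
                     p(observed_motion | intention k)
    """
    likelihoods = np.array([m(observed_motion) for m in motion_models])
    posterior = belief * likelihoods
    return posterior / posterior.sum()
```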
Article
Full-text available
Autonomous robots are capable of navigating on their own. Shared control approaches, however, allow humans to make some navigation decisions. This is typically done by overriding either the human's or the robot's control in specific situations. In this paper, we propose a method that allows cooperation between humans and robots at each point of any given trajectory, so that both have some weight in the emergent behavior of the mobile robot. This is achieved by evaluating their efficiencies at each time instant and combining their commands into a single order. In order to achieve a seamless combination, this procedure is integrated into a bottom-up architecture via a reactive layer. We have tested the proposed method using a real robot and several volunteers, and results have been satisfactory from both a quantitative and a qualitative point of view.
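A hedged sketch of the blending idea (the efficiency measure below is an assumption for illustration, not the paper's metric): the two commands are combined at every control step, weighted by each partner's current efficiency.

```python
import numpy as np

def blend_commands(human_cmd, robot_cmd, human_eff, robot_eff):
    """Combine human and robot velocity commands, weighting each partner
    by its current efficiency estimate."""
    w_h = human_eff / (human_eff + robot_eff + 1e-9)
    return w_h * np.asarray(human_cmd) + (1.0 - w_h) * np.asarray(robot_cmd)

def efficiency(cmd, goal_dir):
    """Example efficiency measure (an assumption): alignment of a command
    with the unit direction toward the goal."""
    cmd = np.asarray(cmd, dtype=float)
    n = np.linalg.norm(cmd)
    if n < 1e-9:
        return 0.0
    return max(0.0, float(np.dot(cmd / n, goal_dir)))
```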
Conference Paper
Full-text available
As computational learning agents move into domains that incur real costs (e.g., autonomous driving or financial investment), it will be necessary to learn good policies without numerous high-cost learning trials. One promising approach to reducing sample complexity of learning a task is knowledge transfer from humans to agents. Ideally, methods of transfer should be accessible to anyone with task knowledge, regardless of that person's expertise in programming and AI. This paper focuses on allowing a human trainer to interactively shape an agent's policy via reinforcement signals. Specifically, the paper introduces "Training an Agent Manually via Evaluative Reinforcement," or TAMER, a framework that enables such shaping. Differing from previous approaches to interactive shaping, a TAMER agent models the human's reinforcement and exploits its model by choosing actions expected to be most highly reinforced. Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.
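A minimal sketch of the TAMER idea, assuming a linear model of the human's reinforcement over state features (simplified, not the original implementation): the agent fits the human signal and greedily picks the action predicted to be most highly reinforced.

```python
import numpy as np

class TamerAgent:
    """Minimal TAMER-style agent with a per-action linear model of the
    human's reinforcement signal."""

    def __init__(self, n_features, n_actions, lr=0.05):
        self.w = np.zeros((n_actions, n_features))
        self.lr = lr

    def act(self, features):
        # Choose the action with the highest predicted human reinforcement
        return int(np.argmax(self.w @ features))

    def update(self, features, action, human_reward):
        # Incremental least-squares step toward the observed human signal
        prediction = self.w[action] @ features
        self.w[action] += self.lr * (human_reward - prediction) * features
```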
Article
Full-text available
As robots are gradually leaving highly structured factory environments and moving into human populated environments, they need to possess more complex cognitive abilities. They do not only have to operate efficiently and safely in natural, populated environments, but also be able to achieve higher levels of cooperation and communication with humans. Human–robot collaboration (HRC) is a research field with a wide range of applications, future scenarios, and potentially a high economic impact. HRC is an interdisciplinary research area comprising classical robotics, cognitive sciences, and psychology. This paper gives a survey of the state of the art of HRC. Established methods for intention estimation, action planning, joint action, and machine learning are presented together with existing guidelines to hardware design. This paper is meant to provide the reader with a good overview of technologies and methods for HRC.
Article
We present a data-driven shared control algorithm that can be used to improve a human operator’s control of complex dynamic machines and achieve tasks that would otherwise be challenging, or impossible, for the user on their own. Our method assumes no a priori knowledge of the system dynamics. Instead, both the dynamics and information about the user’s interaction are learned from observation through the use of a Koopman operator. Using the learned model, we define an optimization problem to compute the autonomous partner’s control policy. Finally, we dynamically allocate control authority to each partner based on a comparison of the user input and the autonomously generated control. We refer to this idea as model-based shared control (MbSC). We evaluate the efficacy of our approach with two human subjects studies consisting of 32 total participants (16 subjects in each study). The first study imposes a linear constraint on the modeling and autonomous policy generation algorithms. The second study explores the more general, nonlinear variant. Overall, we find that MbSC significantly improves task and control metrics when compared with a natural learning, or user only, control paradigm. Our experiments suggest that models learned via the Koopman operator generalize across users, indicating that it is not necessary to collect data from each individual user before providing assistance with MbSC. We also demonstrate the data efficiency of MbSC and, consequently, its usefulness in online learning paradigms. Finally, we find that the nonlinear variant has a greater impact on a user’s ability to successfully achieve a defined task than the linear variant.
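A hedged sketch of learning an approximate Koopman operator by least squares over lifted states (the feature map `lift` is an assumption; the paper's formulation also incorporates the user's input in the learned model):

```python
import numpy as np

def fit_koopman(states, next_states, lift):
    """Approximate a Koopman operator K such that lift(x_next) ≈ K @ lift(x).

    states, next_states: (T, D) arrays of observed transitions
    lift: feature map from a state to a higher-dimensional observable vector
    """
    Phi = np.array([lift(x) for x in states])          # (T, M)
    Phi_next = np.array([lift(x) for x in next_states])  # (T, M)
    # Least-squares fit of Phi @ K.T ≈ Phi_next
    K_T, *_ = np.linalg.lstsq(Phi, Phi_next, rcond=None)
    return K_T.T

# Usage sketch: predict the next lifted state, then use the learned model
# inside an optimal-control problem to generate the autonomous partner's policy.
# next_lifted = fit_koopman(X, X_next, lift) @ lift(x_now)
```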
Chapter
As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users’ objectives as they go. We argue that a meaningful solution to value alignment must combine multi-agent decision theory with rich mathematical models of human cognition, enabling robots to tap into people’s natural collaborative capabilities. We present a solution to the cooperative inverse reinforcement learning (CIRL) dynamic game based on well-established cognitive models of decision making and theory of mind. The solution captures a key reciprocity relation: the human will not plan her actions in isolation, but rather reason pedagogically about how the robot might learn from them; the robot, in turn, can anticipate this and interpret the human’s actions pragmatically. To our knowledge, this work constitutes the first formal analysis of value alignment grounded in empirically validated cognitive models.
Article
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
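A minimal NumPy rendering of the clipped surrogate objective described above (per-sample form, omitting the value-function and entropy terms used in practice):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio:     pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage: advantage estimate for that action
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)
```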
Article
Objective: The goal of this paper is to achieve a novel 3-D-gaze-based human-robot-interaction modality, with which a user with motion impairment can intuitively express what tasks he/she wants the robot to do by directly looking at the object of interest in the real world. Toward this goal, we investigate 1) the technology to accurately sense where a person is looking in real environments and 2) the method to interpret the human gaze and convert it into an effective interaction modality. Looking at a specific object reflects what a person is thinking related to that object, and the gaze location contains essential information for object manipulation. Methods: A novel gaze vector method is developed to accurately estimate the 3-D coordinates of the object being looked at in real environments, and a novel interpretation framework that mimics human visuomotor functions is designed to increase the control capability of gaze in object grasping tasks. Results: High tracking accuracy was achieved using the gaze vector method. Participants successfully controlled a robotic arm for object grasping by directly looking at the target object. Conclusion: Human 3-D gaze can be effectively employed as an intuitive interaction modality for robotic object manipulation. Significance: It is the first time that 3-D gaze is utilized in a real environment to command a robot for a practical application. Three-dimensional gaze tracking is promising as an intuitive alternative for human-robot interaction especially for disabled and elderly people who cannot handle the conventional interaction modalities.
Conference Paper
Shared autonomy integrates user input with robot autonomy in order to control a robot and help the user to complete a task. Our work aims to improve the performance of such a human-robot team: the robot tries to guide the human towards an effective strategy, sometimes against the human's own preference, while still retaining his trust. We achieve this through a principled human-robot mutual adaptation formalism. We integrate a bounded-memory adaptation model of the human into a partially observable stochastic decision model, which enables the robot to adapt to an adaptable human. When the human is adaptable, the robot guides the human towards a good strategy, maybe unknown to the human in advance. When the human is stubborn and not adaptable, the robot complies with the human's preference in order to retain their trust. In the shared autonomy setting, unlike many other common human-robot collaboration settings, only the robot actions can change the physical state of the world, and the human and robot goals are not fully observable. We address these challenges and show in a human subject experiment that the proposed mutual adaptation formalism improves human-robot team performance, while retaining a high level of user trust in the robot, compared to the common approach of having the robot strictly following participants' preference.
Conference Paper
An algorithm called gaze-based multiple model intention estimator (G-MMIE) is presented for early prediction of the goal location (intention) of human reaching actions. To capture the complexity of human arm reaching motion, a neural network (NN) is used to represent the arm motion dynamics. The trajectories of the arm motion for reaching tasks are modeled by using a dynamical system with contracting behavior towards the goal location. The contraction behavior ensures that the model trajectories will converge to the goal location. The NN training is subjected to contraction analysis constraints. In order to use the motion model learned from a few demonstrations with new scenarios and multiple objects, an interacting multiple model (IMM) framework is used. The multiple models are obtained by translating the origin of the contracting system to different known object locations. Each model corresponds to the reaching motion that ends at a certain object location. Since humans tend to look in the direction of the object they are reaching for, the prior probabilities of the models are calculated based on the human eye gaze. The posterior probabilities of the models are calculated through interacting model matched filtering carried out using extended Kalman filters (EKFs). The model or the object location with the highest posterior probability is chosen to be the estimate of the goal location. Experimental results suggest that the G-MMIE algorithm is able to adapt to arbitrary sequences of reaching motions and the gaze-based prior outperforms the uniform prior in terms of intention inference accuracy and average time of inference.
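A simplified sketch of the model-probability update at the heart of such an interacting-multiple-model scheme, combining a gaze-based prior with Gaussian likelihoods of each filter's innovation (hypothetical interface, not the G-MMIE code):

```python
import numpy as np

def update_model_probabilities(prior_from_gaze, residuals, residual_covs):
    """Update goal-model probabilities from per-model filter residuals.

    prior_from_gaze: (K,) prior over candidate goal locations, from eye gaze
    residuals:       list of K innovation vectors, one per model's EKF
    residual_covs:   list of K innovation covariance matrices
    """
    likelihoods = []
    for r, S in zip(residuals, residual_covs):
        r, S = np.atleast_1d(r), np.atleast_2d(S)
        norm = 1.0 / np.sqrt((2 * np.pi) ** len(r) * np.linalg.det(S))
        likelihoods.append(norm * np.exp(-0.5 * r @ np.linalg.solve(S, r)))
    posterior = prior_from_gaze * np.array(likelihoods)
    return posterior / posterior.sum()
```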
Article
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is. In contrast to classical IRL, where the human is assumed to act optimally in isolation, optimal CIRL solutions produce behaviors such as active teaching, active learning, and communicative actions that are more effective in achieving value alignment. We show that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, prove that optimality in isolation is suboptimal in CIRL, and derive an approximate CIRL algorithm.
Conference Paper
From exploring planets to cleaning homes, the reach and versatility of robotics is vast. The integration of actuation, sensing and control makes robotics systems powerful, but complicates their simulation. This paper introduces a versatile, scalable, yet powerful general-purpose robot simulation framework called V-REP. The paper discusses the utility of a portable and flexible simulation framework that allows for direct incorporation of various control techniques. This renders simulations and simulation models more accessible to the general public by reducing the complexity of simulation model deployment. It also increases productivity by offering built-in and ready-to-use functionalities, as well as a multitude of programming approaches. This allows for a multitude of applications including rapid algorithm development, system verification, rapid prototyping, and deployment for cases such as safety/remote monitoring, training and education, hardware control, and factory automation simulation.
Article
Performance degradation assessment has been proposed to realize equipment's near-zero downtime and maximum productivity, and exploring effective indices is crucial for it. In this study, the rolling element bearing is taken as the research object, spectral entropy is proposed as a complementary index for its performance degradation assessment, and an accelerated life test has been performed to collect vibration data over the whole lifetime (normal-fault-failure). Results of both simulation and experiment show that spectral entropy is an effective complementary index.
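A short sketch of a normalized spectral entropy index computed from a vibration signal's power spectrum (one common definition; the paper's exact normalization may differ):

```python
import numpy as np

def spectral_entropy(signal):
    """Normalized spectral entropy of a vibration signal:
    0 for a pure tone, 1 for a flat (white) spectrum."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    p = spectrum / spectrum.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return entropy / np.log2(len(p))
```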
Article
Bayesian probability theory provides a unifying framework for data modeling. In this framework, the overall aims are to find models that are well matched to the data, and to use these models to make optimal predictions. Neural network learning is interpreted as an inference of the most probable parameters for the model, given the training data. The search in model space (i.e., the space of architectures, noise models, preprocessings, regularizers, and weight decay constants) also then can be treated as an inference problem, in which we infer the relative probability of alternative models, given the data. This provides powerful and practical methods for controlling, comparing, and using adaptive network models. This chapter describes numerical techniques based on Gaussian approximations for implementation of these methods.
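In standard notation, the two levels of inference described above are (a sketch of the framework, not the chapter's exact equations): fitting the parameters w of a model H_i to data D, and comparing alternative models by their evidence.

```latex
% Level 1: posterior over the parameters of a given model H_i
P(\mathbf{w} \mid D, \mathcal{H}_i)
  = \frac{P(D \mid \mathbf{w}, \mathcal{H}_i)\, P(\mathbf{w} \mid \mathcal{H}_i)}
         {P(D \mid \mathcal{H}_i)},
\qquad
% Level 2: ranking alternative models by their evidence
P(\mathcal{H}_i \mid D) \;\propto\; P(D \mid \mathcal{H}_i)\, P(\mathcal{H}_i).
```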
Article
Flexibility and changeability of assembly processes require a close cooperation between the worker and the automated assembly system. The interaction between human and robots improves the efficiency of individual complex assembly processes, particularly when a robot serves as an intelligent assistant. The paper gives a survey about forms of human–machine cooperation in assembly and available technologies that support the cooperation. Organizational and economic aspects of cooperative assembly including efficient component supply and logistics are also discussed.
Conference Paper
Recent research has shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors, where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories.

We employ the principle of maximum entropy to resolve the ambiguity in choosing a distribution over decisions, and provide efficient algorithms for learning and inference for deterministic MDPs; an additional simplifying assumption makes reasoning about non-deterministic MDPs tractable. The resulting distribution is a probabilistic model that normalizes globally over behaviors and can be understood as an extension of chain conditional random fields that incorporates the dynamics of the planning system and extends to the infinite horizon. Our research effort is motivated by the problem of modeling real-world routing preferences of drivers. We apply our approach to route preference modeling using 100,000 miles of collected GPS data of taxi-cab driving, where the structure of the world (i.e., the road network) is known and the actions available (i.e., traversing a road segment) are characterized by road features (e.g., speed limit, number of lanes). In sharp contrast to many imitation learning techniques, our probabilistic model of purposeful behavior integrates seamlessly with other probabilistic methods, including hidden variable techniques. This allows us to extend our route preferences with hidden goals to naturally infer both future routes and destinations based on partial trajectories. A key concern is that demonstrated behavior is prone to noise and imperfect behavior; the maximum entropy approach provides a principled method of dealing with this uncertainty. We discuss several additional advantages in modeling behavior that this technique has over existing approaches to inverse reinforcement learning, including margin methods (Ratliff, Bagnell, & Zinkevich 2006) and those that normalize locally over each state's available actions (Ramachandran & Amir 2007; Neu & Szepesvári 2007).
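A hedged sketch of one gradient step of maximum-entropy IRL with a linear reward, deferring the soft-optimal policy and forward pass to a caller-supplied function (hypothetical interface, not the paper's full algorithm):

```python
import numpy as np

def maxent_irl_step(theta, feat_matrix, demo_svf, expected_svf_fn, lr=0.01):
    """One gradient ascent step of maximum-entropy IRL on linear reward weights.

    theta:           current reward weights, reward(s) = feat_matrix[s] @ theta
    feat_matrix:     (S, F) state feature matrix
    demo_svf:        (S,) empirical state-visitation frequencies from the demos
    expected_svf_fn: callable returning the (S,) expected visitation frequencies
                     under the soft-optimal policy for a given reward vector
                     (e.g., via soft value iteration plus a forward pass)
    """
    reward = feat_matrix @ theta
    expected_svf = expected_svf_fn(reward)
    # Gradient of the log-likelihood: empirical minus expected feature counts
    grad = feat_matrix.T @ (demo_svf - expected_svf)
    return theta + lr * grad
```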
Conference Paper
In this paper, we propose a probabilistic kernel approach to preference learning based on Gaussian processes. A new likelihood function is proposed to capture the preference relations in the Bayesian framework. The generalized formulation is also applicable to tackle many multiclass problems. The overall approach has the advantages of Bayesian methods for model selection and probabilistic prediction. Experimental results compared against the constraint classification approach on several benchmark datasets verify the usefulness of this algorithm.
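One common probit form of such a preference likelihood, under additive Gaussian noise on the latent utilities (a sketch of the idea, not necessarily the paper's exact expression):

```latex
% Probability that item x_i is preferred to x_j, given latent utilities f
% corrupted by Gaussian noise of variance \sigma^2:
P(x_i \succ x_j \mid f) \;=\;
  \Phi\!\left(\frac{f(x_i) - f(x_j)}{\sqrt{2}\,\sigma}\right),
\qquad
\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(t; 0, 1)\, dt .
```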
Article
A collection of papers on the application of Markov decision processes is surveyed and classified according to the use of real life data, structural results and special computational schemes. Observations are made about various features of the applications.
Article
This paper addresses the problem of inverse reinforcement learning (IRL) in Markov decision processes, that is, the problem of extracting a reward function given observed, optimal behavior. IRL may be useful for apprenticeship learning to acquire skilled behavior, and for ascertaining the reward function being optimized by a natural system. We first characterize the set of all reward functions for which a given policy is optimal. We then derive three algorithms for IRL. The first two deal with the case where the entire policy is known; we handle tabulated reward functions on a finite state space and linear functional approximation of the reward function over a potentially infinite state space. The third algorithm deals with the more realistic case in which the policy is known only through a finite set of observed trajectories. In all cases, a key issue is degeneracy: the existence of a large set of reward functions for which the observed policy is optimal. To remove degeneracy...
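The finite-state characterization of rewards consistent with an observed policy can be stated compactly as follows (restated here in standard notation, which is an assumption about the paper's exact symbols):

```latex
% A stationary policy \pi(s) \equiv a_1 is optimal if and only if, for every
% other action a, the reward vector R over states satisfies (componentwise):
(\mathbf{P}_{a_1} - \mathbf{P}_a)\,(\mathbf{I} - \gamma \mathbf{P}_{a_1})^{-1}\,\mathbf{R}
  \;\succeq\; 0
  \qquad \text{for all } a \neq a_1 ,
% where P_a is the transition matrix of action a and \gamma the discount factor.
```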
The paired t test under artificial pairing
  • H A David
  • J L Gunnink
Apprenticeship learning via inverse reinforcement learning
  • P Abbeel
  • A Y Ng