Conference Paper

Real-World Human-Robot Collaborative Reinforcement Learning*

... A crucial robot capability during collaboration is to support mutual learning and adaptation [2]. Deep Reinforcement Learning (RL) methods have recently shown very promising results in real-world learning problems [7], including in Human-Robot Collaboration scenarios [10]. Such frameworks provide the opportunity to study in real time how mutual learning and adaptation between humans and (embodied) Artificial Intelligence (AI) agents develop, and which AI or human behaviour aspects can be manipulated to accelerate learning and adaptation. ...
... The purpose of this work is to investigate ways to accelerate collaborative learning between a human and an agent, and thus to minimize the time spent by a human collaborator during training. First, we examine two variations of the Soft Actor-Critic [6] training algorithm: one that involves only off-line gradient updates (g/u) at fixed intervals [7], and one that also involves a single g/u after each state transition [10]. Subsequently, we explore the impact of the number of off-line g/u throughout training. ...
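As a rough illustration of the two update schedules compared in these excerpts, the sketch below contrasts a single gradient update per state transition with a block of off-line updates at a fixed interval. The `SoftActorCritic`-style agent, replay `buffer`, `gradient_update()` method and environment interface are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch (assumed interfaces, not the authors' code) of the two
# Soft Actor-Critic update schedules discussed above.

def run_episode(env, agent, buffer, updates_per_step=0, offline_updates=0):
    """Run one episode with a configurable gradient-update (g/u) schedule."""
    state, done = env.reset(), False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        # Variant 1: a single g/u after each state transition (updates_per_step=1).
        for _ in range(updates_per_step):
            agent.gradient_update(buffer.sample())
        state = next_state
    # Variant 2: a block of off-line g/u at a fixed interval (here, once per
    # episode); the size of this block is the quantity varied in the study.
    for _ in range(offline_updates):
        agent.gradient_update(buffer.sample())
```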
... Subsequently, we explore the impact of the number of off-line g/u throughout training. Finally, we provide a graphical RL framework for testing human-agent collaborative settings, similar to the game defined in [10]. ...
... Physical interactions can also enable the human to provide online corrective actions during an RL process [94]. It is also possible to directly incorporate the co-adaptation that takes place when a human and robot are learning to perform a new task collaboratively [95]. Here, a human and a robot collaborate to complete a ball maze task, where the teleoperated commands sent to the robot by the human are included in the RL state observations. ...
... When learning a control policy that involves interacting with a human partner that is also learning a policy, neglecting to consider the human adaptation could result in a covariate shift in the task distribution that makes the learned model ineffective at providing an accurate estimation of the human's intent [70,95], unless this co-adaptation is explicitly accounted for. There is much scope for investigating the dynamic interaction that takes place between partners learning a task, with previous works modelling this problem between a human and a robot as a noisy communication process [96][97][98]. ...
Article
Interaction control can exploit the opportunities offered by contact robots physically interacting with their human user, such as assistance targeted to each human user, communication of goals to enable effective teamwork, and task-directed motion resistance in physical training and rehabilitation contexts. Here we review the burgeoning field of interaction control in the control theory and machine learning communities, by analysing the exchange of haptic information between the robot and its human user, and how they share the task effort. We first review the estimation and learning methods used to predict the human user's intent under large uncertainty, variability and noise, and with limited observation of human motion. Based on this motion intent core, typical interaction control strategies are described using a homotopy of shared control parameters. Recent methods of haptic communication and game theory are then presented to consider the co-adaptation of human and robot control and yield versatile interactive control as observed between humans. Finally, the limitations of the presented state of the art are discussed and directions for future research are outlined.
... We previously demonstrated the applicability of human in-the-loop reinforcement learning to learn personalised collaborative policies through direct, real-world and real-time interaction with a human partner, in a synchronous collaborative task [10]. To extend our approach to physical interactions, we want to better understand the role of haptics in human-human and human-robot collaboration, and expose our reinforcement learning agent to the physical channel of communication. ...
... Not only should haptic information allow robots to make more precise inferences of human intention, it should also allow them to generate appropriate interaction forces and torques to signal their strategy to the human. This may compensate for the gap in mutual observability and lack of theory of mind between collaborating humans and robots previously identified as a challenge in real-time human-robot collaboration [10]. ...
Preprint
Full-text available
Intuitive and efficient physical human-robot collaboration relies on the mutual observability of the human and the robot, i.e. the two entities being able to interpret each other's intentions and actions. This is commonly addressed by a myriad of methods involving human sensing or intention decoding, as well as human-robot turn-taking and sequential task planning. However, the physical interaction establishes a rich channel of communication through forces, torques and haptics in general, which is often overlooked in industrial implementations of human-robot interaction. In this work, we investigate the role of haptics in human collaborative physical tasks, to identify how to integrate physical communication in human-robot teams. We present a task to balance a ball at a target position on a board either bimanually by one participant, or dyadically by two participants, with and without haptic information. The task requires that the two sides coordinate with each other, in real time, to balance the ball at the target. We found that with training the completion time and number of velocity peaks of the ball decreased, and that participants gradually became consistent in their braking strategy. Moreover, we found that the presence of haptic information improved performance (decreased completion time) and led to an increase in overall cooperative movements. Overall, our results show that humans can better coordinate with one another when haptic feedback is available. These results also highlight the likely importance of haptic communication in human-robot physical interaction, both as a tool to infer human intentions and to make the robot behaviour interpretable to humans.
... Agents trained using deep reinforcement learning (DRL; i.e., the integration of reinforcement learning with deep neural networks), for example, have been successful in discovering adaptive behavior and strategies in individual [59] and group task contexts [60,61]. Within the context of working with humans in collaborative tasks, such agents can develop control policies that are either user-specific [62] or generalize to a distribution of human strategies during training [63]. By giving meaning to actions with the use of reward functions [64], black-box self-supervised approaches have the ability to provide a "direct fit" [65] between an agent and task-relevant states, assuming there is sufficient sampling of the task environment. ...
Article
Full-text available
Social animals have the remarkable ability to organize into collectives to achieve goals unobtainable to individual members. Equally striking is the observation that despite differences in perceptual-motor capabilities, different animals often exhibit qualitatively similar collective states of organization and coordination. Such qualitative similarities can be seen in corralling behaviors involving the encirclement of prey that are observed, for example, during collaborative hunting amongst several apex predator species living in disparate environments. Similar encirclement behaviors are also displayed by human participants in a collaborative problem-solving task involving the herding and containment of evasive artificial agents. Inspired by the functional similarities in this behavior across humans and non-human systems, this paper investigated whether the containment strategies displayed by humans emerge as a function of the task's underlying dynamics, which shape patterns of goal-directed corralling more generally. This hypothesis was tested by comparing the strategies naïve human dyads adopt during the containment of a set of evasive artificial agents across two disparate task contexts. Despite the different movement types (manual manipulation or locomotion) required in the different task contexts, the behaviors that humans display can be predicted as emergent properties of the same underlying task-dynamic model.
... The studies that use "co-learning" tend to take a more symmetrical approach by looking at agent or robot learning as well as human learning, and pay more attention to the learning process and changing strategies of the human as well, often looking at many repetitions of a task (Ramakrishnan, Zhang, and Shah 2017; C.-S. Lee et al., 2020; C. Lee et al., 2018; Shafti et al., 2020). Studies on co-evolution, on the other hand, monitor a long-term real-world application in which behavior of the human as well as the robot subtly changes over time (Döppner, Derckx, and Schoder 2019). ...
Article
Full-text available
Becoming a well-functioning team requires continuous collaborative learning by all team members. This is called co-learning, conceptualized in this paper as comprising two alternating iterative stages: partners adapting their behavior to the task and to each other (co-adaptation), and partners sustaining successful behavior through communication. This paper focuses on the first stage in human-robot teams, aiming at a method for the identification of recurring behaviors that indicate co-learning. Studying this requires a task context that allows for behavioral adaptation to emerge from the interactions between human and robot. We address the requirements for conducting research into co-adaptation by a human-robot team, and designed a simplified computer simulation of an urban search and rescue task accordingly. A human participant and a virtual robot were instructed to discover how to collaboratively free victims from the rubble of an earthquake. The virtual robot was designed to learn in real time which actions best contributed to good team performance. The interactions between human participants and robots were recorded. The observations revealed patterns of interaction used by human and robot in order to adapt their behavior to the task and to one another. Results therefore show that our task environment enables us to study co-learning, and suggest that more participant adaptation improved robot learning and thus team-level learning. The identified interaction patterns can emerge in similar task contexts, forming a first description and analysis method for co-learning. Moreover, the identification of interaction patterns supports awareness among team members, providing the foundation for human-robot communication about the co-adaptation (i.e., the second stage of co-learning). Future research will focus on these human-robot communication processes for co-learning.
Article
Despite the emergence of various human-robot collaboration frameworks, most are not sufficiently flexible to adapt to users with different habits. In this article, a Multimodal Reinforcement Learning Human-Robot Collaboration (MRLC) framework is proposed. It integrates reinforcement learning into human-robot collaboration and continuously adapts to the user's habits in the process of collaboration with the user to achieve the effect of human-robot cointegration. With the user's multimodal features as states, the MRLC framework collects the user's speech through natural language processing and employs it to determine the reward of the actions made by the robot. Our experiments demonstrate that the MRLC framework can adapt to the user's habits after repeated learning and better understand the user's intention compared to traditional solutions.
Article
Effective team performance often requires that individuals engage in team training exercises. However, organizing team-training scenarios presents economic and logistical challenges and can be prone to trainer bias and fatigue. Accordingly, a growing body of research is investigating the effectiveness of employing artificial agents (AAs) as synthetic teammates in team training simulations, and, relatedly, how to best develop AAs capable of robust, human-like behavioral interaction. Motivated by these challenges, the current study examined whether task dynamical models of expert human herding behavior could be embedded in the control architecture of AAs to train novice actors to perform a complex multiagent herding task. Training outcomes were compared to human-expert trainers, novice baseline performance, and AAs developed using deep reinforcement learning (DRL). Participants’ subjective preferences for the AAs developed using DRL or dynamical models of human performance were also investigated. The results revealed that AAs controlled by dynamical models of human expert performance could train novice actors at levels equivalent to expert human trainers and were also preferred over AAs developed using DRL. The implications for the development of AAs for robust human-AA interaction and training are discussed, including the potential benefits of employing hybrid Dynamical-DRL techniques for AA development.
Chapter
With emerging technologies like robots, mixed-reality systems or mobile devices, machine-provided capabilities are increasing, and so is the complexity of their control and display mechanisms. To address this dichotomy, we propose optimal control as a framework to support users in achieving their high-level goals in human-computer tasks. We reason that it will improve user support over the usual approaches for adaptive interfaces, as its formalism implicitly captures the iterative nature of human-computer interaction. We conduct two case studies to test this hypothesis. First, we propose a model-predictive-control-based optimization scheme that supports end-users to plan and execute robotic aerial videos. Second, we introduce a reinforcement-learning-based method to adapt mixed-reality augmentations based on users' preferences or tasks learned from their gaze interactions with a UI. Our results show that optimal control can better support users' high-level goals in human-computer tasks than common approaches. Optimal control models human-computer interaction as a sequential decision problem which represents its nature and, hence, results in better predictability of user behavior than for other methods. In addition, our work highlights that optimization- and learning-based optimal control have complementary strengths with respect to interface adaptation.
Article
Full-text available
Rapid advances in the field of Deep Reinforcement Learning (DRL) over the past several years have led to artificial agents (AAs) capable of producing behavior that meets or exceeds human-level performance in a wide variety of tasks. However, research on DRL frequently lacks adequate discussion of the low-level dynamics of the behavior itself and instead focuses on meta-level or global-level performance metrics. In doing so, the current literature lacks perspective on the qualitative nature of AA behavior, leaving questions regarding the spatiotemporal patterning of their behavior largely unanswered. The current study explored the degree to which the navigation and route-selection trajectories of DRL agents (i.e., AAs trained using DRL) through simple obstacle-ridden virtual environments were equivalent to (or different from) those produced by human agents. The second and related aim was to determine whether a task-dynamical model of human route navigation could not only be used to capture both human and DRL navigational behavior, but also help identify whether any observed differences in the navigational trajectories of humans and DRL agents were a function of differences in the dynamical environmental couplings.
Article
Full-text available
Many recent studies found signatures of motor learning in neural beta oscillations (13–30 Hz), and specifically in the post-movement beta rebound (PMBR). All these studies used controlled laboratory tasks in which the task was designed to induce the studied learning mechanism. Interestingly, these studies reported opposing dynamics of the PMBR magnitude over learning for error-based and reward-based tasks (increase vs. decrease, respectively). Here, we explored the PMBR dynamics during real-world motor-skill learning in a billiards task using mobile brain imaging. Our EEG recordings highlight the opposing dynamics of PMBR magnitudes (increase vs. decrease) between different subjects performing the same task. The groups of subjects, defined by their neural dynamics, also showed behavioral differences expected for different learning mechanisms. Our results suggest that when faced with the complexity of the real world, different subjects might use different learning mechanisms for the same complex task. We speculate that all subjects combine multi-modal mechanisms of learning, but different subjects have different predominant learning mechanisms.
Article
Full-text available
Deep reinforcement learning (RL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. The primary goal of this article is to provide a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Additionally, we complement the overview with a broader analysis: (i) we revisit previous key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings. (ii) We provide general guidelines to new practitioners in the area: describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research. (iii) We take a more critical tone raising practical challenges of MDRL (e.g., implementation and computational demands). We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists (e.g., RL and MAL) in a joint effort to promote fruitful research in the multiagent community.
Conference Paper
Full-text available
This paper describes a novel approach in human-robot interaction driven by ergonomics. With a clear focus on optimising ergonomics, the approach proposed here continuously observes a human user's posture and, by invoking appropriate cooperative robot movements, brings the user's posture back to an ergonomic optimum whenever required. Effectively, the new protocol optimises the human-robot relative position and orientation as a function of human ergonomics. An RGB-D camera is used to calculate and monitor human joint angles in real time and to determine the current ergonomics state. A total of 6 main causes of low ergonomic states are identified, leading to 6 universal robot responses that allow the human to return to an optimal ergonomics state. The algorithmic framework identifies these 6 causes and controls the cooperating robot to always adapt the environment (e.g. change the pose of the workpiece) in a way that is ergonomically most comfortable for the interacting user. Hence, human-robot interaction is continuously re-evaluated, optimising ergonomics states. The approach is validated through an experimental study, based on established ergonomic methods and their adaptation for real-time application. The study confirms improved ergonomics using the new approach.
Conference Paper
Full-text available
While recent advances in deep reinforcement learning have allowed autonomous learning agents to succeed at a variety of complex tasks, existing algorithms generally require a lot of training data. One way to increase the speed at which agents are able to learn to perform tasks is by leveraging the input of human trainers. Although such input can take many forms, real-time, scalar-valued feedback is especially useful in situations where it proves difficult or impossible for humans to provide expert demonstrations. Previous approaches have shown the usefulness of human input provided in this fashion (e.g., the TAMER framework), but they have thus far not considered high-dimensional state spaces or employed the use of deep learning. In this paper, we do both: we propose Deep TAMER, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks in just a short amount of time with a human trainer. We demonstrate Deep TAMER's success by using it and just 15 minutes of human-provided feedback to train an agent that performs better than humans on the Atari game of Bowling - a task that has proven difficult for even state-of-the-art reinforcement learning methods.
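The core mechanism described above (regress a model H of the human's scalar feedback, then act greedily with respect to it) can be sketched in PyTorch as below. The tiny network and the omission of Deep TAMER's credit assignment of feedback over recent state-action pairs are simplifications, and all names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the TAMER / Deep TAMER idea: learn a model H(s, a)
# of the human trainer's scalar feedback and choose the action expected to
# be most highly reinforced. Simplified; not the paper's implementation.

class HumanFeedbackModel(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state):
        return self.net(state)  # predicted human feedback, one value per action

def act(model, state):
    with torch.no_grad():
        return int(model(state).argmax())  # greedy w.r.t. predicted feedback

def update(model, optimizer, state, action, feedback):
    # Regress the prediction for the taken action onto the scalar feedback
    # actually provided by the human trainer (optimizer supplied by caller,
    # e.g. torch.optim.Adam(model.parameters())).
    loss = (model(state)[action] - feedback) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```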
Article
Full-text available
Multi-agent settings are quickly gathering importance in machine learning. Beyond a plethora of recent work on deep multi-agent reinforcement learning, hierarchical reinforcement learning, generative adversarial networks and decentralized optimization can all be seen as instances of this setting. However, the presence of multiple learning agents in these settings renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method that reasons about the anticipated learning of the other agents. The LOLA learning rule includes an additional term that accounts for the impact of the agent's policy on the anticipated parameter update of the other agents. We show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient update, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the infinitely iterated prisoners' dilemma, while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to infinitely repeated matching pennies, only LOLA agents converge to the Nash equilibrium. We also apply LOLA to a grid world task with an embedded social dilemma using deep recurrent policies. Again, by considering the learning of the other agent, LOLA agents learn to cooperate out of selfish interests.
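Schematically, and in notation assumed here rather than taken from the abstract, with each agent i maximizing a value V_i(\theta_1, \theta_2) and step sizes \alpha and \eta, the naive and first-order LOLA updates for agent 1 can be written as:

\[ \Delta\theta_1^{\text{naive}} = \alpha\,\nabla_{\theta_1} V_1(\theta_1, \theta_2) \]

\[ \Delta\theta_1^{\text{LOLA}} = \alpha\left[ \nabla_{\theta_1} V_1 + \eta\,\big(\nabla_{\theta_2} V_1\big)^{\top} \nabla_{\theta_1}\nabla_{\theta_2} V_2 \right] \]

where \Delta\theta_2 = \eta\,\nabla_{\theta_2} V_2 is agent 2's anticipated parameter update; the second term is the additional term, referred to in the abstract, that accounts for the impact of agent 1's policy on that update.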
Article
Full-text available
Matching the dexterity, versatility and robustness of the human hand is still an unachieved goal in bionics, robotics and neural engineering. A major limitation for hand prosthetics lies in the challenges of reliably decoding user intention from muscle signals when controlling complex robotic hands. Most of the commercially available prosthetic hands use muscle-related signals to decode a finite number of predefined motions and some offer proportional control of open/close movements of the whole hand. Here, in contrast, we aim to offer users flexible control of individual joints of their artificial hand. We propose a novel framework for decoding neural information that enables a user to independently control 11 joints of the hand in a continuous manner, much like we control our natural hands. Towards this end, we instructed 6 able-bodied subjects to perform everyday object manipulation tasks combining both dynamic, free movements (e.g. grasping) and isometric force tasks (e.g. squeezing). We recorded the electromyographic (EMG) and mechanomyographic (MMG) activities of 5 extrinsic muscles of the hand in the forearm, while simultaneously monitoring 11 joints of hand and fingers using a sensorised data glove that tracked the joints of the hand. Instead of learning just a direct mapping from current muscle activity to intended hand movement, we formulated a novel autoregressive approach that combines the context of previous hand movements with instantaneous muscle activity to predict future hand movements. Specifically, we evaluated a linear Vector AutoRegressive Moving Average model with Exogenous inputs (VARMAX) and a novel Gaussian Process (GP) autoregressive framework to learn the continuous mapping from hand joint dynamics and muscle activity to decode intended hand movement. Our GP approach achieves high levels of performance (RMSE of 8°/s and 0.79). Crucially, we use a small set of sensors that allows us to control a larger set of independently actuated degrees of freedom of a hand. This novel undersensored control is enabled through the combination of nonlinear autoregressive continuous mapping between muscle activity and joint angles: the system evaluates the muscle signals in the context of previous natural hand movements. This enables us to resolve ambiguities in situations where muscle signals alone cannot determine the correct action, as we evaluate the muscle signals in their context of natural hand movements. GP autoregression is a particularly powerful approach which not only makes a prediction based on the context but also represents the associated uncertainty of its predictions, thus enabling the novel notion of risk-based control in neuroprosthetics. Our results suggest that GP autoregressive approaches with exogenous inputs lend themselves to natural, intuitive and continuous control in neurotechnology, with a particular focus on prosthetic restoration of natural limb function, where high dexterity is required for complex movements.
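In general form (notation assumed here, not the paper's), an autoregressive decoder with exogenous inputs of this kind belongs to the family

\[ y_{t+1} = f\big(y_t, y_{t-1}, \ldots, y_{t-p},\, u_t\big) + \varepsilon_t, \qquad f \sim \mathcal{GP}\big(0, k(\cdot,\cdot)\big), \]

where y_t are the hand joint angles, u_t the instantaneous EMG/MMG features, and \varepsilon_t observation noise; the GP's predictive variance is what enables the risk-based control mentioned above.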
Article
Full-text available
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
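The reward model in this line of work is fit to pairwise preferences via a Bradley-Terry style likelihood; in notation assumed here, with trajectory segments \sigma^1, \sigma^2 and learned reward \hat{r}:

\[ \hat{P}\big[\sigma^1 \succ \sigma^2\big] = \frac{\exp \sum_t \hat{r}(o^1_t, a^1_t)}{\exp \sum_t \hat{r}(o^1_t, a^1_t) + \exp \sum_t \hat{r}(o^2_t, a^2_t)} \]

\hat{r} is trained by minimizing the cross-entropy between these predictions and the human's preference labels, while the policy is trained with standard RL against \hat{r}.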
Article
Full-text available
In shared autonomy, a user and autonomous system work together to achieve shared goals. To collaborate effectively, the autonomous system must know the user's goal. As such, most prior works follow a predict-then-act model, first predicting the user's goal with high confidence, then assisting given that goal. Unfortunately, confidently predicting the user's goal may not be possible until they have nearly achieved it, causing predict-then-act methods to provide little assistance. However, the system can often provide useful assistance even when confidence for any single goal is low (e.g. move towards multiple goals). In this work, we formalize this insight by modelling shared autonomy as a Partially Observable Markov Decision Process (POMDP), providing assistance that minimizes the expected cost-to-go with an unknown goal. As solving this POMDP optimally is intractable, we use hindsight optimization to approximate. We apply our framework to both shared-control teleoperation and human-robot teaming. Compared to predict-then-act methods, our method achieves goals faster, requires less user input, decreases user idling time, and results in fewer user-robot collisions.
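Under hindsight optimization, the POMDP above is approximated by assuming the goal becomes known after the next action; in notation assumed here, with belief b(g) over goals g, system state x and per-goal cost-to-go Q_g, the assistive action is

\[ a^{*} = \arg\min_{a} \sum_{g} b(g)\, Q_g(x, a), \]

so useful assistance (e.g. moving towards several candidate goals) can be provided even when no single goal is confidently predicted.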
Conference Paper
Full-text available
We present a wearable head-tracking device using inexpensive inertial sensors as an alternative head movement tracking system. This can be used as an indicator of human movement intentions for Brain-Machine Interface (BMI) applications. Our system is capable of tracking head movements at high rates (100 Hz) and achieves R2 = 0.99 with a 2.5° RMSE against a ground-truth motion tracking system. The system tracks head movements over periods in the order of tens of minutes with little drift. The accuracy and precision of our system, together with its low response latency of ≈ 20 ms, make it an unconventional but effective system for human-computer interfacing: the "head mouse" controls the mouse cursor on a display based on head orientation alone, so that it matches the centre of a straight-onward looking user. Our head mouse is suitable for amputees and spinal cord injury patients who have lost control of their upper extremities. We show that naive test subjects are capable of writing text using our system and an on-screen keyboard at a rate of 4.65 words/minute, compared to able-bodied users using a physical computer mouse, who reached 7.85 words/minute. Crucially, we measure the natural head movements of able-bodied computer users, and show that our approach falls within the range of natural head movement parameters.
Conference Paper
Full-text available
Recent research has shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories. We employ the principle of maximum entropy to resolve the ambiguity in choosing a distribution over decisions. We provide efficient algorithms for learning and inference for deterministic MDPs. We rely on an additional simplifying assumption to make reasoning about non-deterministic MDPs tractable. The resulting distribution is a probabilistic model that normalizes globally over behaviors and can be understood as an extension to chain conditional random fields that incorporates the dynamics of the planning system and extends to the infinite horizon. Our research effort is motivated by the problem of modeling real-world routing preferences of drivers. We apply our approach to route preference modeling using 100,000 miles of collected GPS data of taxi-cab driving, where the structure of the world (i.e., the road network) is known and the actions available (i.e., traversing a road segment) are characterized by road features (e.g., speed limit, number of lanes). In sharp contrast to many imitation learning techniques, our probabilistic model of purposeful behavior integrates seamlessly with other probabilistic methods including hidden variable techniques. This allows us to extend our route preferences with hidden goals to naturally infer both future routes and destinations based on partial trajectories. A key concern is that demonstrated behavior is prone to noise and imperfect behavior. The maximum entropy approach provides a principled method of dealing with this uncertainty. We discuss several additional advantages in modeling behavior that this technique has over existing approaches to inverse reinforcement learning, including margin methods (Ratliff, Bagnell, & Zinkevich 2006) and those that normalize locally over each state's available actions (Ramachandran & Amir 2007; Neu & Szepesvári 2007).
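Concretely, for the deterministic-MDP case the maximum-entropy distribution over trajectories \zeta with feature counts f_\zeta and reward weights \theta is (notation assumed here):

\[ P(\zeta \mid \theta) = \frac{\exp\big(\theta^{\top} f_{\zeta}\big)}{Z(\theta)}, \]

and the log-likelihood gradient is the difference between the empirical feature expectations of the demonstrations, \tilde{f}, and the model's expected feature counts:

\[ \nabla_{\theta} \mathcal{L}(\theta) = \tilde{f} - \sum_{\zeta} P(\zeta \mid \theta)\, f_{\zeta}. \]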
Conference Paper
Full-text available
As computational learning agents move into domains that incur real costs (e.g., autonomous driving or financial investment), it will be necessary to learn good policies without numerous high-cost learning trials. One promising approach to reducing sample complexity of learning a task is knowledge transfer from humans to agents. Ideally, methods of transfer should be accessible to anyone with task knowledge, regardless of that person's expertise in programming and AI. This paper focuses on allowing a human trainer to interactively shape an agent's policy via reinforcement signals. Specifically, the paper introduces "Training an Agent Manually via Evaluative Reinforcement," or TAMER, a framework that enables such shaping. Differing from previous approaches to interactive shaping, a TAMER agent models the human's reinforcement and exploits its model by choosing actions expected to be most highly reinforced. Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.
Article
Full-text available
The exploits of Martina Navratilova and Roger Federer represent the pinnacle of motor learning. However, when considering the range and complexity of the processes that are involved in motor learning, even the mere mortals among us exhibit abilities that are impressive. We exercise these abilities when taking up new activities - whether it is snowboarding or ballroom dancing - but also engage in substantial motor learning on a daily basis as we adapt to changes in our environment, manipulate new objects and refine existing skills. Here we review recent research in human motor learning with an emphasis on the computational mechanisms that are involved.
Article
Full-text available
Distributed Artificial Intelligence (DAI) has existed as a subfield of AI for less than two decades. DAI is concerned with systems that consist of multiple independent entities that interact in a domain. Traditionally, DAI has been divided into two sub-disciplines: Distributed Problem Solving (DPS) focuses on the information management aspects of systems with several components working together towards a common goal; Multiagent Systems (MAS) deals with behavior management in collections of several independent entities, or agents. This survey of MAS is intended to serve as an introduction to the field and as an organizational framework. A series of general multiagent scenarios are presented. For each scenario, the issues that arise are described along with a sampling of the techniques that exist to deal with them. The presented techniques are not exhaustive, but they highlight how multiagent systems can be and have been used to build complex systems. When options exist, the tec...
Article
While recent advances in deep reinforcement learning have allowed autonomous learning agents to succeed at a variety of complex tasks, existing algorithms generally require a lot of training data. One way to increase the speed at which agents are able to learn to perform tasks is by leveraging the input of human trainers. Although such input can take many forms, real-time, scalar-valued feedback is especially useful in situations where it proves difficult or impossible for humans to provide expert demonstrations. Previous approaches have shown the usefulness of human input provided in this fashion (e.g., the TAMER framework), but they have thus far not considered high-dimensional state spaces or employed the use of deep learning. In this paper, we do both: we propose Deep TAMER, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks in just a short amount of time with a human trainer. We demonstrate Deep TAMER's success by using it and just 15 minutes of human-provided feedback to train an agent that performs better than humans on the Atari game of Bowling - a task that has proven difficult for even state-of-the-art reinforcement learning methods.
Article
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy - that is, succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
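The maximum entropy objective referred to above augments the expected return with the policy's entropy at every visited state; with temperature \alpha and notation as commonly used for this algorithm:

\[ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]. \]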
Conference Paper
This paper gives an overview of ROS, an open-source robot operating system. ROS is not an operating system in the traditional sense of process management and scheduling; rather, it provides a structured communications layer above the host operating systems of a heterogeneous compute cluster. In this paper, we discuss how ROS relates to existing robot software frameworks, and briefly overview some of the available application software which uses ROS.
Article
Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"
Learning to play the piano with the supernumerary robotic 3rd thumb
  • A Shafti
  • S Haar
  • R M Zaldivar
  • P Guilleminot
A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans
  • J Macglashan
  • R Loftin
  • M L Littman
  • D L Roberts
  • M E Taylor
Human visual attention prediction boosts learning & performance of autonomous driving agents
  • A Makrigiorgos
  • A Shafti
  • A Harston
  • J Gerard
Learning to walk via deep reinforcement learning
  • T Haarnoja
  • A Zhou
  • S Ha
  • J Tan
  • G Tucker
  • S Levine
Soft actor-critic algorithms and applications
  • T Haarnoja
PyTorch: An Imperative Style, High-Performance Deep Learning Library
  • A Paszke