Real-World Human-Robot Collaborative Reinforcement Learning*
Ali Shafti1, Jonas Tjomsland1, William Dudley1 and A. Aldo Faisal1,2
Abstract The intuitive collaboration of humans and intel-
ligent robots (embodied AI) in the real-world is an essential
objective for many desirable applications of robotics. Whilst
there is much research regarding explicit communication, we
focus on how humans and robots interact implicitly, on motor
adaptation level. We present a real-world setup of a human-
robot collaborative maze game, designed to be non-trivial and
only solvable through collaboration, by limiting the actions to
rotations of two orthogonal axes, and assigning each axis to
one player. This results in neither the human nor the agent
being able to solve the game on their own. We use a state-
of-the-art reinforcement learning algorithm for the robotic
agent, and achieve results within 30 minutes of real-world
play, without any type of pre-training. We then use this system
to perform systematic experiments on human/agent behaviour
and adaptation when co-learning a policy for the collaborative
game. We present results on how co-policy learning occurs over
time between the human and the robotic agent resulting in
each participant’s agent serving as a representation of how
they would play the game. This allows us to relate a person’s
success when playing with different agents than their own, by
comparing the policy of the agent with that of their own agent.
I. INTRODUCTION
Human-Machine Interaction methods are changing. Ef-
forts were previously focused on creating “user-friendly”
interfaces, so that human users can better learn to work
with a system that is persistent in its behaviour. With
the ever-increasing success of artificially intelligent agents,
however, the possibilities for creating a fluid, adaptive and
ever improving interaction are increasing. Instead of the
conventional paradigm of the human adapting to the ma-
chine, we want machines that can adapt to humans – a
mutual adaptation happening over time, leading to more
intuitive interactions. To achieve this, we need intelligent
control agents that can learn as they interact with a human
user. Within Robotics, collaborative robots that learn through
human interactions are a topic of active research. A common
tool for this is reinforcement learning (RL) as it follows
the same learning mechanism driving human learning [1].
We are interested in implementing Human-in-the-Loop RL,
i.e. having an agent that interacts and learns directly from a
human counterpart.
Human-in-the-loop RL can also be framed as a specific
case of a multi-agent system, which has been an ongoing
area of research for the past two decades [2], [3]. However,
complications arise from having a human in-the-loop, mainly
due to the stochastic nature of human behaviour, and limited
observability of human intent, reasoning and theory of mind.
1AS, JT, WD and AAF are with the Brain and Behaviour Lab,
Dept. of Bioengineering and Dept. of Computing, Imperial College London
{a.shafti,a.faisal}@imperial.ac.uk
Fig. 1: Our Human-Robot co-learning setup: A ball and maze
game is designed to require two players for success; one player
per rotation axis of the tray. One axis is tele-operated by a human
player, and the other axis by a deep reinforcement learning agent.
The game can only be solved through collaboration.
Similarly, the agent is not fully observable for the human
(e.g. lack of explainability), causing challenges for interac-
tive learning.
In this paper, we present a real-world setup for studies
on how humans and intelligent robotic agents can learn
and adapt together for the completion of a non-trivial col-
laborative motor task. We have designed a human-agent
collaborative maze game, see Figure 1, where a tray needs
to be tilted to navigate a ball to a goal. The human controls
one axis of tilt, and the agent controls the other. Hence, the
agent and the human need to learn to collaborate together.
We report the methods used in creating the setup, followed
by experiments investigating the possibility, and results of
human-robot real-time, real-world collaborative learning.
II. RELATED WORK
There is extensive literature on human-robot collaborative
learning, where usually the human explicitly communicates
with the agent to enhance its learning. In [4], the authors
present TAMER, where the agent learns via real-time qualita-
tive feedback from a human rather than environment reward.
This is extended in [5] to work with deep RL, showing
an example of human guidance in a scenario with a high
dimensional state-space. This outperforms both humans and
state-of-the-art RL algorithms in ATARI Bowling within 15
minutes. Sparse human feedback is investigated in [6], with
less than 1% of agent actions being provided with human
feedback. This method is shown to still perform effectively in
ATARI and MuJoCo environments within an hour of human
time. Further examples can be found in the field of shared
autonomy; environments in which multiple agents (human or
artificial) act at the same time, to achieve shared or individual
goals. Chen et al. [7] developed a robot which was able to
exhibit socially compliant behaviour using deep RL. Reddy
et al. [8] utilised deep RL, to augment a human’s actions
to achieve better performance in a drone-flying task. They
simulated general human models as pilots to handle the large
amount of training samples required. Other works involving
multi-agent systems have shown that multiple artificial agents
can collaborate in complex computer games and outperform
human teams [9].
Other approaches impose heuristics on agents to increase
their aptitude in interacting and communicating with humans.
Deep RL is used in [10] to implicitly infer social norms
regarding pedestrian behaviour to improve motion planning.
In [11], the agent’s speed is modulated based on how unsure
of an action it is, prompting a human advisor as to when
to provide feedback. In [12] human data is used to learn
to predict human gaze behaviour while driving. This is
then used to train better performing self-driving agents,
through prediction of human visual attention. Opponent
modelling and theory of mind have been leveraged to gain
insights regarding RL performance in multi-agent scenarios.
A learning algorithm is developed in [13] that is aware of
the other learning agent, leading to better performance for
both agents if playing the iterated prisoner’s dilemma with
an unaware agent. In [14], meta-learning is used to learn
different strategies for different species of agents, that is,
agents with markedly different behaviours.
All of the aforementioned approaches are either trained in
simulation, operate solely in a simulated world or are based
on sequential or interval-based interactions involving explicit
communication. In this work we are interested instead in
real-world, real-time collaborative learning between a human
and an agent, with implicit communication.
III. METHODS
A. Robotic Setup
We use a Universal Robots UR10 (Universal Robots A/S,
Odense, Denmark) as the robotic manipulator. A 50cm ×
50cm square tray is built out of cardboard material, and
attached to the UR10 end-effector, through a 3D printed
mechanical interface. The tray has barrier walls on all four
sides to keep the ball from falling off, as well as two obstacle
walls, positioned diagonally, with a 9cm opening in the
centre (refer to Figure 1 and Figure 2). A 5cm-diameter
hole is cut near one of the board’s corners, representing
the goal for a rolling ball to fall into. The ball is 6cm in
diameter and made out of transparent acrylic. The game,
i.e. the task of rolling the ball from a given start point on
the tray, to the goal, is solved purely by rotating the tray
around its x-y axes (two orthogonal axes on the tray plane,
with the centre of the square tray as the origin – see Figure
2); no rotation around the z-axis, nor translation along any
axis, is allowed. The human player’s commands are sent via
a smaller tray that they hold and rotate (Figure 1 and 2).
The human tray’s orientation is tracked with three optical
Fig. 2: Overview of our collaborative maze game setup. The human
and the RL agent’s actions are mapped to orthogonal rotation axes
of the tray. The individual effects of human and agent actions are
marked on the ball with green and red arrows respectively. The
states fed to the RL agent are the x and y position of the ball in the
tray frame, the x and y ball velocity, the rotation angles about the two axes
(θ for x and φ for y rotations), and the respective rotational velocities.
markers placed on top of it, through a motion capture system
consisting of Optitrack Flex 13 cameras (NaturalPoint, Inc.
DBA OptiTrack, Corvallis, Oregon, USA). The position of
the ball on the tray is similarly tracked via optical markers
placed inside it (Figure 1 and 2).
To integrate the above, we used the Robot Operating
System (ROS) [15], running on a Linux workstation (ROS
Melodic, Ubuntu 18.04). We added the tray and its attach-
ment interface to the UR10’s Unified Robot Description
Format (URDF1) file within ROS so that its pose can be
continuously tracked through ROS’s transform library (tf22).
The motion capture software was running on a Windows
10 workstation transmitting the motion data through the
NatNet protocol3 over the network. We used the NatNet 3
ROS driver4 to communicate between the motion capture
system and ROS, allowing the pose of the human tray
and the ball to be tracked with ROS’s transform library,
and with respect to the tray frame. Human commands are
then calculated with a P-control approach, the error defined
as the difference of the human tray and the game tray
angles, along the defined axis. To send motion commands
to the robot, we used the jog_arm5 ROS package which
simplifies the communication of smooth velocity commands
to ROS-enabled robots, allowing us to send real-time jogging
commands.
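For illustration, a minimal Python sketch of this proportional mapping from the tracked human tray angle to a jogging velocity command is given below; the gain and velocity limit are placeholder values, not those used in our setup.

# Illustrative sketch of the P-control tele-operation mapping described above.
# The gain KP and the velocity limit MAX_VEL are placeholder values.
import numpy as np

KP = 2.0       # proportional gain (placeholder)
MAX_VEL = 1.0  # rad/s velocity limit for workspace safety (placeholder)

def human_axis_velocity(human_tray_angle: float, game_tray_angle: float) -> float:
    """The error is the difference between the human tray and game tray angles
    about the human-controlled axis; the output is a clipped jogging velocity."""
    error = human_tray_angle - game_tray_angle
    return float(np.clip(KP * error, -MAX_VEL, MAX_VEL))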
B. Reinforcement Learning Setup
We apply a PyTorch [16] implementation of the Soft
Actor-Critic algorithm (SAC) [17], based on OpenAI’s open-
1http://wiki.ros.org/urdf
2http://wiki.ros.org/tf2
3https://optitrack.com/products/natnet-sdk/
4https://github.com/mje-nz/natnet_ros
5https://github.com/UTNuclearRoboticsPublic/jog_arm
source “Spinning Up” implementation [18]. SAC is an off-
policy, maximum entropy method. Running off-policy allows
for the reuse of state-action transitions sampled in previous
trials, which is crucial when few interaction steps are feasi-
ble. The maximum entropy framework [19] adds an entropy
maximisation term to the RL reward function, encouraging
exploration. This exploration/exploitation relationship can be
balanced by a temperature parameter, α, where a larger α
is used to encourage more exploration, and a smaller α
corresponds to more exploitation. α acts as an important
hyperparameter for SAC, and by using the automatic entropy
tuning method introduced by Haarnoja et al. [20], the policy’s
entropy can be constrained to a desired value throughout
the learning process. This removes the need for intricate
hyperparameter tuning, allowing for a very sample-efficient
training process.
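As an illustration of the automatic temperature tuning, the following PyTorch sketch shows the commonly used α update, which constrains the policy entropy to a target value; the variable names and learning rate are illustrative rather than taken from our implementation.

# Sketch of SAC's automatic entropy temperature tuning (names and learning
# rate are illustrative). log_alpha is optimised so that the policy entropy
# stays close to a target value.
import torch

action_dim = 1                                   # the agent controls a single rotation axis
target_entropy = -float(action_dim)              # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)   # optimise log(alpha) so alpha stays positive
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob_batch: torch.Tensor) -> float:
    """One gradient step on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]."""
    alpha_loss = -(log_alpha.exp() * (log_prob_batch.detach() + target_entropy)).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return float(log_alpha.exp().item())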
To interface with the robotic setup explained in III-A, we define
an eight-dimensional state space. It consists of the position and
velocity of the ball along the x and y axes in the game tray
frame, along with the rotation angles and rotational velocities
of the game tray about its x and y axes (see Figure 2). The
human behaviour is included in the state space through the
game tray rotation around its y-axis, which mimics the
human’s tray via the tele-operation interface. The RL agent’s
action space is one-dimensional: a continuous value between
−1 and 1, which is mapped to rotational velocity commands
about the game tray’s x-axis. For every time step t, the
motion capture system calculates the position and orientation
of both the ball and the human’s tray, while the ROS
transform library gives the robot’s tray orientation. Given
the observation of the current state, st, the policy network
outputs a distribution of actions, from which an action, at,
is sampled during training. During testing, the mean of
the distribution is used, thereby removing the stochasticity
and fully exploiting the policy. The action, being a velocity
command, is executed on the robot for 200ms. Limits are set
for both the rotational velocity, and the angles of the tray to
keep the workspace safe. The resulting state st+1, reward
rt, and whether or not the state was terminal, d, are then
extracted and stored in a replay buffer of past transitions
used to update the policy. A sparse reward function is used,
penalising the agent with −1 for every time step and giving a
reward of +10 if the target is reached. This means that the
agent does not have explicit knowledge of the goal position,
and thus experiencing goal reaches is crucial to it forming a
representation of state values with respect to the goal.
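A minimal sketch of this observation and reward interface is given below; the field and function names are illustrative, not our actual implementation.

# Sketch of the observation vector and sparse reward described above
# (field and function names are illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class GameState:
    ball_x: float      # ball position in the tray frame (m)
    ball_y: float
    ball_vx: float     # ball velocity (m/s)
    ball_vy: float
    theta: float       # tray rotation about x (rad), agent-controlled axis
    phi: float         # tray rotation about y (rad), mirrors the human tray
    theta_dot: float   # rotational velocity about x (rad/s)
    phi_dot: float     # rotational velocity about y (rad/s)

    def to_vector(self) -> np.ndarray:
        return np.array([self.ball_x, self.ball_y, self.ball_vx, self.ball_vy,
                         self.theta, self.phi, self.theta_dot, self.phi_dot],
                        dtype=np.float32)

def sparse_reward(goal_reached: bool) -> float:
    """-1 for every control frame; +10 when the ball reaches the goal."""
    return 10.0 if goal_reached else -1.0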
C. Experimental Setup
We now have a foundation for applying human and agent
actions together on the robot manipulator. Each velocity
action is held constant for 200ms. We refer to this as
a single control frame. The control frame approach, along
with the loop delay, means that the human will observe a
delay in their intended action being executed. We measured
this to be a maximum of 300ms – the actual value depends on the
timing of the human action and how it aligns with the control frame
sequence. While efforts can be made to reduce this delay,
we see it as an interesting component of the system, as it
adds complications to the system dynamics from the human’s
point of view, making the interaction with an untrained RL
agent more fair.
For our experimental setup, we define each trial to consist
of 200 control frames, a total of 40 seconds. A trial ends
immediately if the ball reaches the goal, and otherwise times
out in 40 seconds – i.e. after 200 control frames have been
applied. Each trial, therefore, consists of 200 state transitions
for the RL agent, which are stored in its replay buffer, to
be used for network updates. We set the size of the agent’s
replay buffer to be 5 trials, meaning a buffer of 1000 (5×200)
state transitions. The game always starts with the ball in one
of the corners of the side of the tray opposite the goal-side,
alternating between the three corners on each trial. For trial
results, scores are defined on a linear scale with a maximum
score of 200, and one point lost for each applied control
frame – i.e. if the goal is not reached by the end of the trial,
the score will be zero.
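The trial structure and scoring can be summarised by the following short sketch; the constant and function names are illustrative.

# Sketch of the trial bookkeeping and scoring described above.
CONTROL_FRAME_S = 0.2       # each action is held for 200 ms
FRAMES_PER_TRIAL = 200      # 40 s time-out per trial
REPLAY_BUFFER_TRIALS = 5    # buffer of 5 trials = 1000 transitions

def trial_score(frames_used: int) -> int:
    """Maximum score of 200, one point lost per applied control frame.
    If the goal is not reached, all 200 frames elapse and the score is zero."""
    return FRAMES_PER_TRIAL - frames_used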
IV. EXPERIMENTS
A. Preliminary study and results
Pilot experiments of the system were conducted to evaluate
its functionality and to plan the main experiments
described in the next sub-section. To closely follow the
original application of SAC [21], training is counted in terms
of control frames (i.e. RL agent’s state transitions), each
frame followed by a single gradient update of all networks.
All agent transitions are stored in the buffer without limit.
Offline updates of the network based on the stored buffer are
also performed to accelerate learning.
Two sets of tests were performed in this format. First,
a single participant interacted with a previously untrained
agent. The training process consisted of 3,500 control frames
and 140,000 offline gradient updates. Offline updates were
distributed throughout training, running 20,000 offline updates
for every 500 control frames. After completion of each
of these offline updates, performance was tested in trial-
based format, for 10 trials, as described before, with results
reported averaged over the 10 trials, shown in Figure 3, left.
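The resulting update budget can be verified with a few lines of purely illustrative bookkeeping:

# Bookkeeping sketch of the single-participant schedule: one online gradient
# update per control frame, plus 20,000 offline updates after every 500 frames.
TOTAL_FRAMES = 3500
OFFLINE_BLOCK = 500
OFFLINE_UPDATES_PER_BLOCK = 20000

online_updates = TOTAL_FRAMES                                                  # 3,500
offline_updates = (TOTAL_FRAMES // OFFLINE_BLOCK) * OFFLINE_UPDATES_PER_BLOCK  # 140,000
print(online_updates, offline_updates)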
For the second set of tests, 10 participants trained with
a previously untrained agent in a trial-based manner. Partic-
ipants first trained for 8 trials, with all transitions recorded
in the replay buffer. The agent then undergoes 30,000 offline
gradient updates on that buffer. This is followed by a second
training set of 7 trials, with transitions added to the original
buffer, resulting in a 15-trial long buffer. Another 30,000
offline gradient updates are applied based on the new buffer.
Ten trials of testing with the agent follow, with scores
averaged and reported. Each participant was then asked to
do another ten trials, this time with a human “expert”, one
of the system’s designers who had the most experience with
the game, acting as the interaction partner for human-human
trials and controlling the axis previously under control of the
RL agent. During the human-human trials we ensured that the
two players could not see each other, and that they did not
Fig. 3: Left: Learning curve of a single human player training with
the agent, including both online and offline updates of the agent.
Tests are performed at 500-step intervals, scores averaged over ten
trials. Plot shows mean score and standard error of the mean. Right:
Results of ten participants playing the game with their trained agent,
and with a human expert. Mean score and standard error of the mean
are shown.
communicate in any way. Score results for this are averaged
over ten trials and reported in Figure 3, right.
Results from the single-subject experiment show that the
human-agent team is able to solve the interaction task within
the time provided. Furthermore, the inconsistency of the per-
formance decreases as the human-robot team learns to col-
laborate. This is possibly the effect of both the agent learning,
and human motor adaptation. In the second experiment, the
RL-agent’s ability to collaborate with humans was compared
to how humans collaborate with each other. For five of
the ten preliminary participants (S1, S4, S5, S6, S7) there
were no significant differences in performance between the
two scenarios. The remaining half of the participants exhibited
worse performance when collaborating with the agent. We
observed that the players with worse results with their
agents had also failed to reach the goal, or reached it at most
once, in the first 500 control frames of the game, which
affects the agent’s representation of the game’s goal. This
might be due to these participants being inherently worse
players at the game, and would perhaps have been resolved
with longer training.
B. Co-learning experiments
Having confirmed the feasibility of Human-in-the-Loop
learning with our system through the preliminary study
above, we move to experiments on collaborative learning.
We ran 7 participants. To accelerate learning, participants
start their training on a common pre-trained agent. The pre-
trained agent is the result of 8 trials of interactions by the
expert player from the preliminary study, followed by 30,000
offline gradient updates – total elapsed time is about 15
minutes. This is effectively half of the training done on the
agents in the preliminary study. The pre-trained agent is able
to navigate the ball towards the general direction of the goal,
with coarse movements and low precision.
Participants are completely naive to the experimental
setup. Before training starts, a description of the system is
given to the participant. They are told about the RL agent,
with a brief description of how RL works. They are also
told about what they can control, and how to do it. Each
participant is allowed to try out the interface and rotate the
tray for 40 seconds. This is without the ball on the tray, and
without the RL agent acting.
Training is done in a trial-based manner, allowing us to
observe performance results during training. The experience
replay buffer’s length is limited to 5 trials. Training consists
of 80 trials, performed in blocks of 10, with the participant
given a chance to rest briefly in between blocks. The agent’s
policy is not updated during the trial. At the end of each
trial, the agent undergoes 200 gradient updates. No offline
gradient updates are performed. Before each trial starts, the
participant is alerted by three beeps played over speakers,
and a trial’s end is similarly announced, by a single beep.
Score results and the full state space of the agent’s data are
recorded for analysis.
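The training loop can be sketched as follows; collect_trial and sac_update are placeholders for the real robot loop and the SAC gradient step, and the minibatch size is an assumed value.

# Sketch of the co-learning schedule: 80 trials, a 5-trial FIFO replay buffer,
# and 200 gradient updates applied only between trials. collect_trial() and
# sac_update() are placeholders; BATCH_SIZE is an assumed value.
import random
from collections import deque

FRAMES_PER_TRIAL = 200
BUFFER_TRIALS = 5
UPDATES_PER_TRIAL = 200
BATCH_SIZE = 256

def collect_trial():
    """Placeholder: the real system records (s, a, r, s', done) tuples from the robot."""
    return [(None, None, -1.0, None, False)] * FRAMES_PER_TRIAL

def sac_update(minibatch):
    """Placeholder for one SAC gradient step on all networks."""
    pass

buffer = deque(maxlen=BUFFER_TRIALS)            # at most 5 x 200 transitions
for trial in range(80):
    buffer.append(collect_trial())              # policy held fixed during the trial
    transitions = [t for tr in buffer for t in tr]
    for _ in range(UPDATES_PER_TRIAL):          # updates only between trials
        sac_update(random.sample(transitions, min(BATCH_SIZE, len(transitions))))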
Once 80 trials of training are complete, the participants
are tested with their own final agent, as well as four agents
trained with different players. The agents are frozen during
testing and do not learn any further. Three of the four
agents are selected from those of the preliminary study,
namely that of S1, S5 and S7 (see Figure 3, right), which
showed a performance at the same level as human-human
performance. The fourth agent is the expert player’s agent,
trained for 160 trials with online updates, followed by
256,000 offline gradient updates on the full buffer of 160
trials. Participants start testing by playing 10 trials with their
own agent, then 10 trials each with the four other models (S1,
S5, S7 and expert, randomised), and finally playing another
set of 10 trials on their own agent. They are not told that
their own agent is among the testing agents, but are rather
told that they are being tested with 6 unspecified agents.
Game score and observed data are recorded for analysis.
V. RESULTS & DISCUSSION
To interpret the results, it is important to better understand
what each of the agents the participants are tested with
represents. Agent S1 has an issue with one of the corners of
the starting side of the tray (bottom right corner of starting
side in Figure 2), causing difficulties in reaching the goal side
of the tray. Once on the goal side, S1 implements movements
in the correct direction, but too fast, making it difficult for
a human to collaborate effectively with the agent. Agent S5
does not have issues with the starting side of the tray but, like
S1, has issues with high velocity motions on the goal side.
Agent S7 does not have any such issues and is therefore
easier to collaborate with, although its motions are not as
fine-tuned as those of the expert agent. Finally, the expert
agent, having been trained on a large buffer of interactions
for a high number of iterations, implements large yet well-
controlled motions on the starting side of the tray to lead
the ball, with help from the human, to the goal side. Once
on the goal side, the expert agent performs very fine-tuned
motions around the goal, making it easier for the human
player to drop the ball into the goal, if they are capable of
applying fine motions themselves.
Figure 4 shows the results of all 7 players during testing,
when playing the game with all the above agents. We
see a divide in the participants’ results. Looking at when
Fig. 4: Boxplots of game scores of all 7 participants (P1 to P7)
playing with different agents: S1, S5, S7, expert, and their own
agent twice. The white squares indicate the mean.
participants play with their own model, particularly on round
2, we see 3 participants performing consistently well
(P1, P2 and P3). P4 and P5 have medium performance,
whereas the others (P6 and P7) perform poorly. This
divide seems to persist with some of the other agents, e.g.
when playing with the expert agent, we see that, again, P1,
P2 and P3 have more consistent performance than the others.
This can be explained by P5, P6 and P7 being generally
bad at the game – but this does not explain the results when
playing with S1. In this case, P1, P2 and P3 show very
inconsistent and mediocre performance, significantly lower
than their performance with their own agents, whereas P5,
P6 and P7 retain their average to high performance that they
showed with the expert agent, and outperform their results
with their own agents.
This result fits well with the hypothesis that co-learning
is occurring, and that personal models are important. P1, P2
and P3 have managed to develop a consistent collaborative
policy through their 80 trials, whereas this occurred to a lesser
degree for P4, even less for P5 and P6, and almost not at all
for P7. However, we can already see from the results that the
issue with P5, P6 and P7 is the agent they developed, and
not an inherent skill issue, i.e. there exist agents that improve
their game. As an example, see P7’s performance with S7,
which is on level with the highest performances achieved
by any participant with any agent. Perhaps this could have
been achieved with their own agent with longer training, or
further policy update iterations.
To further analyse this, we compare the different trained
agents, independent of the human interacting with them. To
do this, we “test-drive” our agents offline, by feeding them
state iterations, evenly distributed to cover a fair sample of
all possible state ranges for all 8 state parameters. We iterate
Fig. 5: Left: Correlations between the behaviour of all participant-
trained agents, as well as the models they tested against, S1, S5,
S7 and expert. Right: Spatial representation of trained agents’
behaviour correlation with that of the pre-model, for participants
P2, P3, P6 and P7. The goal is marked as a green circle – see Figure
2 for reference. A higher correlation in a given position means that
the final agent’s policy has changed less from the original pre-model
on which training started.
x and y with 5.5cm intervals, ẋ and ẏ with 30cm/s intervals,
tray angles along the two axes with 0.05rad intervals, and
respective angular velocities with 0.2rad/s intervals. Cross-
iterating all the state parameters, we record output actions of
the agents. This results in an output action vector of length
1,265,625 which can then be used to compare the behaviour
of different agents, through correlation analysis. We look at
how the participants’ policies, and the testing policies they
tried out relate to each other. For this we check correlations
between the different agents’ action outputs when fed the
same iterations of states as inputs. The result of this can be
seen in Figure 5-Left. Note that, among the participants,
the expert agent has the highest correlations with the agents of P1,
P2 and P3 – the same participants that have the best performance
with it. Generally, the expert and S7 agents have the highest
correlations with the participants’ agents, and they are also
the agents that get the best performance from the participants,
aside from their own agents - see Figure 4. S1 and S5 have
the lowest correlations overall with our participants’ agents,
and again this fits with the performance plots of Figure 4. The
general trend observed by looking at individual participants’
agents and how they correlate with test agents, is that the
higher a test agent is correlated with the participant’s own
agent, the better the participant’s performance will be with it.
Note that the actions of an RL agent in isolation relate
to the behaviour of the person who trained that agent when
facing other RL agents. This is an indication that our human-
in-the-loop system is leading to co-learning, creating agents
that can serve as a representation of the humans that trained
them, in terms of their skill in this game.
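The comparison can be sketched as below; the grid resolutions follow the intervals above, while the value ranges are assumptions chosen so that the grid size matches the reported vector length, and each policy is assumed to map a batch of states to its mean actions.

# Sketch of the offline "test-drive" comparison: evaluate two frozen policies
# on the same state grid and correlate their action outputs. The per-dimension
# value ranges are assumptions; with these ranges the grid has 9*9*5^6 =
# 1,265,625 points, matching the action vector length reported above.
import numpy as np

def state_grid() -> np.ndarray:
    xs = np.arange(-0.22, 0.23, 0.055)     # ball x, y (m), 5.5 cm steps
    vs = np.arange(-0.6, 0.61, 0.30)       # ball velocities (m/s), 30 cm/s steps
    angs = np.arange(-0.10, 0.11, 0.05)    # tray angles (rad), 0.05 rad steps
    avs = np.arange(-0.4, 0.41, 0.20)      # angular velocities (rad/s), 0.2 rad/s steps
    grids = np.meshgrid(xs, xs, vs, vs, angs, angs, avs, avs, indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=1).astype(np.float32)

def policy_similarity(policy_a, policy_b, states: np.ndarray) -> float:
    """Pearson correlation of the two policies' mean actions over the same states.
    Each policy is assumed to map an (N, 8) state batch to an (N,) action array."""
    return float(np.corrcoef(policy_a(states), policy_b(states))[0, 1])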
In order to show the meaning of these correlations more
intuitively, we present a spatial representation for 4 of the
participants across the spectrum. We take P2, P3, P6 and
P7. P2 and P3 show generally good performance on their
own models, the expert model and S7. P6 and P7 have poor
performance overall, though P7 plays well with S7. Figure 5-
Right shows how these four participants’ agents developed
their policies from the pre-model, in a spatial sense. The
figure depicts the game tray, with the heatmap values
Fig. 6: Success rate (reaching the goal) for all 7 participants as
they went through 80 trials of training, in 10-trial blocks, serving
as the learning curve. Mean and standard error of the mean across
all participants shown.
reflecting the correlation of the participant’s agent’s behaviour
in each position, with that of the pre-model on which they
started the training. A high correlation means that the pre-
model’s policy has been retained, whereas lower correlations
correspond to higher degrees of change in policy. The pre-
model has a good policy around the barrier, and is capable
of helping participants get to the goal side of the game. On
the goal side however, and particularly the corner closest
to the goal, it does not have the best policy: it implements
very coarse actions that are hard to coordinate with. We see
this in the four participants’ policy changes: P2
and P3 have made bigger changes to the policy near the goal
and smaller changes around the barrier, whereas P6 and P7
have done the reverse. This matches their performance
results.
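This spatial view can be computed, for example, by grouping the state grid from the previous sketch by ball position and correlating the two policies within each cell (again assuming batch-callable policies):

# Sketch of the spatial correlation map: for each (x, y) cell of the state grid,
# correlate a trained agent's actions with the pre-model's actions over all other
# state dimensions sampled at that position. Reuses state_grid() and the
# batch-policy convention from the previous sketch.
import numpy as np

def spatial_correlation_map(policy_final, policy_pre, states: np.ndarray) -> dict:
    actions_final = policy_final(states)
    actions_pre = policy_pre(states)
    xy = np.round(states[:, :2], 3)               # ball (x, y) identifies each grid cell
    heat = {}
    for cell in np.unique(xy, axis=0):
        mask = np.all(xy == cell, axis=1)
        heat[tuple(cell)] = float(np.corrcoef(actions_final[mask], actions_pre[mask])[0, 1])
    return heat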
Finally, Figure 6 shows the learning curve based on success rate,
i.e. the number of times the goal is reached within each 10-trial block.
Mean values and standard error of the mean across all 7
participants are shown. We see the beginning of a plateau
occurring towards the end of the training phase.
VI. CONCLUSIONS
We presented a real-world, human-in-the-loop, reinforce-
ment learning setup for studies on human-robot collaborative
learning. The setup consists of a non-trivial ball and maze
game, which can only be solved through effective collabora-
tion. We initially tested out the system on pilot experiments,
to confirm feasibility of real-world learning with the setup.
Tested with 1 subject over a long period of iterations we see
constant improvement and a plateau in the learning curve.
Tested with 10 participants for a shorter period, we see that
half of the participants reach human-human collaborative
performance levels.
Based on the above outcomes, we designed experiments
for investigation into human-robot co-learning. We tested 7
participants, for 80 trials, training with an RL agent with
minimal pre-training. Our results show that with a human in-
the-loop it is possible to settle on an effective collaborative
policy that leads to consistent success in the game. This is,
however, variable across participants, and highly dependent
on the particular participant’s behaviour during training with
the RL agent. We see this confirmed through analysis of
how the agents of different participants correlate with each other and
with the test agents. Effectively, we are able to relate a human
player’s performance with new agents that are not their own,
by looking at how similar the new agents’ policy is to that of
their own agent. We intend to continue experimenting with
this setup to further explore the intricacies of human-robot
collaborative learning and motor adaptation.
REFERENCES
[1] D. M. Wolpert, J. Diedrichsen, and J. R. Flanagan, “Principles of
sensorimotor learning,” Nature Reviews Neuroscience, vol. 12, no. 12,
pp. 739–751, 2011.
[2] P. Stone and M. Veloso, “Multiagent systems: A survey from a
machine learning perspective,” Autonomous Robots, vol. 8, no. 3,
pp. 345–383, 2000.
[3] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique
of multiagent deep reinforcement learning,” Autonomous Agents and
Multi-Agent Systems, vol. 33, 2019.
[4] W. B. Knox and P. Stone, “Interactively shaping agents via human
reinforcement: The TAMER framework,” K-CAP’09 - Proceedings of
the 5th International Conference on Knowledge Capture, pp. 9–16,
2009.
[5] G. Warnell, N. Waytowich, V. Lawhern, and P. Stone, “Deep TAMER:
Interactive agent shaping in high-dimensional state spaces,” in 32nd AAAI
Conference on Artificial Intelligence, AAAI 2018, pp. 1545–1553,
2018.
[6] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and
D. Amodei, “Deep reinforcement learning from human preferences,”
in Advances in Neural Information Processing Systems, pp. 4300–4308,
2017.
[7] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware
motion planning with deep reinforcement learning,” in 2017 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS),
pp. 1343–1350, IEEE, 2017.
[8] S. Reddy, A. D. Dragan, and S. Levine, “Shared autonomy via deep
reinforcement learning,” in Robotics: Science and Systems (RSS) 2018
conference, 2018.
[9] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison,
D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al., “Dota 2 with large
scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680,
2019.
[10] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion
planning with deep reinforcement learning,” in IEEE International Con-
ference on Intelligent Robots and Systems (IROS), pp. 1343–1350, 2017.
[11] J. MacGlashan, R. Loftin, M. L. Littman, D. L. Roberts, and M. E.
Taylor, “A Need for Speed: Adapting Agent Action Speed to Improve
Task Learning from Non-Expert Humans,” in Proceedings of AAMAS
2016, pp. 957–965, 2016.
[12] A. Makrigiorgos, A. Shafti, A. Harston, J. Gerard, and A. A. Faisal,
“Human visual attention prediction boosts learning & performance of
autonomous driving agents,” arXiv preprint arXiv:1909.05003, 2019.
[13] J. Foerster, R. Y. Chen, and P. Abbeel, “Learning with Opponent-
Learning Awareness,” in Proceedings of AAMAS 2018, pp. 122–130,
2018.
[14] N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, and M. Botvinick,
“Machine Theory of Mind,” in 35th International Conference on Machine
Learning (ICML 2018), pp. 6723–6738, 2018.
[15] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs,
R. Wheeler, and A. Y. Ng, “ROS: an open-source robot operating
system,” in ICRA workshop on open source software, vol. 3, p. 5,
Kobe, Japan, 2009.
[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An
imperative style, high-performance deep learning library,” in Advances
in Neural Information Processing Systems, pp. 8024–8035, 2019.
[17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic:
Off-policy maximum entropy deep reinforcement learning with a
stochastic actor,” in Proceedings of the 35th International Conference
on Machine Learning, 2018.
[18] J. Achiam, “Spinning Up in Deep Reinforcement Learning,” 2018.
[19] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum
entropy inverse reinforcement learning,” in Proceedings of the Twenty-
Third AAAI Conference on Artificial Intelligence, 2008.
[20] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine,
“Learning to walk via deep reinforcement learning,” in Robotics:
Science and Systems (RSS) 2019 conference, 2019.
[21] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan,
V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic
algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.