Replacing reward function with user feedback
Zvezdan Lončarević, Rok Pahič, Aleš Ude, Bojan Nemec, Andrej Gams
All authors are with the Department of Automatics, Biocybernetics and Robotics, Jožef Stefan Institute, and with the Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
e-mail: zvezdan.loncarevic@ijs.si
Abstract
Reinforcement learning refers to powerful algorithms for solving goal-related problems by maximizing the reward over many time steps. By incorporating them into dynamic movement primitives (DMPs), which are now widely used parametric representations in robotics, movements obtained from a single human demonstration can be adapted so that a robot learns how to execute different variations of the same task. Reinforcement learning algorithms require a carefully designed cost function, which in most cases uses additional sensors to evaluate some environment criteria.
In this paper we explore the possibility of learning robotic actions using only user feedback as a reward function. Two reward functions have been used and their results are presented and compared: user feedback and a simplified reward function. Experimental results show that for simple actions where only a terminal reward is given, these two reward functions work almost as well as a reward function based on exact measurement.
1 Introduction
As robots become a mass consumer product, they are expected to perform not only in standard industrial environments, but also in unstructured environments such as households and other real-life situations. One of the biggest barriers to a wider deployment of robots in our home environments is the lack of environment models. In most cases, a complex sensory system is required to estimate the relevant environment parameters. That is why standard ways of programming robotic movements, which need an expert programmer for every variation of the task, are not sufficient. For humanoid robots with many degrees of freedom, autonomy is one of the main unresolved issues in contemporary robotics [1]. Imitation and reinforcement learning are the two most common approaches to solving this issue. Because robot actions are mainly recorded as parametric representations with many parameters, the search space for reinforcement learning algorithms that use gradient methods (finite difference gradient, natural gradient, vanilla gradient, etc.) is large. Only recently have probabilistic algorithms such as PI2 and PoWER been developed that are able to deal with such high-dimensional search spaces [2, 3].
Figure 1: Experimental setup with Mitsubishi PA-10 robot
One of the major problems in reinforcement learning is determining the reward function. It has usually been addressed by modifying the reward function according to the teacher's behavior [4, 5, 6].
The main goal of this paper is to show that a simplified, reduced reward function can be used directly for some robot actions. An example of such an action is throwing a ball, i.e., the robot learns how to perform an accurate throw relying only on feedback from a naive user or from cheap, imprecise sensors.
The paper is organized as follows: In the next section we briefly present dynamic movement primitives, which are already a widely used method in programming by demonstration, and the probabilistic algorithm PoWER, which we used to learn new DMP parameters for the task. In Section III we introduce the reduced reward functions that were used. The next section presents the experimental setup and the results obtained in simulation and on the real robot. The paper concludes with a short outlook on the obtained results and suggestions for future work.
2 Reinforcement Learning
In this section we briefly review the basics of DMPs and the PoWER algorithm.
2.1 Dynamic Movement Primitives
The basic idea of Dynamic Movement Primitives (DMPs) [7] is to represent a trajectory with the well-known dynamical system describing a mass on a spring-damper system. For a single degree of freedom (DOF) denoted by y, in our case one of the joint or task-space coordinates, the DMP is based on the following second-order differential equation:
order differential equation:
τ2¨y=αz(βz(gy)τ˙y) + f(x),(1)
where τis the time constant and it is used for time scaling
(time in which the trajectory needs to be reproduced), αz
and βzare damping constants (βz=αz/4) that make
system critically damped and xis the phase variable. The
nonlinear term f(x)contains free parameters that enable
the robot to follow any smooth point-to-point trajectory
from the initial position y0to the final configuration g
[8]. The phase and kernel functions are given by
f(x) = PN
i=1 ψ(x)ωi
PN
i=1 ψi(x)x, (2)
ψi(x) = exp (1
2δ2
i
(xci)2),(3)
where ciare the centers of radial basis functions (ψi(x))
distributed along the trajectory and 1
2δ2
i
(also labeled as hi
[9]) their widths. Phase xmakes the forcing term f(x)
disappear when the goal is reached because it exponen-
tially converges to 0. Its dynamics are given by
x= exp (αxt/τ),(4)
where αxis a positive constant and xstarts from 1 and
converges to 0 as the goal is reached. Multiple DOFs are
realized by maintaining separate sets of (1–3), while a
single canonical system given by (4) is used to synchro-
nize them. The weight vector w, which is composed of
weights wi, defines the shape of the encoded trajectory.
Learning of the weight vector using a batch approach has
been described in [10] and [7] and it is based on solving
a system of linear equations in a least-squares sense.
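To make (1)–(4) concrete, the following is a minimal sketch (not the authors' implementation) of rolling out a single-DOF DMP with Euler integration, assuming the weights, centers and widths have already been fitted from a demonstration; the constants and function name are illustrative assumptions.

```python
import numpy as np

def integrate_dmp(w, c, h, y0, g, tau=1.0, alpha_z=48.0, alpha_x=2.0, dt=0.002):
    """Roll out a single-DOF DMP (Eqs. 1-4) with simple Euler integration.

    w, c, h : kernel weights, centers and widths (h_i = 1/(2*delta_i^2))
    y0, g   : start and goal of the movement
    The constants are illustrative assumptions, not the paper's values.
    """
    beta_z = alpha_z / 4.0                 # critical damping (beta_z = alpha_z / 4)
    y, dy = y0, 0.0
    x = 1.0                                # phase starts at 1 and decays towards 0
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-h * (x - c) ** 2)                    # kernel activations, Eq. (3)
        f = x * np.dot(psi, w) / (np.sum(psi) + 1e-10)     # forcing term, Eq. (2)
        ddy = (alpha_z * (beta_z * (g - y) - tau * dy) + f) / tau ** 2   # Eq. (1)
        dy += ddy * dt
        y += dy * dt
        x += -alpha_x * x / tau * dt       # phase dynamics, Eq. (4) in differential form
        traj.append(y)
    return np.array(traj)
```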
2.2 PoWER
Before robots can become more autonomous, they will have to be capable of autonomously learning and adjusting their control policies based on feedback from interaction with a varying environment. Policy learning is an optimization process in which we want to maximize the expected value of the cost function:
$$J(\theta) = E\Big[\sum_{k=0}^{H} \alpha_k\, r_k(\theta)\Big], \qquad (5)$$

with varying policy parameters $\theta \in \mathbb{R}^n$ [11]. In (5), $E$ is the expectation operator, $r_k$ is the reward given at time step $k$, which depends on the chosen parameters $\theta$, $H$ is the number of time steps at which a reward is given, and $\alpha_k$ are time-step-dependent weighting factors. Although there are numerous methods that can be used to optimize this function, the main problem is the high dimensionality of the parameters $\theta$ (for a DMP it is typically 20–50 per joint; we use 25 per joint). Stochastic methods such as PI2 and PoWER were recently introduced, and it was proven that PI2 and PoWER perform identically when only a terminal reward for periodic learning is available. With PoWER it is easy to also incorporate other policy parameters [1], such as the starting point ($y_0$), the ending point ($g$), and the execution time ($\tau$) of the trajectory for each DOF, and not only the DMP weights ($\mathbf{w}$). In this research the learned parameters are therefore

$$\theta = \{\mathbf{w}, g, y_0\}. \qquad (6)$$
During the learning process these parameters are updated using the rule

$$\theta_{m+1} = \theta_m + \frac{\sum_{k=1}^{L}(\theta_{i_k} - \theta_m)\, r_k}{\sum_{k=1}^{L} r_k}, \qquad (7)$$

where $\theta_{m+1}$ and $\theta_m$ are the parameters after and before the update, and $L$ is the number of parameter vectors in the importance sampler matrix, which represent the $L$ executions with the highest rewards ($r_k$). $\theta_i$ is selected using the stochastic exploration policy

$$\theta_i = \theta_m + \varepsilon_i, \qquad (8)$$

where $\varepsilon_i$ is zero-mean Gaussian noise. The variance of the noise ($\sigma^2$) is the only tuning parameter of this method. Because $\theta$ consists of three kinds of DMP parameters that cannot be searched with the same variance, three different variances have to be chosen. In general, a higher $\sigma^2$ leads to faster convergence and a lower $\sigma^2$ to more precise convergence.
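As an illustration of the update (7) and exploration (8), the sketch below implements one PoWER-style iteration with an importance sampler that keeps the L best roll-outs. The function name and the way the exploration variances are passed are illustrative assumptions, not the exact code used in the experiments.

```python
import numpy as np

def power_update(theta_m, history, sigma, L=5):
    """One PoWER-style update (Eqs. 7-8).

    theta_m : current parameter vector (DMP weights, goal g and start y0 stacked)
    history : list of (theta_i, reward) pairs collected so far
    sigma   : per-parameter exploration std (three groups: w, g, y0 in the paper)
    L       : size of the importance sampler (roll-outs with the highest rewards)
    """
    # Importance sampling: keep only the L executions with the highest rewards.
    best = sorted(history, key=lambda tr: tr[1], reverse=True)[:L]
    num = sum((theta_i - theta_m) * r for theta_i, r in best)
    den = sum(r for _, r in best) + 1e-10
    theta_new = theta_m + num / den                      # Eq. (7)

    # Stochastic exploration for the next roll-out, Eq. (8).
    eps = np.random.randn(theta_m.size) * sigma          # zero-mean Gaussian noise
    theta_explore = theta_new + eps
    return theta_new, theta_explore
```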
3 Reward function
The goal of this paper is to examine the possibility of avoiding the need for expensive and precise sensors, which can be set up and calibrated only in laboratories, and of enabling a user to train a robot to perform a variation of some action with simple instructions. For this purpose a discretized reward function is introduced. Two such reward functions are compared with each other and with the exact reward function based on the distance measured with sensors.
In the first reward function (unsigned), rewards are given on a five-star scale, where the terms {"one star", "two stars", "three stars", "four stars", "five stars"} have the corresponding rewards r = {1/5, 2/5, 3/5, 4/5, 1}. This means that the robot did not know in which direction to change its throws. A similar approach is presented in [12].
In the second reward function (signed), the robot converged to its target using feedback in which the reward was formed from five possible values: "too short" (r = 1/3), "short" (r = 2/3), "hit" (r = 1), "long" (r = 2/3), and "too long" (r = 1/3). This allowed us to always keep shots that landed on different sides of the target in the importance sampler matrix.
Both functions are thus of the same complexity, because they have only five possible rewards. Although the five-star system can discretize the outcome better, the second one should be able to compensate for its lower discretization with the new way of choosing the movements that enter the importance sampler.
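As a sketch of how such discretized feedback might be mapped to scalar rewards and side information for the importance sampler (the labels and values mirror the ones above; the dictionary layout and helper name are our illustrative assumptions):

```python
# Unsigned (five-star) feedback: magnitude only, no direction information.
UNSIGNED_REWARDS = {
    "one star": 1 / 5, "two stars": 2 / 5, "three stars": 3 / 5,
    "four stars": 4 / 5, "five stars": 1.0,
}

# Signed feedback: the same five levels, but the label also tells the learner
# on which side of the target the ball fell.
SIGNED_REWARDS = {
    "too short": (1 / 3, -1), "short": (2 / 3, -1),
    "hit": (1.0, 0), "long": (2 / 3, +1), "too long": (1 / 3, +1),
}

def signed_feedback(label):
    """Return (reward, side) for a signed rating; the side lets the learner keep
    roll-outs from both sides of the target in the importance sampler."""
    return SIGNED_REWARDS[label]
```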
4 Experimental evaluation
4.1 Experimental setup
The experiment was conducted in simulation and on a real system. Both human participants and a computer-simulated human reward system were used.
The participants used a GUI to rate the shots (i.e., give a terminal reward) until the robot managed to hit the target.
In order to evaluate the statistical parameters describing the success of learning, we also created a program that simulates human ratings with a discretized reward function and with variance on the reward borders (to simulate human uncertainty). The uncertainty was determined empirically. The simulation and this program allowed us to perform many more trials without human participants. In this way we tested our algorithm for ten different positions of a target with a diameter of 10 cm (30 trials for each position) within the possible range of the robot (between 2.4 m and 4 m, with the robot base mounted on a 1 m high stand).
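A possible form of such a simulated rater, assuming the signed labels above and a hand-tuned border uncertainty, is sketched below; the function name and threshold values are illustrative, not the ones determined empirically in the paper.

```python
import numpy as np

def simulated_signed_rating(error_m, border=0.3, border_std=0.05):
    """Simulate a human giving signed feedback from the signed miss distance.

    error_m    : signed distance between landing point and target (m); negative = short
    border     : nominal boundary between "short" and "too short" (illustrative value)
    border_std : Gaussian jitter on the boundaries, modelling human uncertainty
    """
    hit_radius = 0.05 + np.random.randn() * border_std   # noisy border of the "hit" zone
    b = border + np.random.randn() * border_std          # noisy short/long border
    if abs(error_m) <= hit_radius:
        return "hit"
    if error_m < 0:
        return "too short" if error_m < -b else "short"
    return "too long" if error_m > b else "long"
```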
Finally, the functionality of the algorithm was confirmed in the real world using the Mitsubishi PA-10 robot, which had to hit a basket with a diameter of 20 cm using a ball with a diameter of 13 cm. The experimental setup with the real robot is shown in Fig. 1.
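Putting the pieces together, a possible shape of the overall learning loop is sketched below. It reuses the hypothetical helpers from the earlier sketches (`power_update`, `simulated_signed_rating`, `SIGNED_REWARDS`) and assumes a hypothetical `execute_throw(theta)` that rolls out the DMP on the robot or in simulation and returns the signed miss distance; none of these names come from the paper.

```python
import numpy as np

def learn_throw(theta0, sigma, max_iters=120, L=5):
    """PoWER with signed (user or simulated) feedback until the target is hit."""
    theta_m = theta0.copy()
    history = []
    theta_try = theta_m + np.random.randn(theta_m.size) * sigma   # first exploration
    for it in range(max_iters):
        error = execute_throw(theta_try)          # hypothetical roll-out of the DMP throw
        label = simulated_signed_rating(error)    # or a rating taken from the GUI
        reward, _side = SIGNED_REWARDS[label]
        if label == "hit":
            return theta_try, it                  # learning finished on a hit
        history.append((theta_try, reward))
        theta_m, theta_try = power_update(theta_m, history, sigma, L)
    return theta_m, max_iters
```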
4.2 User study
In the following, a human-robot interaction (HRI) study with volunteer participants naive to the learning algorithm is described. Participants used, in randomized order, a graphical user interface (GUI) in which they could rate the success of each shot on the five-star scale (first reward function) and a GUI in which they could choose whether the shot was "too short", "short", "hit", "long", or "too long" (second reward function). They were not informed how the reward function works, only that learning was finished after they rated a shot with five stars in the first reward function system, or after they pressed "hit" in the second reward function system. The maximum number of iterations was 120. The tests with users were conducted using a simulation created in Matlab, where the robot model was built with the same measurements and limitations as the real robot.
4.3 Results
Figure 2 shows the throwing error convergence, which tends towards zero for all experiments. The top plot shows the results where human judgment was simulated in the Matlab environment and presents statistics over 300 trials, each updated over 120 iterations. The bottom plot shows the error convergence from the results with human judgment. Although the human judgment criteria differ among participants, the robot was still able to converge to its goal.
Figure 2: The top plot shows the mean error convergence with computer-simulated human judgment and the bottom one with real human judgment. The solid line denotes the results based on the real-measurement reward function, the dotted line the results for the unsigned reward function, and the dashed line the results for the signed reward function.

In Fig. 3, the rewards that were given to the executed shots are shown. The top plot shows the rewards (r) that were computer generated and the bottom one shows the rewards that were given using human judgment. The statistics for the computer-generated rewards are taken from 300 trials and for the human rewards from the 10 participants that did the experiment.
Figure 3: The top plot shows the computer-simulated reward convergence and the bottom one the real human mean reward convergence. The solid line denotes the results based on the real-measurement reward function, the dotted line the results for the unsigned reward function, and the dashed line the results for the signed reward function.
Figure 5 shows the statistics of the average (top) and the last, worst-case (bottom) iteration in which the first successful shot happened, for computer-simulated judgment (shaded bars) and for the experiment with people (white bars). Results are shown for the cases where the exact distance was measured, for the five-star reward function (unsigned discrete), and for the reward function where people judged according to the side of the basket on which the ball fell (signed discrete). With computer-simulated human judgment (shaded bars), the unsigned function works only slightly better, but with the signed one the worst-case scenario is better. Surprisingly, with human judgment, the signed reward function drastically outperformed the unsigned one. This can be explained by the fact that people tend to rate a shot in comparison to the previous one and thus unintentionally form a gradient, which is similar to what was discussed in [13].
Figure 4: Experiment on the real robot. In the first row the exact distance was measured, in the second row the unsigned reward function was used, and in the third row the signed reward function was used. The robot was throwing the ball from the right.
Figure 5: Average first shot and its deviation under a Nakagami distribution (top) and worst-case first shot (bottom): computer-simulated judgment (shaded bars), human judgment (white bars).
5 Conclusion
The results show that reinforcement learning of some simple tasks can be done with the reduced reward function almost as well as with a reward function based on exact measurement. Even so, the general problems of reinforcement learning remain: it is a difficult and time-consuming task to set an appropriate noise variance (especially if, as in this case, several parameters are varied). Note that an excessive noise variance results in jerky robot trajectories, which might even damage the robot itself.
That is why, in the future, we will test this approach in the latent space of a neural network, where there are fewer different search variances to tune, and where the network can be trained on executable shots so that too large a variance cannot lead to trajectories that differ greatly from the original, demonstrated one.
References
[1] B. Nemec, D. Forte, R. Vuga, M. Tamosiunaite, F. Worgotter, and A. Ude, "Applying statistical generalization to determine search direction for reinforcement learning of movement primitives," IEEE-RAS International Conference on Humanoid Robots, pp. 65–70, 2012.
[2] E. Theodorou, J. Buchli, and S. Schaal, "A Generalized Path Integral Control Approach to Reinforcement Learning," Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
[3] J. Kober and J. Peters, "Learning motor primitives for robotics," 2009 IEEE International Conference on Robotics and Automation, pp. 2112–2118, 2009.
[4] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," Twenty-first International Conference on Machine Learning (ICML '04), p. 1, 2004.
[5] W. Knox, C. Breazeal, and P. Stone, "Learning from feedback on actions past and intended," Proceedings of the 7th ACM/IEEE International Conference on Human-Robot Interaction, Late-Breaking Reports Session (HRI 2012), 2012.
[6] S. Griffith, K. Subramanian, and J. Scholz, "Policy Shaping: Integrating Human Feedback with Reinforcement Learning," Advances in Neural Information Processing Systems (NIPS), pp. 1–9, 2013.
[7] A. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," Proceedings 2002 IEEE International Conference on Robotics and Automation, vol. 2, pp. 1398–1403, 2002.
[8] D. Forte, A. Gams, J. Morimoto, and A. Ude, "On-line motion synthesis and adaptation using a trajectory database," Robotics and Autonomous Systems, vol. 60, no. 10, pp. 1327–1339, 2012.
[9] A. Gams, A. J. Ijspeert, S. Schaal, and J. Lenarčič, "On-line learning and modulation of periodic movements with nonlinear dynamical systems," Autonomous Robots, vol. 27, no. 1, pp. 3–23, 2009.
[10] A. Ude, A. Gams, T. Asfour, and J. Morimoto, "Task-specific generalization of discrete and periodic dynamic movement primitives," IEEE Transactions on Robotics, vol. 26, no. 5, pp. 800–815, 2010.
[11] B. Nemec, A. Gams, and A. Ude, "Velocity adaptation for self-improvement of skills learned from user demonstrations," IEEE-RAS International Conference on Humanoid Robots, pp. 423–428, 2015.
[12] A.-L. Vollmer and N. J. Hemion, "A User Study on Robot Skill Learning Without a Cost Function: Optimization of Dynamic Movement Primitives via Naive User Feedback," Frontiers in Robotics and AI, vol. 5, 2018.
[13] A. L. Thomaz and C. Breazeal, "Teachable robots: Understanding human teaching behavior to build more effective robot learners," Artificial Intelligence, vol. 172, no. 6-7, pp. 716–737, 2008.