Comparative Evaluation of Reinforcement Learning with Scalar Rewards and Linear Regression with Multidimensional Feedback
Petar Kormushev and Darwin G. Caldwell
Department of Advanced Robotics,
Istituto Italiano di Tecnologia,
Via Morego 30, 16163 Genova, Italy
{petar.kormushev,darwin.caldwell}@iit.it
Abstract. This paper presents a comparative evaluation of two learning approaches.
The first approach is a conventional reinforcement learning algorithm for direct
policy search which uses scalar rewards by definition. The second approach is a
custom linear regression based algorithm that uses multidimensional feedback
instead of a scalar reward. The two approaches are evaluated in simulation on a
common benchmark problem: an aiming task where the goal is to learn the optimal
parameters for aiming that result in hitting as close as possible to a given target.
The comparative evaluation shows that the multidimensional feedback provides a
significant advantage over the scalar reward, resulting in an order-of-magnitude
speed-up of the convergence. A real-world experiment with a humanoid robot
confirms the results from the simulation and highlights the importance of
multidimensional feedback for fast learning.
Keywords: reinforcement learning, multidimensional feedback, linear regression
1 Introduction
A well-established tradition in Reinforcement Learning (RL) is to use a single
scalar reward as a feedback signal [1]. Almost all existing RL algorithms and
techniques rely on the assumption that trials are evaluated with a single numeric
value. This paper aims to challenge the established tradition by showing that
a state-of-the-art RL algorithm for direct policy search is easily outperformed
by a simple linear regression algorithm that relies on multidimensional feedback
instead of a simple scalar reward.
The paper presents a comparative evaluation of the following two learning
approaches. The first approach is a conventional reinforcement learning algorithm
for direct policy search which uses scalar rewards by definition. As a representative
of this approach we use a state-of-the-art Expectation-Maximization based
RL algorithm described in Section 2.
The second approach in this comparison is a custom linear regression based
algorithm that we have proposed which uses multidimensional feedback instead
of a scalar reward. The algorithm is based on iterative vector regression with
shrinking support region as explained in Section 3.2.
The two approaches are evaluated in simulation on a common benchmark
problem that we have proposed. It is an aiming task where the goal is to learn the
optimal parameters for aiming that result in hitting as close as possible to a given
target. This problem was chosen because of its simplicity and ease of defining the
feedback signal. For example, the natural measure for the performance of a trial
in this task is the distance between the given target and the trial hit location.
The comparative evaluation between the two approaches shows that the
multidimensional feedback provides a significant advantage over the scalar reward,
resulting in an order-of-magnitude speed-up of the convergence. This is demonstrated
in Section 4.1. We have also conducted a real-world experiment with a
humanoid robot that confirms the results from the simulation and highlights the
importance of multidimensional feedback for fast learning.
The following section gives a brief overview of RL algorithms for direct policy
search, in order to position the comparative evaluation in the context of the
existing RL literature.
2 Reinforcement Learning for Direct Policy Search
In conventional RL, the goal is to find a policy π that maximizes the expected
future return, calculated based on a scalar reward function R : I → ℝ, where I
is an input space that depends on the problem. The input space can be defined
in different ways, e.g. it could be a state s, or a state transition, or a state-action
pair, or a whole trial as in the case of episodic RL, etc. The policy π determines
what actions will be performed by the RL agent.
Very often, the RL problem is formulated in terms of a Markov Decision
Process (MDP) or Partially Observable MDP (POMDP). In this formulation,
the policy π is viewed as a direct mapping function (π : s ↦ a) from state s ∈ S
to action a ∈ A. Alternatively, instead of trying to learn the explicit mapping
from states to actions, it is possible to perform direct policy search, as shown
in [2]. In this case, the policy π is considered to depend on some parameters
θ ∈ ℝ^N, and is written as a parameterized function π(θ). The episodic reward
function becomes R(τ(π(θ))), where τ is a trial performed by following the policy.
The reward can be abbreviated as R(τ(θ)) or even as R(θ), which reflects the
idea that the behaviour of the RL agent can be influenced by only changing
the values of the policy parameters θ. Therefore, the outcome of the behaviour,
which is represented by the reward R(θ), can be optimized by only optimizing the
values θ. This way, the RL problem is transformed into a black-box optimization
problem with cost function R(θ), as shown in [3].
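As a concrete (purely illustrative) reading of this black-box view, the sketch below treats the episodic return R(θ) as an opaque function and improves the policy parameters by naive perturb-and-keep random search. The placeholder return function and all constants are assumptions, not part of the paper.

```python
import numpy as np

def episodic_return(theta):
    """Placeholder for R(theta): run one trial with policy parameters theta
    and return its scalar reward (here a made-up smooth function)."""
    assumed_optimum = np.array([0.35, -0.2, 0.8])   # purely hypothetical
    return float(np.exp(-np.linalg.norm(theta - assumed_optimum)))

def random_search(n_trials=200, dim=3, noise=0.1, seed=0):
    """Naive black-box optimization of R(theta): perturb the current
    parameters and keep the candidate whenever its return improves."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)
    best = episodic_return(theta)
    for _ in range(n_trials):
        candidate = theta + noise * rng.normal(size=dim)   # exploration
        r = episodic_return(candidate)
        if r > best:
            theta, best = candidate, r
    return theta, best

theta_star, r_star = random_search()
print("best return found:", r_star)
```

Actual direct policy search algorithms, such as those listed below and in Section 3, replace this naive search with far more sample-efficient update rules.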
The following is a non-exhaustive list of state-of-the-art direct policy search
RL approaches:
– Policy Gradient based RL - in which the RL algorithm is trying to
  estimate the gradient of the policy with respect to the policy parameters,
  and to perform gradient descent in policy space. The Episodic Natural
  Actor-Critic (eNAC), in [4], and Episodic REINFORCE, in [5], are two of the
  well-established approaches of this type.
– Expectation-Maximization based RL - in which the EM algorithm is
  used to derive an update rule for the policy parameters at each step, trying
  to maximize a lower bound on the expected return of the policy. A state-of-the-art
  RL algorithm of this type is PoWER (Policy learning by Weighting
  Exploration with the Returns), in [6], as well as its generalization MCEM,
  in [7].
– Path Integral based RL - in which the learning of the policy parameters
  is based on the framework of stochastic optimal control with path integrals.
  A state-of-the-art RL algorithm of this type is PI² (Policy Improvement
  with Path Integrals), in [8].
– Regression based RL - in which regression is used to calculate updates
  to the RL policy parameters using the rewards as weights. One of the
  approaches that are used extensively is LWPR (Locally Weighted Projection
  Regression), in [9].
– Model-based policy search RL - in which a model of the transition
  dynamics is learned and used for long-term planning. Policy gradients are
  computed analytically for policy improvement using approximate inference.
  A state-of-the-art RL algorithm of this type is PILCO (Probabilistic Inference
  for Learning COntrol), in [10].
Direct policy search is one of the preferred methods when applying reinforcement
learning in robotics [11]. This is due to the robots’ inherently high-dimensional
continuous action spaces for which direct policy search tends to scale up better
than MDP-based methods.
3 Two Learning Approaches for Comparative Evaluation
In this section we present the two different learning algorithms that are evaluated
on the common aiming task.
3.1 Learning algorithm 1: PoWER
As a first approach for learning the aiming task, we use the state-of-the-art EM-
based RL algorithm PoWER by Kober et al. [6]. We selected the PoWER algorithm
because it does not need a learning rate (unlike policy-gradient methods) and
also because it can be combined with importance sampling to make better use
of the previous experience of the agent in the estimation of new exploratory
parameters. Moreover, PoWER has demonstrated superior performance in tasks
learned directly on real robots, such as the ball-in-a-cup task [12] and the pancake
flipping task [13].
PoWER uses a parameterized policy and tries to find values for the parameters
which maximize the expected return of rollouts (also called trials) under the
corresponding policy. For the aiming task the policy parameters are represented
by the elements of a 3D vector corresponding to the aiming direction and initial
velocity of the projectile.
We define the return of a shooting trial τ to be:

$$R(\tau) = e^{-\|\hat{r}_T - \hat{r}_A\|}, \qquad (1)$$

where r̂_T is the estimated 2D position of the center of the target on the target's
plane, r̂_A is the estimated 2D position of the projectile, and ‖·‖ is the Euclidean
distance.
As an instance of the EM algorithm, PoWER estimates the policy parameters
θ to maximize a lower bound on the expected return from following the policy.
The policy parameters θ_n at the current iteration n are updated to produce the
new parameters θ_{n+1} using the following rule (as described in [12]):

$$\theta_{n+1} = \theta_n + \frac{\left\langle (\theta_k - \theta_n)\, R(\tau_k) \right\rangle_{w(\tau_k)}}{\left\langle R(\tau_k) \right\rangle_{w(\tau_k)}}. \qquad (2)$$
In Eq. (2), (θ_k − θ_n) = Δθ_{k,n} is a vector difference which gives the relative
exploration between the policy parameters used on the k-th trial and the current
ones. Each relative exploration Δθ_{k,n} is weighted by the corresponding return
R(τ_k) of trial τ_k and the result is normalized using the sum of the same returns.
Intuitively, this update rule can be thought of as a weighted sum of parameter
vectors where higher weight is given to those vectors which result in better
returns.
In order to minimize the number of trials which are needed to estimate new
policy parameters, we use a form of importance sampling technique adapted
for RL [1][6] and denoted by ⟨·⟩_{w(τ_k)} in Eq. (2). It allows the RL algorithm to
re-use previous trials τ_k and their corresponding policy parameters θ_k during
the estimation of the new policy parameters θ_{n+1}. The importance sampler is
defined as:

$$\left\langle f(\theta_k, \tau_k) \right\rangle_{w(\tau_k)} = \sum_{k=1}^{\sigma} f\!\left(\theta_{\mathrm{ind}(k)}, \tau_{\mathrm{ind}(k)}\right), \qquad (3)$$
where σ is a fixed parameter denoting how many trials the importance sampler
is to use, and ind(k) is an index function which returns the index of the k-th
best trial in the list of all past trials sorted by their corresponding returns, i.e.
for k = 1 we have:

$$\mathrm{ind}(1) = \operatorname*{arg\,max}_{i} R(\tau_i), \qquad (4)$$

and the following holds: R(τ_{ind(1)}) ≥ R(τ_{ind(2)}) ≥ … ≥ R(τ_{ind(σ)}). The
importance sampler allows the RL algorithm to calculate new policy parameters
using the top-σ best trials so far. This reduces the number of required trials to
converge and makes this RL algorithm applicable to online learning.
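As an illustration only (not code from the paper), the following Python sketch mirrors the update of Eqs. (2)-(4): the top-σ past trials by return are selected by the importance sampler, and their exploration offsets relative to the current parameters are averaged with the returns as weights. The rollout function `shoot`, the assumed optimum, and the noise magnitudes are placeholders; in the experiments a rollout corresponds to executing one aiming trial and computing its return via Eq. (1).

```python
import numpy as np

def power_update(theta_n, past_params, past_returns, sigma=5):
    """PoWER-style parameter update, Eqs. (2)-(4): importance-sample the
    top-sigma trials by return, then take the return-weighted average of
    their exploration offsets (theta_k - theta_n)."""
    theta_n = np.asarray(theta_n, dtype=float)
    returns = np.asarray(past_returns, dtype=float)
    params = np.asarray(past_params, dtype=float)   # shape: (num_trials, dim)
    top = np.argsort(returns)[::-1][:sigma]         # ind(1), ..., ind(sigma)
    num = np.zeros_like(theta_n)
    den = 0.0
    for k in top:
        num += (params[k] - theta_n) * returns[k]   # (theta_k - theta_n) R(tau_k)
        den += returns[k]
    return theta_n + num / den                      # Eq. (2)

def shoot(theta, rng):
    """Placeholder rollout: returns a noisy 2D hit offset from the target center."""
    assumed_optimum = np.array([0.2, -0.1, 0.9])    # purely hypothetical
    miss = theta - assumed_optimum
    return np.array([miss[0] + miss[2], miss[1]]) + 0.01 * rng.normal(size=2)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                          # initial 3D aiming parameters
params, returns = [], []
for trial in range(30):
    hit_offset = shoot(theta, rng)
    params.append(theta.copy())
    returns.append(float(np.exp(-np.linalg.norm(hit_offset))))   # Eq. (1)
    mean = power_update(theta, params, returns)
    theta = mean + 0.05 * rng.normal(size=3)        # exploratory noise around the new mean
print("best return after 30 trials:", max(returns))
```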
3.2 Learning algorithm 2: ARCHER
Fig. 1. The conceptual idea underlying the linear regression based algorithm
(ARCHER). The goal is to find the optimal parameters from the 3D parameter space
that result in hitting the given target.
For a second learning approach we propose to use a custom algorithm
developed and optimized specifically for problems like the aiming task, which have
a smooth solution space and prior knowledge about the goal to be achieved.
We will refer to it as the ARCHER algorithm (Augmented Reward CHainEd
Regression). The motivation for ARCHER is to make use of richer feedback
information about the result of a trial. Such information is ignored by the PoWER
RL algorithm because it uses scalar feedback which only depends on the distance
to the target's center. ARCHER, on the other hand, is designed to use the prior
knowledge we have on the optimum reward possible. In this case, we know that
hitting the center corresponds to the maximum reward we can get. Using this
prior information about the task, we can view the position of the projectile as
an augmented reward. In this case, it consists of a 2-dimensional vector giving
the horizontal and vertical displacement of the projectile with respect to the
target's center. This information is obtained either directly from the simulated
experiment in Section 4.1 or calculated by an image processing algorithm for the
real-world experiment. Then, ARCHER uses a chained local regression process
that iteratively estimates new policy parameters which have a greater probability
of leading to the achievement of the goal of the task, based on the experience
so far.
Each trial τ_i, where i ∈ {1, . . . , N}, is initiated by input parameters θ_i ∈ ℝ³,
which is the vector describing the relative position of the hands and is
produced by the learning algorithms. Each trial has an associated observed result
(considered as a 2-dimensional reward) r_i = f(θ_i) ∈ ℝ², which is the relative
position of the projectile with respect to the target's center r_T = (0, 0)^T. The
unknown function f is considered to be non-linear due to air friction, wind flow,
etc. A schematic figure illustrating the idea of the ARCHER algorithm is
shown in Fig. 1.
Without loss of generality, we assume that the trials are sorted in descending
order by their scalar return calculated by Eq. (1), i.e. R(τ_i) ≥ R(τ_{i+1}), i.e. that
r_1 is the closest to r_T. For convenience, we define vectors r_{i,j} = r_j − r_i and
θ_{i,j} = θ_j − θ_i. Then, we represent the vector r_{1,T} as a linear combination of
vectors using the N best results:

$$r_{1,T} = \sum_{i=1}^{N-1} w_i\, r_{1,i+1}. \qquad (5)$$
Under the assumption that the original parameter space can be linearly
approximated in a small neighborhood, the calculated weights w_i are transferred
back to the original parameter space. Then, the unknown vector to the goal
parameter value θ_{1,T} is approximated with θ̂_{1,T} as a linear combination of the
corresponding parameter vectors using the same weights:

$$\hat{\theta}_{1,T} = \sum_{i=1}^{N-1} w_i\, \theta_{1,i+1}. \qquad (6)$$
In matrix form, we have r_{1,T} = W U, where W contains the weights {w_i}_{i=2}^N,
and U contains the collected vectors {r_{1,i}}_{i=2}^N from the observed rewards of N
trials. The least-norm approximation of the weights is given by Ŵ = r_{1,T} U^†,
where U^† is the pseudoinverse of U.¹ By repeating this regression process when
adding a new couple {θ_i, r_i} to the dataset at each iteration, the algorithm refines
the solution by selecting at each iteration the N closest points to r_T. ARCHER
can thus be viewed as a linear vector regression with a shrinking support region.
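To make the chained regression concrete, here is a minimal Python sketch of one ARCHER iteration under the definitions above (N best trials, least-norm weights via the pseudoinverse). The function name and array layout are ours, and the generation of the first few purely exploratory trials, before enough data exists for the regression, is omitted.

```python
import numpy as np

def archer_update(thetas, hits, target=np.zeros(2), n_best=3):
    """One ARCHER regression step, Eqs. (5)-(6): express the offset from the
    best hit to the target as a linear combination of observed offset vectors,
    then apply the same weights to the corresponding parameter offsets.
    Requires at least n_best past trials."""
    thetas = np.asarray(thetas, dtype=float)      # (num_trials, 3) parameters theta_i
    hits = np.asarray(hits, dtype=float)          # (num_trials, 2) 2D rewards r_i
    # Shrinking support region: keep only the n_best trials closest to the target.
    order = np.argsort(np.linalg.norm(hits - target, axis=1))[:n_best]
    th, r = thetas[order], hits[order]
    U = r[1:] - r[0]                  # rows r_{1,i+1} = r_{i+1} - r_1
    V = th[1:] - th[0]                # rows theta_{1,i+1}
    r_1T = target - r[0]              # r_{1,T} = r_T - r_1
    w = r_1T @ np.linalg.pinv(U)      # least-norm weights, W-hat = r_{1,T} U^+
    return th[0] + w @ V              # theta_1 + theta-hat_{1,T}: new parameters
```

In a learning session this step is repeated after every new couple {θ_i, r_i} is added, so the support region shrinks as closer hits accumulate.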
In order to find the optimal value for N (the number of samples to use for
the regression), we have to consider both the observation errors and the function
approximation error. The observation errors are defined by ε_θ = ‖θ̃ − θ‖ and
ε_r = ‖r̃ − r‖, where θ̃ and r̃ are the real values, and θ and r are the observed
values. The function approximation error caused by non-linearities is defined by
ε_f = ‖f − A‖, where A is the linear approximation of f.
On the one hand, if the observations are very noisy (ε_r ≫ ε_f and ε_θ ≫ ε_f), it
is better to use bigger values for N, in order to reduce the error when estimating
the parameters w_i. On the other hand, for highly non-linear functions f (ε_f ≫ ε_r
and ε_f ≫ ε_θ), it is better to use smaller values for N, i.e. to use a small subset of
points which are closest to r_T in order to minimize the function approximation
error ε_f. For the experiments presented in this paper we used N = 3 in both
the simulation and the real-world, because the observation errors were kept very
small in both cases.

¹ In this case, we used a least-squares estimate. For more complex solution spaces,
ridge regression or another regularization scheme can be considered.
The ARCHER algorithm can also be used for other tasks, provided that:
(1) a-priori knowledge about the desired target reward is known; (2) the reward
can be decomposed into separate dimensions; (3) the task has a smooth solution
space.
4 Comparative Evaluation
4.1 Simulation Experiment
The two proposed learning algorithms (PoWER and ARCHER) are evaluated
and compared in a simulation experiment. The aiming task in this case is an
archery-based task, where the goal is specified as the center of the archery target.
Even though the archery task is hard to model explicitly (e.g., due to the
unknown parameters of the bow and arrow used), the trajectory of the arrow
can be modeled as a simple ballistic trajectory, ignoring air friction, wind
velocity, etc. A typical experimental result for each algorithm is shown in Fig. 2.
In both simulations, the same initial parameters are used. The simulation is
terminated when the arrow hits inside the innermost ring of the target, i.e. the
distance to the center becomes less than 5 cm.
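The benchmark itself is easy to reproduce. The following sketch is ours, with assumed constants: a drag-free trajectory, the 2.2 m target distance quoted for the real-world setup, and an (azimuth, elevation, speed) parameterization of the 3D aiming vector. It computes the 2D offset of the hit point on a vertical target plane; this offset is the multidimensional feedback consumed by ARCHER, and its norm gives the scalar reward of Eq. (1) used by PoWER.

```python
import numpy as np

G = 9.81                  # gravity (m/s^2); air friction and wind are ignored
TARGET_DIST = 2.2         # assumed distance to the vertical target plane (m)

def simulate_shot(theta):
    """theta = (azimuth, elevation, speed): assumed 3D aiming parameters.
    Returns the (horizontal, vertical) hit offset from the target center,
    or None if the arrow never reaches the target plane."""
    az, el, v = theta
    vx = v * np.cos(el) * np.cos(az)      # velocity component toward the target
    vy = v * np.cos(el) * np.sin(az)      # sideways component
    vz = v * np.sin(el)                   # upward component
    if vx <= 0.0:
        return None
    t = TARGET_DIST / vx                  # time of flight to the target plane
    y = vy * t                            # horizontal offset at impact
    z = vz * t - 0.5 * G * t ** 2         # vertical offset at impact
    return np.array([y, z])

def scalar_return(hit_offset):
    """Scalar reward of Eq. (1): exp(-distance to the target center)."""
    return float(np.exp(-np.linalg.norm(hit_offset)))
```

A learning session then simply alternates `simulate_shot` with either the PoWER update (which sees only `scalar_return`) or the ARCHER update (which sees the full 2D offset), terminating once the offset norm drops below 5 cm.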
For a statistically significant observation, the same experiment was repeated
40 times with a fixed number of trials (60) in each session. The averaged
experimental result is shown in Fig. 3. The ARCHER algorithm clearly outperforms
the PoWER algorithm for the archery task. This is due to the use of 2D feedback
information which allows ARCHER to make better estimations/predictions of
good parameter values, and to the prior knowledge concerning the maximum
reward that can be achieved. PoWER, on the other hand, achieves reasonable
performance despite using only 1D feedback information.
Based on the results from the simulated experiment, the ARCHER algorithm
was chosen to conduct the following real-world experiment.
4.2 Robot Experiment
The real-world robot experimental setup is shown in Fig. 4. The experiment was
conducted using the iCub humanoid robot [14].
In the experiment, we used the torso, arms, and hands of the robot. The
torso has 3 DOF (yaw, pitch, and roll). Each arm has 7 DOF: three in the shoulder,
one in the elbow, and three in the wrist. Each hand of the iCub has 5 fingers and
19 joints, although with only 9 drive motors several of these joints are coupled.
We manually set the orientation of the neck, eyes and torso of the robot to
turn it towards the target. The finger positions of both hands were also set
manually to allow the robot to grip the bow and release the string suitably. We
used one joint in the index finger to release the string. It was not possible to
use two fingers simultaneously to release the string because of difficulties with
synchronizing their motion. The posture of the left arm (bow side) was controlled
by the proposed system, as well as the orientation of the right arm (string side).
The position of the right hand was kept within a small area, because the limited
range of motion of the elbow joint did not permit pulling the string close to the
torso.

Fig. 2. Simulation of the archery task. Learning is performed under the same starting
conditions with two different algorithms. The red trajectory is the final trial. (a) PoWER
algorithm needs 19 trials to reach the center. (b) ARCHER algorithm needs 5 trials to
do the same.

Fig. 3. Comparison of the speed of convergence for the PoWER and ARCHER
algorithms. Statistics are collected from 40 learning sessions with 60 trials in each
session. The first 3 trials of ARCHER are done with large random exploratory noise,
which explains the big variance at the beginning.
The real-world experiment was conducted using the proposed ARCHER
algorithm. The 2-dimensional reward needed by ARCHER after each trial was
estimated by tracking the target and the arrow in the camera image.
For the learning part, the number of trials until convergence in the real
world is higher than the numbers in the simulated experiment. This is caused by
the high level of noise (e.g. physical bow variability, measurement uncertainties,
robot control errors, etc.). Fig. 5 visualizes the results of a learning session
performed with the real robot. In this session, the ARCHER algorithm needed
10 trials to converge to the center.
Another point for comparison is that with an RL algorithm it is possible to
incorporate a bias/preference in the reward. For ARCHER, a similar effect could
be achieved using a regularizer in the regression.
Fig. 4. Real-world experiment using the iCub humanoid robot. The distance between
the target and the robot is 2.2 meters. The diameter of the target is 50 cm.

5 Conclusion

We have presented a comparative evaluation of two learning approaches: a
state-of-the-art RL algorithm for direct policy search (PoWER) which uses scalar
rewards, and a custom linear regression based algorithm (ARCHER) that uses
multidimensional feedback instead of a scalar reward. The comparative evaluation
indicates that the multidimensional feedback provides a significant advantage
over the scalar reward, resulting in an order-of-magnitude speed-up of the
convergence. The real-world experiment with the iCub humanoid robot confirms
these results.
Acknowledgments. This work is partially supported by the EU-funded project
PANDORA: "Persistent Autonomy through learNing, aDaptation, Observation
and ReplAnning", under contract FP7-ICT-288273.
References
1. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. Adaptive
computation and machine learning. MIT Press, Cambridge, MA, USA (1998)
2. Rosenstein, M., Barto, A.: Robot weightlifting by direct policy search. In: Inter-
national Joint Conference on Artificial Intelligence. Volume 17., Citeseer (2001)
839–846
3. Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., Schmidhuber, J.:
Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral
Robotics 1(1) (2010) 14–24
4. Peters, J., Schaal, S.: Natural actor-critic. Neurocomput. 71(7-9) (2008) 1180–1190
5. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Mach. Learn. 8(3-4) (1992) 229–256
Fig. 5. Results from a real-world experiment with the iCub robot. The arrow
trajectories are depicted as straight dashed lines, because we do not record the actual
trajectories from the real-world experiment, only the final position of the arrow on the
target's plane. In this session the ARCHER algorithm needed 10 trials to converge to
the innermost ring of the target.
6. Kober, J., Peters, J.: Learning motor primitives for robotics. In: Proc. IEEE Intl
Conf. on Robotics and Automation (ICRA). (May 2009) 2112–2118
7. Vlassis, N., Toussaint, M., Kontes, G., Piperidis, S.: Learning model-free robot
control by a Monte Carlo EM algorithm. Autonomous Robots 27(2) (2009) 123–
130
8. Theodorou, E., Buchli, J., Schaal, S.: A Generalized Path Integral Control Ap-
proach to Reinforcement Learning. The Journal of Machine Learning Research 11
(December 2010) 3137–3181
9. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) al-
gorithm for incremental real time learning in high dimensional spaces. In: Proc.
Intl Conf. on Machine Learning (ICML), Haifa, Israel (2000) 288–293
10. Deisenroth, M.P., Rasmussen, C.E.: PILCO: A Model-Based and Data-Efficient
Approach to Policy Search. In Getoor, L., Scheffer, T., eds.: Proceedings of the
28th International Conference on Machine Learning, Bellevue, WA, USA (June
2011)
11. Kormushev, P., Calinon, S., Caldwell, D.G.: Reinforcement learning in robotics:
Applications and real-world challenges. Robotics 2(3) (2013) 122–148
12. Kober, J.: Reinforcement learning for motor primitives. Master’s thesis, University
of Stuttgart, Germany (August 2008)
13. Kormushev, P., Calinon, S., Caldwell, D.G.: Robot motor skill coordination with
EM-based reinforcement learning. In: Proc. IEEE/RSJ Intl Conf. on Intelligent
Robots and Systems (IROS), Taipei, Taiwan (October 2010) 3232–3237
14. Tsagarakis, N.G., Metta, G., Sandini, G., Vernon, D., Beira, R., Becchi, F.,
Righetti, L., Santos-Victor, J., Ijspeert, A.J., Carrozza, M.C., Caldwell, D.G.:
iCub: The design and realization of an open humanoid platform for cognitive and
neuroscience research. Advanced Robotics 21(10) (2007) 1151–1175