Learning Continuous State/Action Models for Humanoid Robots
Astrid Jackson and Gita Sukthankar
Department of Computer Science
University of Central Florida
Orlando, FL, U.S.A.
{ajackson, gitars}
Abstract

Reinforcement learning (RL) is a popular choice for solving robotic control problems. However, applying RL techniques to controlling humanoid robots with high degrees of freedom remains problematic due to the difficulty of acquiring sufficient training data. The problem is compounded by the fact that most real-world problems involve continuous states and actions. In order for RL to be scalable to these situations it is crucial that the algorithm be sample efficient. Model-based methods tend to be more data efficient than model-free approaches and have the added advantage that a single model can generalize to multiple control problems. This paper proposes a model approximation algorithm for continuous states and actions that integrates case-based reasoning (CBR) and Hidden Markov Models (HMM) to generalize from a small set of state instances. The paper demonstrates that the performance of the learned model is close to that of the system dynamics it approximates, where performance is measured in terms of sampling error.
1 Introduction
In recent years, reinforcement learning (RL) has emerged
as a popular choice for solving complex sequential decision
making problems in stochastic environments. Value learning
approaches attempt to compute the optimal value function,
which represents the cumulative reward achievable from a
given state when following the optimal policy (Sutton and
Barto 1998). In many real-world robotics domains acquir-
ing the experiences necessary for learning can be costly and
time intensive. Sample complexity, which is defined as the
number of actions it takes to learn an optimal policy, is a ma-
jor concern. Kuvayev and Sutton (1996) have shown signifi-
cant performance improvements in cases where a model was
learned online and used to supplement experiences gathered
during the experiment. Once the robot has an accurate model
of the system dynamics, value iteration can be performed on
the learned model, without requiring further training exam-
ples. Hester and Stone (2012) also demonstrated how the
model can be used to plan multi-step exploration trajecto-
ries where the agent is guided to explore high uncertainty
regions of the model.
Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.
(a) (b)
Figure 1: Approximation of the transition probability den-
sity function for the joint motion of a humanoid robot. (a)
The model is learned from the trajectory of the robot’s left
wrist. (b) The continuous state space is represented by the
x-y-z position of the robot’s end effectors; the continuous
action space is composed of the forces on these effectors in
each direction.
Many RL techniques also implicitly model the system as a Markov Decision Process (MDP). A standard MDP assumes that the set of states S and actions A is finite and known; in motion planning problems, the one-step transition probabilities P(s_i ∈ S | s_{i−1} ∈ S, a_{i−1} ∈ A) are known, whereas in learning problems the transition probabilities are unknown. Many practical problems, however, involve continuous-valued states and actions. In this scenario it is impossible to enumerate all states and actions, and equally difficult to identify the full set of transition probabilities. Discretization has been a viable option for dealing with this issue (Hester and Stone 2009; Diuk, Li, and Leffler 2009). Nonetheless, fine discretization leads to an intractably large state space which cannot be accurately learned without a large number of training examples, whereas coarse discretization runs the risk of losing information essential for distinguishing states.
In this paper we develop a model learner suitable for con-
tinuous problems. It integrates case-based reasoning (CBR)
and Hidden Markov Models (HMM) to approximate the suc-
cessor state distribution from a finite set of experienced in-
stances. It thus accounts for the infinitely many successor
states ensuing from actions drawn from an infinite set. We
apply the method to a humanoid robot, the Aldebaran Nao,
to approximate the transition probability density function for
a sequence of joint motions (Figure 1). The results show that
the model learner is capable of closely emulating the envi-
ronment it represents.
In the next section we provide formal definitions relevant
to the paper. In Section 3 we give a detailed description of
our Continuous Action and State Model Learner (CASML)
algorithm. CASML is an extension of CASSL (Continu-
ous Action and State Space Learner) (Molineaux, Aha, and
Moore 2008) designed for learning the system dynamics of a
humanoid robot. CASML distinguishes itself from CASSL
in the way actions are selected. While CASSL selects ac-
tions based on quadratic regression methods, CASML at-
tempts to estimate the state transition model utilizing a Hid-
den Markov Model (HMM). Section 4 describes our exper-
iments, and in Section 5 we analyze the performance of our
model approximation algorithm. Section 6 summarizes how
our method relates to other work in the literature. We con-
clude in Section 7 with a summary of our findings and future work.
2 Background
2.1 Markov Decision Process
A Markov Decision Process (MDP) is a mathematical
framework modeling decision making in stochastic environ-
ments (Kolobov 2012). Typically it is represented as a tuple
(S, A, T, γ, D, R), where
- S is a finite set of states.
- A is a finite set of actions.
- T = {P_sa} is the set of transition probabilities, where P_sa is the transition distribution, i.e. the distribution over possible successor states when taking action a ∈ A in state s ∈ S.
- γ ∈ (0, 1] is the discount factor, which models the importance of future rewards.
- D : S → [0, 1] is an initial-state distribution, from which the initial state s_0 is drawn.
- R : S × A → ℝ is a reward function specifying the immediate numeric reward value for executing a ∈ A in s ∈ S.
2.2 Case Based Reasoning
Case-based reasoning (CBR) is a problem solving technique
that leverages solutions from previously experienced prob-
lems (cases). By remembering previously solved cases,
solutions can be suggested for novel but similar problems
(Aamodt and Plaza 1994). A CBR cycle consists of four steps:
- Retrieve the most similar case(s), i.e. the cases most relevant to the current problem.
- Reuse the case(s) to adapt their solutions to the current problem.
- Revise the proposed solution if necessary (this step is similar to supervised learning in which the solution is examined by an oracle).
- Retain the new solution as part of a new case.

(a) (b)
(c) (d)
Figure 2: Process of extracting the successor states by utilizing the transition case base. (a) Retrieve similar states (s_1, ..., s_n) via kNN. (b) Determine the actions previously performed in these states. (c) Reuse by identifying similar actions (a_1, ..., a_k) using cosine similarity. (d) Select s_{i+1} as possible successor states.
A distinctive aspect of CBR systems is that no attempt is
made to generalize the cases they learn. Generalization is
only performed during the reuse stage.
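The cycle can be sketched for a toy numeric prediction task; note that generalization happens only in the reuse step (here, averaging neighbor solutions), while retained cases are stored verbatim. The helper names are hypothetical:

```python
import math

case_base = []  # each case: (problem_features, solution)

def retrieve(query, k=1):
    """Retrieve the k cases closest to the query (Euclidean distance)."""
    return sorted(case_base, key=lambda c: math.dist(c[0], query))[:k]

def reuse(neighbors):
    """Adapt by averaging neighbor solutions; this is the only generalization step."""
    return sum(solution for _, solution in neighbors) / len(neighbors)

def retain(problem, solution):
    """Store the new case verbatim; the cases themselves are never generalized."""
    case_base.append((problem, solution))

retain((0.0, 0.0), 1.0)
retain((1.0, 0.0), 3.0)
pred = reuse(retrieve((0.9, 0.0), k=1))  # nearest case is (1.0, 0.0) -> 3.0
```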
3 Method
This section describes Continuous Action and State Model
Learner (CASML)1, an instance-based model approxima-
tion algorithm for continuous domains. CASML integrates
a case-based reasoner and a Gaussian HMM and returns the
successor state probability distribution. At each time step, the model learner receives as input an experience of the form ⟨s_{i−1}, a_{i−1}, s_i⟩, where s_{i−1} ∈ S is the prior state, a_{i−1} ∈ A is the action that was taken in state s_{i−1}, and s_i ∈ S is the successor state. Both the states in S and the actions in A are real-valued feature vectors.
Much like in Molineaux, Aha, and Moore (2008), the case base models the effects of applying actions by maintaining the experienced transitions in the case base C_T : S × A × S. To reason about possible successor states, C_T retains the observations in the form

c = ⟨s, a, Δs⟩,

where Δs represents the change from the prior state to the current state.

1Code available at
The case base supports a case-based reasoning cycle as
described in (Aamodt and Plaza 1994) consisting of re-
trieval, reuse, and retention. Since the transition model is fit
to the experiences acquired from the environment directly,
C_T is never revised.
In addition to the case base, a Gaussian HMM is trained with the state s_i of each received experience. While the case base allows for the identification of successor states, the HMM models the transition probability distribution for state sequences of the form ⟨s_{i−1}, s_i⟩.
At the beginning of each trial, the transition case base is
initialized to the empty set. The HMM’s initial state distri-
bution and the state transition matrix are determined using
the posterior of a Gaussian Mixture Model (GMM) fit to a
subset of observations, and the emission probabilities are set
to the mean and covariance of the GMM.
New cases are retrieved, retained and reused according to the CBR policies regarding each method. A new case c^(i) = ⟨s_{i−1}, a_{i−1}, Δs_i⟩, where Δs_i = s_i − s_{i−1}, is retained only if it is not correctly predicted by C_T, which is the case if one of the following conditions holds:
1. The distance between c^(i) and its nearest neighbor in C_T is above the threshold: d(c^(i), 1NN(C_T, c^(i))) > τ.
2. The distance between the actual and the estimated transitions is greater than the error permitted: d(Δs_i, C_T(s_{i−1}, a_{i−1})) > σ.
The similarity measure for case retention in C_T is the Euclidean distance between the cases' state features.
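This retention policy can be sketched as follows, under the simplifying assumption that the case base's transition estimate C_T(s, a) is the stored Δs of the nearest neighbor; the function and variable names are illustrative:

```python
import math

def should_retain(case, case_base, tau=0.005, sigma=0.001):
    """Retain a case only if the case base mispredicts it (condition 1 or 2)."""
    if not case_base:
        return True
    s_prev, a_prev, delta = case
    # Condition 1: the nearest neighbor's state is farther than tau.
    nearest = min(case_base, key=lambda c: math.dist(c[0], s_prev))
    if math.dist(nearest[0], s_prev) > tau:
        return True
    # Condition 2: the estimated transition (here: the neighbor's stored delta)
    # differs from the actual transition by more than sigma.
    if math.dist(nearest[2], delta) > sigma:
        return True
    return False

case = ((0.0, 0.0), (1.0, 0.0), (0.0005, 0.0))
```

A case identical to one already stored is rejected, keeping the case base compact.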
Algorithm 1 Approximation of the successor state probability distribution for continuous states and actions
1: C_T: Transition case base ⟨S × A × S⟩
2: H: HMM trained with state features s
3: T: Transition probabilities T : S × A × S → [0, 1]
4: ———————————————————
5: procedure PREDICT(s_i, a_i)
6:     C_S ← retrieve(C_T, s_i)
7:     C_A ← reuse(C_S, a_i)
8:     T(s_i, a_i) ← ∅
9:     for all c ∈ C_A: T(s_i, a_i) ← T(s_i, a_i) ∪ ⟨c.Δs, predict(H, ⟨s_i, s_i + c.Δs⟩)⟩
10:    return normalize(T(s_i, a_i))
11: end procedure
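Algorithm 1 can be sketched as below, assuming the case base is a list of (state, action, Δs) tuples and `loglik` stands in for the HMM forward pass over a one-step sequence. All names are hypothetical, not the released code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two action vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.hypot(*u) * math.hypot(*v)
    return num / den if den else 0.0

def predict_successors(case_base, s_i, a_i, loglik, radius=0.001, rho=0.97):
    # Retrieval: cases whose state lies within `radius` of s_i (Figure 2(a)).
    C_S = [c for c in case_base if math.dist(c[0], s_i) <= radius]
    # Reuse: keep cases whose action is cosine-similar to a_i (Figures 2(b), 2(c)).
    C_A = [c for c in C_S if cosine(c[1], a_i) >= rho]
    # Each candidate successor is s_i + delta; score the one-step sequence with
    # the HMM forward pass (`loglik` returns log P(<s_i, s_next>)).
    raw = {}
    for _, _, delta in C_A:
        s_next = tuple(x + d for x, d in zip(s_i, delta))
        raw[s_next] = math.exp(loglik((s_i, s_next)))
    total = sum(raw.values())
    return {s: p / total for s, p in raw.items()} if total else {}

cases = [((0.0, 0.0), (1.0, 0.0), (0.5, 0.0))]
dist = predict_successors(cases, (0.0, 0.0), (1.0, 0.0), loglik=lambda seq: 0.0)
```

With a single matching case the normalized distribution puts all mass on its predicted successor.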
Given a set of k episodes, the initial state distribution is defined as D = {s_0^(0), s_0^(1), . . . , s_0^(k)}. Therefore the initial state is added to D at the beginning of each episode, such that D = D ∪ {s_0^(i)}. Initial states are drawn uniformly from D.
To infer the successor state probability distribution for an action a_i taken in state s_i, CASML consults both the transition case base and the HMM. The process is detailed in
Figure 3: Estimation of the state transition probabilities for a given state and a given action. The forward algorithm of the HMM is used to calculate the probability for each one-step sequence ⟨s_i, s_{i+1}⟩ that was identified utilizing the case base (Figure 2(d)).
Algorithm 1. First the potential successor states must be identified. This is accomplished by performing case retrieval on the case base, which collects the states similar to s_i into the set C_S (see Figure 2(a)). Depending on the domain, the retrieval method is either a k-nearest neighbor or a radius-based neighbor algorithm using the Euclidean distance as the metric. The retrieved cases are then updated into the set C_A by the reuse stage (see Figures 2(b) and 2(c)). C_A consists of all those cases c_k = ⟨s_k, a_k, Δs_k⟩ ∈ C_S whose action a_k is cosine similar to a_i, which is the case if d_cos(a_k, a_i) ≥ ρ. At this point all successor states s_{i+1} can be predicted by the vector addition s_i + Δs_k (see Figure 2(d)).
The next step is to ascertain the probabilities for transitioning into each of these successor states (Figure 3). The log likelihood of the evidence, i.e. the one-step sequence ⟨s_i, s_{i+1}⟩, for each of the potential state transitions is computed by performing the forward algorithm of the HMM. The successor state probability distribution is then obtained by normalizing over the exponential of the log likelihoods of all observation sequences.
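One practical note: exponentiating raw forward log likelihoods can underflow when they are very negative, so this normalization is typically done with the log-sum-exp trick. A minimal sketch of that numerically stable variant:

```python
import math

def normalize_loglik(logliks):
    """Turn a list of log likelihoods into a probability distribution,
    shifting by the maximum so math.exp never underflows to zero."""
    m = max(logliks)
    exps = [math.exp(l - m) for l in logliks]
    z = sum(exps)
    return [e / z for e in exps]

# Naive exp(-1000) underflows to 0.0; the shifted version still normalizes.
probs = normalize_loglik([-1000.0, -1001.0])
```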
4 Experiments
Our goal was for a humanoid robot, the Aldebaran Nao, to
learn the system dynamics involved in performing a joint
motion. For our experiment we first captured a motion us-
ing an Xbox controller to manipulate the robot’s end effec-
tors. The x-, y-, and z-positions of the end effectors serve
as data points for the joint trajectory and the continuous ac-
tions are the forces applied to the end effectors in each of the
coordinate directions. The force is determined by the posi-
tional states of the controller’s left and right thumb sticks
which produce a continuous range in the interval [-1, +1]
and generate a displacement that does not exceed 4 millime-
ters. We record the actions applied at each time step as the
policy that will be sampled from the model. All experiments
were performed in the Webots simulator from Cyberbotics2,
however they are easily adaptable to work on the physical
robot. Henceforth, the trajectories which are acquired from
the simulator will be referred to as the sampled trajectories
and the trajectories sampled from the model will be referred
to as approximated trajectories.
(a) (b)
Figure 4: Performance of the model as quantified by the ap-
proximation error. (a) The accuracy of the model identified
by various error metrics. (b) Comparison of the estimated
trajectory and the actual trajectory obtained from the simula-
tor. The error measurements and the estimated trajectory are
averaged over 50 instances, each estimated from a model in-
stance that was trained with a set of 50 trajectories sampled
in the simulator by following the action sequence.
For our first set of experiments we generated one action
sequence for the movement of the robot’s left arm, which
resulted in a 3-dimensional state feature vector for the joint
position of the wrist. The model was trained with 50 tra-
jectories sampled by following the previously acquired pol-
icy in the simulator. Subsequently, a trajectory was ap-
proximated using the trained model. Figure 4 illustrates
the performance of the algorithm in terms of a set of er-
ror measurements. Given the n-length sampled trajectory Tr_s = {s_s,0, s_s,1, . . . , s_s,n}, the n-length approximated trajectory Tr_a = {s_a,0, s_a,1, . . . , s_a,n}, and the effect of each action, i.e. the true action, stated as Δs_s,t = s_s,t − s_s,t−1 and Δs_a,t = s_a,t − s_a,t−1 for the sampled and the approximated trajectories respectively, the error metrics are defined as follows:
- The action metric identifies the divergence of the true action from the intended action: err_action,t = ||Δs_s,t − a_t||_2.
- The minmax metric measures whether an approximated state is within the minimum and maximum bounds of the sampled states. Assume a set of m sampled trajectories {s_s,0^(i), . . . , s_s,n^(i)}_{i=1}^m, and let err_min = ||min(s_t^([1:m])) − s_a,t||_2 and err_max = ||max(s_t^([1:m])) − s_a,t||_2 be the distances of s_a,t from the minimum and the maximum bounds respectively; then minmax_t = err_min if s_a,t < min(s_t^([1:m])), err_max if s_a,t > max(s_t^([1:m])), and 0 if min(s_t^([1:m])) ≤ s_a,t ≤ max(s_t^([1:m])).
- The delta metric measures the difference between the approximated and the sampled true action: err_delta,t = ||Δs_s,t − Δs_a,t||_2.
- The position metric analyzes the similarity of the resulting states: err_position,t = ||s_s,t − s_a,t||_2.

(a) (b)
Figure 5: Correlation between the accuracy of the model, the number of cases retained in the case base and the number of approximation failures, where accuracy is traded for efficiency. (a) Accuracy of the approximated trajectory and (b) correlation between the approximation errors, the number of approximation failures, and the number of cases, plotted as a function of the radius employed by the neighborhood algorithm.
We set τ = 0.005, σ = 0.001 and ρ = 0.97. Furthermore, we set the retrieval method to the radius-based algorithm with the radius set to 1 mm.
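Under the definitions above, the per-trajectory delta and position metrics reduce to averages of per-step Euclidean distances. A sketch of that reading (the paper does not publish its metric code, so names and averaging are assumptions):

```python
import math

def deltas(traj):
    """Per-step displacements, i.e. the 'true actions' Δs_t = s_t − s_{t−1}."""
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(traj, traj[1:])]

def delta_metric(sampled, approx):
    """Mean ||Δs_s,t − Δs_a,t||_2 over the trajectory."""
    ds, da = deltas(sampled), deltas(approx)
    return sum(math.dist(x, y) for x, y in zip(ds, da)) / len(ds)

def position_metric(sampled, approx):
    """Mean ||s_s,t − s_a,t||_2 over the trajectory."""
    return sum(math.dist(p, q) for p, q in zip(sampled, approx)) / len(sampled)

# Toy 3-D wrist trajectories: the approximation drifts 1 unit in y after step 1.
traj_s = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
traj_a = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
```

Note how the position error persists after the single deviating step, while the delta error is incurred only at that step, matching the paper's observation that the position metric accumulates rapidly after the first deviation.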
5 Results
The results in Figure 4(a) show that the model captures the
amount of deviation from the intended action quite accu-
rately (see action metric). The fact that the minmax error
is relatively small shows that the one-step sequences manage to stay in close proximity to the experienced bounds.
However, by requiring the joint displacement to be within
the bounds of the sampled states, the minmax metric makes
a stronger claim than the action metric which only concedes
that the motion deviated from the intended action. An even
stronger claim is made by the delta metric which analyzes
the accuracy of the motion. As expected the delta error is
comparatively larger since in a stochastic environment it is
less likely to achieve the exact same effect multiple times.
Though not shown in Figure 4(a) the position error is even
more restrictive as it requires the location of the joints to be
in the exact same place and accumulates rapidly after the
first deviation of motion. Nevertheless, as can be seen in
Figure 4(b), the approximated trajectory is almost identical
to that of the sampled trajectory.
The case retrieval method used for this experiment was
the radius-based neighbor algorithm, a variant of k-nearest
Figure 6: Comparison of the trajectory extracted from controlling the left arm in the simulator and the trajectory approximated
by the model after it was trained with ten distinctive trajectories. The deviations of the wrist joint position in the x-, y-, and
z-directions are scaled to cm. The approximated trajectory shown is an average over 20 instances.
neighbor that finds the cases within a given radius of the
query case. By allowing the model to only select cases
within the current state’s vicinity, joint limits can be mod-
eled. The drawback is that, in the case of uncertainty in the
model, it is less likely to find any solution. To gain a better
understanding, we compared the effect of varying the train-
ing radius on the model’s performance. Otherwise the setup
was similar to that of the previous experiment. Figure 5(a)
shows that as the size of the radius increases, the accuracy of
the model drops. This outcome is intuitive considering that
as the radius increases, the solution includes more cases that
diverge from the set of states representing this step in the tra-
jectory. At the same time the number of cases retained by the
case base decreases. Therefore, the found solutions are less
likely to be exact matches to that state. Interestingly, even
though the delta measure is more restrictive than the minmax
metric, it does not grow as rapidly. This can be explained by
the reuse method of the case base, which requires cases to
have similar actions to be considered part of the solution.
This further suggests that the model correctly predicts the
effects of the actions and thus generalizes effectively. As
is evident from Figure 5, the accuracy of the model is cor-
related with the number of approximation failures, defined
as incomplete trajectories. The effect is that more attempts
have to be made to estimate the trajectory from the model
with smaller radii. This makes sense since deviations from
the sampled trajectories are more likely to result in a sce-
nario where no cases with similar states and actions can be
found. In this experiment the problem is compounded since
no attempt at exploration is made. There is an interesting
spike in the number of failures when the radius is set to 10
mm (see Figure 5(b)). An explanation for this phenomenon
is that for small radii, there is a lesser likelihood for large de-
viations. Therefore, the solution mostly consists of the set of
cases that were sampled at this step in the trajectory. On the
other hand with larger radii, the solution may include more
cases which allows for generalization from other states. For
medium radii, there is a greater potential for the occurrence
of larger deviations when the radius does not encompass suf-
ficient cases for good generalization.
A third experiment illustrates the performance of the al-
Figure 7: The accuracy of the model after ten training sam-
ples. The results are averages over 20 approximation tra-
jectories. The minmax error was calculated by sampling 20
trajectories in the simulator for the validation.
gorithm on estimating a trajectory that the model had not
previously encountered during learning. For this purpose
the model was trained on a set of ten distinctive trajectories
which were derived from following unique policies. The
policies were not required to be of the same length, however
each trajectory started from the same initial pose. The model
was then used to estimate the trajectory of a previously un-
seen policy. We continued to utilize the radius-based neigh-
boring algorithm as the similarity measure and set the radius
to 15 mm. We also set ρ= 0.95 to allow for more deviation
from the action. This was necessary, as the model needed to
generalize more on unseen states than were required in the
previous experiments. There were also far fewer instances
to generalize from since the model was only trained on 10
trajectories. Furthermore, only 30% of the experienced in-
stances were retained by the model’s case base. A com-
parison of the approximated and the sampled trajectory is
depicted in Figure 6 which shows that the model is capa-
ble of capturing the general path of the trajectory. Figure 7
quantitatively supports this claim. Even the position metric,
which is the most restrictive, shows that the location of the
robot’s wrist deviates on average by only 3.45 mm from the
expected location.
6 Related Work
Prior work on model learning has utilized a variety of rep-
resentations. Hester and Stone (2009) use decision trees
to learn states and model their relative transitions. They
demonstrate that decision trees provide good generalization
to unseen states. Diuk, Li, and Leffler (2009) learn the struc-
ture of a Dynamic Bayesian Network (DBN) and the condi-
tional probabilities. The possible combinations of the input
features are enumerated as elements and the relevance of the
elements are measured in order to make a prediction. Both
these model learning algorithms only operate in discretized
state and action spaces and do not easily scale to problems
with inherently large or continuous domains.
Most current algorithms for reinforcement learning prob-
lems in the continuous domain rely on model-free tech-
niques. This is largely due to the fact that even if a model is
known, a usable policy is not easily extracted for all pos-
sible states (Van Hasselt 2012). However, research into
model-based techniques has become more prevalent as the
need to solve complex real-world problems has risen. Or-
moneit and Sen (2002) describe a method for learning an
instance-based model using the kernel function. All expe-
rienced transitions are saved to predict the next state based
on an average of nearby transitions, weighted using the ker-
nel function. Deisenroth and Rasmussen (2011) present an
algorithm called Probabilistic Inference for Learning Con-
trol (PILCO), which uses Gaussian Process (GP) regression
to learn a model of the domain and to generalize to unseen
states. Their method alternately takes batches of actions in
the world and then re-computes its model and policy. Jong
and Stone (2007) take an instance-based approach to solve
for continuous-state reinforcement learning problems. By
utilizing the inductive bias that actions have similar effects
in similar states their algorithm derives the probability dis-
tribution through Gaussian weighting of similar states. Our
approach differs from theirs by deriving the successor states
while taking the actions performed in similar states into ac-
count. The probability distribution is then determined by a
Hidden Markov Model.
7 Conclusion and Future Work
In this paper, we presented CASML, a model learner for
continuous states and actions that generalizes from a small
set of training instances. The model learner integrates a case
base reasoner and a Hidden Markov Model to derive succes-
sor state probability distributions for unseen states and ac-
tions. Our experiments demonstrate that the learned model
effectively expresses the system dynamics of a humanoid
robot. All the results indicate that the model learned with
CASML was able to approximate the desired trajectory quite well.
In future work we are planning to add a policy for revis-
ing cases in the model’s case base to account for the innate
bias towards the initially seen instances. We believe this will
lead to an improvement in the case base’s performance since
fewer cases will be required to achieve good generalization.
In addition we plan on evaluating the performance of the
model learner in combination with a value iteration planner
for generating optimal policies in continuous state and ac-
tion spaces.
References

Aamodt, A., and Plaza, E. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7(1):39–59.
Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A
model-based and data-efficient approach to policy search.
In Proceedings of the International Conference on Machine
Learning (ICML-11), 465–472.
Diuk, C.; Li, L.; and Leffler, B. R. 2009. The adap-
tive k-meteorologists problem and its application to struc-
ture learning and feature selection in reinforcement learning.
In Proceedings of the International Conference on Machine
Learning, 249–256. ACM.
Hester, T., and Stone, P. 2009. Generalized model learning
for reinforcement learning in factored domains. In Proceed-
ings of International Conference on Autonomous Agents and
Multiagent Systems-Volume 2, 717–724. International Foun-
dation for Autonomous Agents and Multiagent Systems.
Hester, T., and Stone, P. 2012. Learning and using models.
In Reinforcement Learning. Springer. 111–141.
Jong, N. K., and Stone, P. 2007. Model-based exploration in
continuous state spaces. In Abstraction, Reformulation, and
Approximation. Springer. 258–272.
Kolobov, A. 2012. Planning with Markov decision pro-
cesses: An AI perspective. Synthesis Lectures on Artificial
Intelligence and Machine Learning 6(1):1–210.
Kuvayev, L., and Sutton, R. S. 1996. Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems.
Molineaux, M.; Aha, D. W.; and Moore, P. 2008. Learn-
ing continuous action models in a real-time strategy envi-
ronment. In FLAIRS Conference, volume 8, 257–262.
Ormoneit, D., and Sen, S. 2002. Kernel-based reinforcement learning. Machine Learning 49(2-3):161–178.
Sutton, R. S., and Barto, A. G. 1998. Introduction to rein-
forcement learning. MIT Press.
Van Hasselt, H. 2012. Reinforcement learning in contin-
uous state and action spaces. In Reinforcement Learning.
Springer. 207–251.
Full-text available
Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can been difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, in this chapter we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions and discuss gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and actor-critic methods. We discuss the advantages of different approaches and compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy empirically.
Full-text available
The purpose of this paper is three-fold. First, we formalize and study a problem of learning probabilistic concepts in the recently proposed KWIK framework. We give details of an algo-rithm, known as the Adaptive k-Meteorologists Algorithm, analyze its sample-complexity up-per bound, and give a matching lower bound. Second, this algorithm is used to create a new reinforcement-learning algorithm for factored-state problems that enjoys significant improve-ment over the previous state-of-the-art algorithm. Finally, we apply the Adaptive k-Meteorologists Algorithm to remove a limiting assumption in an existing reinforcement-learning algorithm. The effectiveness of our approaches is demonstrated empirically in a couple benchmark domains as well as a robotics navigation problem.
Full-text available
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
Full-text available
Case-based reasoning is a recent approach to problem solving and learning that has got a lot of attention over the last few years. Originating in the US, the basic idea and underlying theories have spread to other continents, and we are now within a period of highly active research in case-based reasoning in Europe as well. This paper gives an overview of the foundational issues related to case-based reasoning, describes some of the leading methodological approaches within the field, and exemplifies the current state through pointers to some systems. Initially, a general framework is defined, to which the subsequent descriptions and discussions will refer. The framework is influenced by recent methodologies for knowledge level descriptions of intelligent systems. The methods for case retrieval, reuse, solution testing, and learning are summarized, and their actual realization is discussed in the light of a few example systems that represent different CBR approaches. We also discuss the role of case-based methods as one type of reasoning and learning method within an integrated system architecture.
Conference Paper
Full-text available
Although several researchers have integrated methods for re- inforcement learning (RL) with case-based reasoning (CBR) to model continuous action spaces, existing integrations typically employ discrete approximations of these models. This limits the set of actions that can be modeled, and may lead to non-optimal solutions. We introduce the Continuous Action and State Space Learner (CASSL), an integrated RL/CBR algorithm that uses continuous models directly. Our empirical study shows that CASSL significantly outper- forms two baseline approaches for selecting actions on a task from a real-time strategy gaming environment.
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
Markov Decision Processes (MDPs) are widely popular in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics. They are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes. MDPs are actively researched in two related subareas of AI, probabilistic planning and reinforcement learning. Probabilistic planning assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives. On the other hand, reinforcement learning additionally learns these models based on the feedback the agent gets from the environment. This book provides a concise introduction to the use of MDPs for solving probabilistic planning problems, with an emphasis on the algorithmic perspective. It covers the whole spectrum of the field, from the basics to state-of-the-art optimal and approximation algorithms. We first describe the theoretical foundations of MDPs and the fundamental solution techniques for them. We then discuss modern optimal algorithms based on heuristic search and the use of structured representations. A major focus of the book is on the numerous approximation schemes for MDPs that have been developed in the AI literature. These include determinization-based approaches, sampling techniques, heuristic functions, dimensionality reduction, and hierarchical representations. Finally, we briefly introduce several extensions of the standard MDP classes that model and solve even more complex planning problems.
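The optimal solution techniques the book describes start from the Bellman optimality equation; the following sketch runs value iteration on a tiny MDP. The three-state transition matrices and rewards are made up purely for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[a][s, s'] = probability of reaching s' from s under action a.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.9, 0.1, 0.0]],   # action 1
])
# R[s, a] = expected immediate reward.
R = np.array([[0.0, 1.0], [0.0, 0.0], [5.0, 0.0]])

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged values
```

Heuristic-search and approximation algorithms surveyed in the book replace this exhaustive sweep with targeted or approximate backups, but the backup itself is the same.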
As opposed to model-free RL methods, which learn directly from experience in the domain, model-based methods learn a model of the transition and reward functions of the domain on-line and plan a policy using this model. Once the method has learned an accurate model, it can plan an optimal policy on this model without any further experience in the world. Therefore, when model-based methods are able to learn a good model quickly, they frequently have improved sample efficiency over model-free methods, which must continue taking actions in the world for values to propagate back to previous states. Another advantage of model-based methods is that they can use their models to plan multi-step exploration trajectories. In particular, many methods drive the agent to explore where there is uncertainty in the model, so as to learn the model as fast as possible. In this chapter, we survey some of the types of models used in model-based methods and ways of learning them, as well as methods for planning on these models. In addition, we examine the typical architectures for combining model learning and planning, which vary depending on whether the designer wants the algorithm to run on-line, in batch mode, or in real-time. One of the main performance criteria for these algorithms is sample complexity, or how many actions the algorithm must take to learn. We examine the sample efficiency of a few methods, which are highly dependent on having intelligent exploration mechanisms. We survey some approaches to solving the exploration problem, including Bayesian methods that maintain a belief distribution over possible models to explicitly measure uncertainty in the model. We show some empirical comparisons of various model-based and model-free methods on two example domains before concluding with a survey of current research on scaling these methods up to larger domains with improved sample and computational complexity.
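The learn-a-model-then-plan loop described above can be sketched for the simplest case: a tabular model estimated from transition counts, then solved with value iteration. The five-state chain environment and sampling scheme are invented purely for illustration.

```python
import random

random.seed(0)
n_states, gamma = 5, 0.9
counts = {}     # (s, a) -> {s': visit count}, the learned transition model
rewards = {}    # (s, a) -> observed reward (environment is deterministic)

def step(s, a):
    """Toy chain dynamics: action 0 moves left, action 1 moves right;
    reaching the rightmost state yields reward 1."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, 1.0 if s2 == n_states - 1 else 0.0

# 1) Gather experience and fit the tabular model from counts.
for _ in range(2000):
    s, a = random.randrange(n_states), random.randrange(2)
    s2, r = step(s, a)
    counts.setdefault((s, a), {}).setdefault(s2, 0)
    counts[(s, a)][s2] += 1
    rewards[(s, a)] = r

# 2) Plan on the learned model with value iteration -- no further
#    environment interaction is needed.
V = [0.0] * n_states
for _ in range(200):
    new_V = []
    for s in range(n_states):
        q = []
        for a in range(2):
            d = counts[(s, a)]
            total = sum(d.values())
            expected = sum(c / total * V[s2] for s2, c in d.items())
            q.append(rewards[(s, a)] + gamma * expected)
        new_V.append(max(q))
    V = new_V
```

The sample-efficiency argument is visible here: the 2000 environment steps are reused for arbitrarily many planning sweeps, whereas a model-free learner would need fresh steps to propagate values.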
Improving the sample efficiency of reinforcement learning algorithms to scale up to larger and more realistic domains is a current research challenge in machine learning. Model-based methods use experiential data more efficiently than model-free approaches but often require exhaustive exploration to learn an accurate model of the domain. We present an algorithm, Reinforcement Learning with Decision Trees (rl-dt), that uses supervised learning techniques to learn the model by generalizing the relative effect of actions across states. Specifically, rl-dt uses decision trees to model the relative effects of actions in the domain. The agent explores the environment exhaustively in early episodes when its model is inaccurate. Once it believes it has developed an accurate model, it exploits its model, taking the optimal action at each step. The combination of the learning approach with the targeted exploration policy enables fast learning of the model. The sample efficiency of the algorithm is evaluated empirically in comparison to five other algorithms across three domains. rl-dt consistently accrues high cumulative rewards in comparison with the other algorithms tested.
Modern reinforcement learning algorithms effectively exploit experience data sampled from an unknown controlled dynamical system to compute a good control policy, but to obtain the necessary data they typically rely on naive exploration mechanisms or human domain knowledge. Approaches that first learn a model offer improved exploration in finite problems, but discrete model representations do not extend directly to continuous problems. This paper develops a method for approximating continuous models by fitting data to a finite sample of states, leading to finite representations compatible with existing model-based exploration mechanisms. Experiments with the resulting family of fitted-model reinforcement learning algorithms reveal the critical importance of how the continuous model is generalized from finite data. This paper demonstrates instantiations of fitted-model algorithms that lead to faster learning on benchmark problems than contemporary model-free RL algorithms that only apply generalization in estimating action values. Finally, the paper concludes that in continuous problems, the exploration-exploitation tradeoff is better construed as a balance between exploration and generalization.
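The fitting step described above, projecting continuous dynamics onto a finite sample of representative states, can be sketched as follows. The one-dimensional drift dynamics, the uniform state sample, and the two-element action set are illustrative assumptions, not the paper's benchmark setup.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(-1.0, 1.0, size=(50, 1))   # finite sample of states

def continuous_step(x, a):
    """Toy continuous dynamics: drift the state by the action, clipped to [-1, 1]."""
    return np.clip(x + a, -1.0, 1.0)

def fit_to_sample(x):
    """Project a continuous successor state onto the nearest sampled state's index."""
    return int(np.argmin(np.abs(samples[:, 0] - x)))

# Build a finite transition table T[i, a_idx] = j over a small action set.
# Discrete model-based planners and exploration mechanisms can then operate on T.
actions = [-0.1, 0.1]
T = np.array([[fit_to_sample(continuous_step(samples[i, 0], a))
               for a in actions] for i in range(len(samples))])
```

How this projection generalizes from finite data (nearest neighbor here, but kernel averaging or other schemes are possible) is exactly the design choice the paper identifies as critical.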