Integrating Sample-based Planning and Model-based Reinforcement Learning
Thomas J. Walsh and Sergiu Goschin and Michael L. Littman
Department of Computer Science, Rutgers University
Piscataway, NJ 08854, USA
{thomaswa, sgoschin, mlittman}@cs.rutgers.edu
Abstract
Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near-optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of sample-based planners, whose computation time is independent of the number of states, without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
Introduction
Reinforcement-learning (RL) algorithms (Sutton and Barto 1998) that rely on the most basic of representations (the so-called “flat MDP”) do not scale to environments with large state spaces, such as the exponentially sized spaces associated with factored-state MDPs (Boutilier, Dearden, and Goldszmidt 2000). However, model-based RL algorithms that use compact representations (such as DBNs) have been shown to have sample-efficiency guarantees (such as PAC-MDP) in these otherwise overwhelming environments. Despite the promise of these sample-efficient learners, a significant obstacle remains in their practical implementation: they require access to a planner that guarantees $\epsilon$-accurate decisions for the learned model. In small domains, traditional MDP planning algorithms like Value Iteration (Puterman 1994) can be used for this task, but for larger models, these planners become computationally intractable because they scale with the size of the state space. In a sense, the issue is not a learning problem: RL algorithms can learn the dynamics of such domains quickly. Instead, computation or planning is presently the bottleneck in model-based RL. This paper explores replacing the traditional flat MDP planners with sample-based planners. Such planners open up the planning bottleneck because their computation time is invariant with respect to the size of the state space.
Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
We analyze the integration of sample-based planners with a class of model learners called KWIK learners, which have been instrumental in recent advances on the sample complexity of learning RL models (Li, Littman, and Walsh 2008). The KWIK framework and its agent algorithm, KWIK-Rmax, have unified the study of sample complexity across representations, including factored-state and relational models (Walsh et al. 2009). But, to date, only flat-MDP planners have been used with these KWIK learners.
Sample-based planners, starting with Sparse Sampling or SS (Kearns, Mansour, and Ng 2002), were developed to combat the curse of dimensionality in large state spaces. By conceding an exponential dependence on the horizon length, not finding a policy for every state, and randomizing the value calculations, SS attains a runtime that is independent of the number of states in a domain. The successor of SS, Upper Confidence for Trees (UCT), formed the backbone of some extremely successful AI systems for RTS games and Go (Balla and Fern 2009; Gelly and Silver 2007). Silver, Sutton, and Müller (2008) used a sample-based planner in a model-based reinforcement learning system, building models from experience and using the planner with this model. This architecture is similar to ours, but made no guarantees on sample or computational complexity, which we do in this work. We also note that while the literature sometimes refers to sample-based planners as “learning” a value function from rollouts, their behavior is better described as a form of search given a generative model.
The major contribution of this paper is to define the first complete RL solution for compactly represented (exponentially sized) state spaces with KWIK-learnable dynamics that is both sample efficient and whose computation time grows only in the size of the compact representation, not the number of states. To do so, we (1) describe criteria for a general planner to be used in KWIK-Rmax and still ensure PAC-MDP behavior, (2) show that the original efficient sample-based planner, SS, satisfies these conditions, but the more empirically successful UCT does not, and (3) introduce a new sample-based planning algorithm we call Forward Search Sparse Sampling (FSSS) that behaves more like (and sometimes provably better than) UCT, while satisfying the efficiency requirements.
Learning Compact Models
We first review results in RL under the KWIK framework. We begin by describing the general RL setting.
Model-Based Reinforcement Learning and KWIK
An RL agent interacts with an environment that can be described as a Markov Decision Process (MDP) (Puterman 1994) $M = \langle S, A, R, T, \gamma \rangle$ with states $S$, actions $A$, a reward function $R : (S,A) \mapsto \Re$ with maximum reward $R_{\max}$, a transition function $T : (S,A) \mapsto \Pr[S]$, and discount factor $0 \le \gamma < 1$. The agent's goal is to maximize its expected discounted reward by executing an optimal policy $\pi^* : S \mapsto \Pr[A]$, which has an associated value function $Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^*(s')$ where $V^*(s) = \max_a Q^*(s,a)$. In model-based RL, an agent initially does not know $R$ and $T$. Instead, on each step it observes a sample from $T$ and $R$, updates its model $M'$, and then plans using $M'$. The KWIK framework, described below, helps measure the sample complexity of the model-learning component by bounding the number of inaccurate predictions $M'$ can make.
Knows What It Knows or KWIK (Li, Littman, and Walsh 2008) is a framework developed for supervised active learning. Parameters $\epsilon$ and $\delta$ control the accuracy and certainty, respectively. At each timestep, an input $x_t$ from a set $X$ is presented to an agent, which must make a prediction $\hat{y}_t \in Y$ from a set of labels $Y$ or predict $\bot$ (“I don't know”). If $\hat{y}_t = \bot$, the agent then sees a (potentially noisy) observation $z_t$ of the true $y_t$. An agent is said to be efficient in the KWIK paradigm if with high probability ($1 - \delta$): (1) the number of $\bot$ output by the agent is bounded by a polynomial function of the problem description; and (2) whenever the agent predicts $\hat{y}_t \ne \bot$, $\|\hat{y}_t - y_t\| < \epsilon$. The bound on the number of $\bot$ in the KWIK algorithm is related (with extra factors of $R_{\max}$, $(1 - \gamma)$, $\frac{1}{\epsilon}$, and $\delta$) to the PAC-MDP (Strehl, Li, and Littman 2009) bound on the agent's behavior, defined as
Definition 1. An algorithm $A$ is considered PAC-MDP if for any MDP $M$, $\epsilon > 0$, $0 < \delta < 1$, $0 < \gamma < 1$, the sample complexity of $A$, that is the number of steps $t$ such that $V^{A}_{t}(s_t) < V^*(s_t) - \epsilon$, is bounded by some function that is polynomial in the relevant quantities ($\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $\frac{1}{1-\gamma}$ and $|M|$) with probability at least $1 - \delta$.
Here, $|M|$ measures the complexity of the MDP $M$ based on the compact representation of $T$ and $R$, not necessarily $|S|$ itself (e.g. in a DBN $|M|$ is $O(\log(|S|))$). To ensure this guarantee, a KWIK-learning agent must be able to connect its KWIK learner for the world's dynamics ($KL$) to a planner $P$ that will interpret the learned model “optimistically”. For instance, in small MDPs, the model could be explicitly constructed, with all $\bot$ predictions being replaced by transitions to a trapping $R_{\max}$ state. This idea is captured in the KWIK-Rmax algorithm (Li 2009).
Algorithm 1 KWIK-Rmax (Li 2009)
  Agent knows $S$ (in some compact form), $A$, $\gamma$, $\epsilon$, $\delta$
  Agent has access to planner $P$ guaranteeing $\epsilon$ accuracy
  Agent has KWIK learner $KL$ for the domain
  $t = 0$, $s_t$ = start state
  while episode not done do
    $M'$ = $KL$ with $\bot$ interpreted optimistically
    $a_t$ = P.getAction($M'$, $s_t$)
    Execute $a_t$, view $s_{t+1}$, KL.update($s_t$, $a_t$, $s_{t+1}$)
    $t = t + 1$
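As a rough illustration of the loop in Algorithm 1, the Python sketch below connects a toy KWIK learner to a planner stub. Everything named here (`DeterministicKWIK`, `greedy_planner`, `kwik_rmax_episode`, `env_step`) is hypothetical and introduced only for this illustration. The learner simply memorizes deterministic outcomes, and the planner is a one-step lookahead stand-in rather than an $\epsilon$-optimal planner; the point is only to show where the optimistic interpretation of $\bot$ enters.

```python
BOTTOM = None  # stands in for the KWIK "I don't know" prediction


class DeterministicKWIK:
    """Toy KWIK learner: memorizes deterministic (s, a) -> (s', r) outcomes."""

    def __init__(self):
        self.table = {}

    def predict(self, s, a):
        return self.table.get((s, a), BOTTOM)

    def update(self, s, a, s_next, r):
        self.table[(s, a)] = (s_next, r)


def greedy_planner(learner, s, actions, rmax, gamma):
    """One-step stand-in for P.getAction; unknown pairs get the optimistic value."""
    vmax = rmax / (1.0 - gamma)

    def score(a):
        pred = learner.predict(s, a)
        return vmax if pred is BOTTOM else pred[1]

    return max(actions, key=score)


def kwik_rmax_episode(env_step, start, actions, steps, rmax=1.0, gamma=0.9):
    """Plan with the optimistic model, act, then update the learner."""
    learner, s, total = DeterministicKWIK(), start, 0.0
    for _ in range(steps):
        a = greedy_planner(learner, s, actions, rmax, gamma)
        s_next, r = env_step(s, a)
        learner.update(s, a, s_next, r)
        total += r
        s = s_next
    return learner, total
```

In a single-state problem with one rewarding and one unrewarding action, the agent tries each unknown action once because of the optimistic value and then keeps the rewarding one.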
Intuitively, the agent uses $KL$ for learning the dynamics and rewards of the environment, which are then given to an $\epsilon$-optimal planner $P$. For polynomial $|S|$, a standard MDP planner is sufficient to guarantee PAC-MDP behavior. However, KWIK has shown that the sample complexity of learning representations of far larger state spaces can still be polynomially (in $|M|$) bounded (see below). In these settings, the remaining hurdle to truly efficient learning is a planner $P$ that does not depend on the size of the state space, but still maintains the guarantees necessary for PAC-MDP behavior. Before describing such planners, we describe two domain classes with exponential $|S|$ that are still KWIK learnable.
Factored-state MDPs and Stochastic STRIPS
A factored-state MDP is an MDP where the states are comprised of a set of factors $F$ that take on attribute values. While $|S|$ is $O(2^F)$, the transition function $T$ is encoded using a Dynamic Bayesian Network or DBN (Boutilier, Dearden, and Goldszmidt 2000), where, for each factor $f$, the probability of its value at time $t + 1$ is based on the value of its $k$ “parent” factors at time $t$. Recently, KWIK algorithms for learning the structure, parameters, and reward function of such a representation have been described (Li, Littman, and Walsh 2008; Walsh et al. 2009). Combining these algorithms, a KWIK bound of $\tilde{O}\left(\frac{F^{k+3} A k}{\epsilon^4}\right)$ can be derived; notice it is polynomial (for $k = O(1)$) in $F$, $A$, $\frac{1}{\epsilon}$ (and therefore $|M|$), rather than in the (exponential in $F$) size of the state space $S$. Similarly, a class of relational MDPs describable using Stochastic STRIPS rules has been shown to be partially KWIK learnable (Walsh et al. 2009). In this representation, states are described by relational fluents (e.g. Above(a,b)) and the state space is once again exponential in the domain parameters (the number of predicates and objects), but the preconditions and outcome probabilities for actions are KWIK learnable with polynomial sample complexity. We now present sample-based planners, whose computational efficiency does not depend on $|S|$.
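To make the factored representation concrete, the following Python sketch samples a next state factor by factor from a DBN-style model with binary factors. The function names and the data layout (`parents`, `cpt`) are assumptions made for this illustration only, and a single action is modeled for brevity.

```python
import random


def sample_next_state(state, parents, cpt, rng):
    """Sample s' one factor at a time from a DBN-style transition model.

    state:   tuple of 0/1 factor values at time t
    parents: parents[f] is a tuple with the indices of factor f's parents
    cpt:     cpt[f][parent_values] = Pr[factor f equals 1 at time t + 1]
    """
    nxt = []
    for f in range(len(state)):
        key = tuple(state[p] for p in parents[f])
        nxt.append(1 if rng.random() < cpt[f][key] else 0)
    return tuple(nxt)


def num_table_entries(num_factors, k):
    """Conditional-probability entries for binary factors with k parents each."""
    return num_factors * 2 ** k
```

With 10 binary factors and 2 parents per factor, the tables hold 40 entries while the flat state space has $2^{10} = 1024$ states, which is the gap the compact representation exploits.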
Sample-based Planners
Sample-based planners are different from the original con-
ception of planners for KWIK-Rmax in two ways. First,
they require only a generative model of the environment,
not access to the model’s parameters. Nonetheless, they fit
nicely with KWIK learners, which can be directly queried
with state/action pairs to make generative predictions. Second, sample-based planners compute actions stochastically, so their policies may assign non-zero probability to sub-optimal actions.
Formally, the MDP planning problem can be stated as, for state $s \in S$, select an action $a \in A$ such that over time the induced policy $\pi$ will be $\epsilon$-optimal, that is, $\sum_a \pi(s,a) Q^*(s,a) \ge V^*(s) - \epsilon$. Value Iteration (VI) (Puterman 1994) solves the planning problem by iteratively improving an estimate $Q$ of $Q^*$, but takes time proportional to $|S|^2$ in the worst case. Instead, we seek a planner that fits the following criterion, adapted from Kearns, Mansour, and Ng (2002), that we later show is sufficient for preserving KWIK-Rmax's PAC-MDP guarantees.
Definition 2. An efficient state-independent planner $P$ is one that, given (possibly generative) access to an MDP model, returns an action $a$, such that the planning problem above is solved $\epsilon$-optimally, and the algorithm's per-step runtime is independent of $|S|$, and scales no worse than exponentially in the other relevant quantities ($\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $\frac{1}{1-\gamma}$).
Sample-based planners take a different view from VI. First, note that there is a horizon length $H$, a function of $\gamma$, $\epsilon$ and $R_{\max}$, such that taking the action that is near-optimal over the next $H$ steps is still an $\epsilon$-optimal action when considering the infinite sum of rewards (Kearns, Mansour, and Ng 2002). Next, note that the $d$-horizon value of taking action $a$ from state $s$ can be written $Q^d(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a'} Q^{d-1}(s',a')$, where $Q^1(s,a) = R(s,a)$. This computation can be visualized as taking place on a depth $H$ tree with branching factor $|A||S|$. Instead of computing a policy over the entire state space, sample-based planners estimate the value $Q^H(s,a)$ for all actions $a$ whenever they need to take an action from state $s$. Unfortunately, this insight alone does not provide sufficient help because the $|A||S|$ branching factor still depends linearly on $|S|$. But, the randomized Sparse Sampling algorithm (described below) eliminates this factor.
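One simple way to see why a finite horizon suffices is to bound the discounted tail: if rewards lie in $[0, R_{\max}]$, everything after step $H$ contributes at most $\gamma^H R_{\max} / (1 - \gamma)$. The Python sketch below picks the smallest $H$ that pushes this tail below $\epsilon$. The function name `planning_horizon` and the reward-range assumption are introduced here for illustration; the constants used by Kearns, Mansour, and Ng (2002) differ.

```python
import math


def planning_horizon(epsilon, gamma, rmax):
    """Smallest H with gamma**H * rmax / (1 - gamma) <= epsilon.

    Assumes rewards in [0, rmax]; a sketch of the tail-bound argument only.
    """
    if not (0.0 < gamma < 1.0) or epsilon <= 0.0 or rmax <= 0.0:
        raise ValueError("need 0 < gamma < 1, epsilon > 0 and rmax > 0")
    bound = epsilon * (1.0 - gamma) / rmax
    if bound >= 1.0:
        return 0  # the whole discounted sum is already below epsilon
    return math.ceil(math.log(bound) / math.log(gamma))
```

Note that $H$ depends only on $\gamma$, $\epsilon$ and $R_{\max}$, never on $|S|$.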
Sparse Sampling
The insight of Sparse Sampling or SS (Kearns, Mansour, and Ng 2002) is that the summation over states in the definition of $Q^H(s,a)$ need only be taken over a sample of next states and the size of this sample can depend on $R_{\max}$, $\gamma$, and $\epsilon$ instead of $|S|$. Let $K^d(s,a)$ be a sample of the necessary size $C$ drawn according to the distribution $s' \sim T(s,a,\cdot)$. Then, the SS approximation can be written (for $s' \in K^d(s,a)$):
$Q^d_{SS}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q^{d-1}_{SS}(s',a')$.
SS traverses the tree structure of state/horizon pairs in a bottom-up fashion. The estimate for a state at horizon $t$ cannot be created until the values at $t + 1$ have been calculated for all reachable states. It does not use intermediate results to focus computation on more important parts of the tree. As a result, its running time, both best and worst case, is $\Theta((|A|C)^H)$. Because it is state-space-size independent and solves the planning problem with $\epsilon$ accuracy (Kearns, Mansour, and Ng 2002), it satisfies Definition 2, making it an attractive candidate for integration into KWIK-Rmax. We prove this combination preserves PAC-MDP behavior later. Unfortunately, in practice, SS is typically quite slow because its search of the tree is not focused, a problem addressed by its successor, UCT.
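The recursion behind SS can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `model(s, a, rng)` is a hypothetical generative model returning one sampled next state, `reward(s, a)` is a known reward function, and the sum over sampled next states is taken as a plain average over the $C$ draws.

```python
import random


def sparse_sampling_q(model, reward, actions, s, d, C, gamma, rng):
    """Estimate the d-horizon value of every action at s from C samples each."""
    q = {}
    for a in actions:
        if d == 1:
            q[a] = reward(s, a)  # base case: Q^1(s, a) = R(s, a)
            continue
        total = 0.0
        for _ in range(C):
            s_next = model(s, a, rng)
            q_next = sparse_sampling_q(model, reward, actions,
                                       s_next, d - 1, C, gamma, rng)
            total += max(q_next.values())
        q[a] = reward(s, a) + gamma * total / C
    return q
```

Each call expands $|A| \cdot C$ children per level regardless of how many states the model can produce, which is where the $(|A|C)^H$ cost comes from.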
Upper Confidence for Tree Search
Conceptually, UCT (Kocsis and Szepesvári 2006) takes a top-down approach (from root to leaf), guided by a non-stationary search policy. In top-down approaches, planning proceeds in a series of trials. In each trial (Algorithm 2), a search policy is selected and followed from root to leaf and the state-action-reward sequence is used to update the $Q$ estimates in the tree. Note that, if the search policy is the optimal policy, value updates can be performed via simple averaging and estimates converge quickly. Thus, in the best case, a top-down sample-based planner can achieve a running time closer to $C$, rather than $(|A|C)^H$.
In UCT, the sampling of actions at a state/depth node is determined by $\max_a \left( v + \sqrt{2 \log(n_{sd}) / n_a} \right)$, where $v$ is the average value of action $a$ from this state/depth pair based on previous trials, $n_{sd}$ counts the number of times state $s$ has been visited at depth $d$ and $n_a$ is the number of times $a$ was tried there. This quantity represents the upper tail of the confidence interval for the node's value, and the strategy encourages an aggressive search policy that explores until it finds fairly good rewards, and then only periodically explores for better values. While UCT has performed remarkably in some very difficult domains, we now show that its theoretical properties do not satisfy the efficiency conditions in Definition 2.
Algorithm 2 UCT(s, d) (Kocsis and Szepesvári 2006)
  if $d = 1$ then
    $Q^d(s,a) = R(s,a), \forall a$
  else
    $a^* = \arg\max_a \left( Q^d(s,a) + \sqrt{2 \log(n_{sd}) / n_{asd}} \right)$
    $s' \sim T(s,a^*,\cdot)$
    $v = R(s,a^*) + \gamma \,$UCT($s'$, $d - 1$)
    $n_{sd} = n_{sd} + 1$
    $n_{a^*sd} = n_{a^*sd} + 1$
    $Q^d(s,a^*) = (Q^d(s,a^*) \times (n_{a^*sd} - 1) + v) / n_{a^*sd}$
  return $v$
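A hedged Python sketch of the trial in Algorithm 2 follows. The class name `UCT`, the `model(s, a, rng)` generative-model signature, and two details not fixed by the pseudocode are assumptions of this illustration: untried actions are given an infinite bonus so each is sampled once, and a leaf returns the best immediate reward.

```python
import math
import random
from collections import defaultdict


class UCT:
    """Sketch of UCT trials over (state, depth) nodes."""

    def __init__(self, model, reward, actions, gamma, rng):
        self.model, self.reward, self.actions = model, reward, actions
        self.gamma, self.rng = gamma, rng
        self.q = defaultdict(float)    # (s, d, a) -> running average return
        self.n_sd = defaultdict(int)   # (s, d) -> visit count
        self.n_sda = defaultdict(int)  # (s, d, a) -> times a was tried

    def trial(self, s, d):
        if d == 1:  # assumption: a leaf returns its best immediate reward
            return max(self.reward(s, a) for a in self.actions)

        def ucb(a):
            n = self.n_sda[(s, d, a)]
            if n == 0:
                return math.inf  # assumption: try every action once first
            bonus = math.sqrt(2 * math.log(self.n_sd[(s, d)]) / n)
            return self.q[(s, d, a)] + bonus

        a = max(self.actions, key=ucb)
        s_next = self.model(s, a, self.rng)
        v = self.reward(s, a) + self.gamma * self.trial(s_next, d - 1)
        self.n_sd[(s, d)] += 1
        self.n_sda[(s, d, a)] += 1
        # incremental form of the running average in Algorithm 2
        self.q[(s, d, a)] += (v - self.q[(s, d, a)]) / self.n_sda[(s, d, a)]
        return v

    def best_action(self, s, d, trials):
        for _ in range(trials):
            self.trial(s, d)
        return max(self.actions, key=lambda a: self.q[(s, d, a)])
```

The number of trials is a free parameter here; nothing in the sketch decides when enough trials have been run.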
A Negative Case for UCT’s Runtime
UCT gets some of its advantages from being aggressive. Specifically, if an action appears to lead to low reward, UCT quickly rules it out. Unfortunately, sometimes this conclusion is premature and it takes a very long time for UCT's confidence intervals to grow and encourage further search. A concrete example shows how UCT can take a super-exponential number of trials to find the best action from $s$.
The environment in Figure 1 is adapted from Coquelin and Munos (2007) with the difference that it has only a polynomial number of states in the horizon. The domain is deterministic, with two actions available at each non-goal state. Action $a_1$ leads from a state $s_i$ to state $s_{i+1}$ and yields 0 reward, except for the transition from state $s_{D-1}$ to the goal state $s_D$ where a reward of 1 is received. Action $a_2$ leads from a state $s_{i-1}$ to a goal state $G_i$ while receiving a reward of $\frac{D-i}{D}$. The optimal action is to take $a_1$ from state $s_0$.
[Figure 1 diagram: a chain of states $s_0, s_1, \ldots, s_{D-2}, s_{D-1}, s_D$ linked by action $a_1$ (reward 0 on every transition except reward 1 on the last one); from each $s_{i-1}$, action $a_2$ leads to goal state $G_i$ with rewards $\frac{D-1}{D}, \frac{D-2}{D}, \ldots, \frac{1}{D}, 0$.]
Figure 1: A simple domain where UCT can require super-exponential computation to find an $\epsilon$-optimal policy.
Coquelin and Munos (2007) proved that the initial number of timesteps needed to reach node $s_D$ (which is always optimal in their work) for the first time is $\Omega\big(\overbrace{\exp(\exp(\cdots(\exp(2))\cdots))}^{D-4}\big)$. Given this fact, we can state
Proposition 1. For the MDP in Figure 1, for any $\epsilon < \frac{1}{D}$, the minimum number of trials (and therefore the computation time) UCT needs before node $s_D$ is reached for the first time (and implicitly the number of steps needed to ensure the behavior is $\epsilon$-optimal) is a composition of $O(D)$ exponentials. Therefore, UCT does not satisfy Definition 2.
Proof. Assume action $a_2$ is always chosen first when a previously unknown node is reached (see footnote 1). Let $\epsilon < \frac{1}{D}$. According to the analysis of Coquelin and Munos (2007), action $a_2$ is chosen $\Omega(\exp(\exp(\cdots(\exp(2))\cdots)))$ times before $s_D$ is reached the first time. The difference between the values of the two actions is at least $\frac{1}{D} > \epsilon$, implying that, with probability 0.5, if UCT is stopped before the minimum number of necessary steps to reach $s_D$, it will return a policy (and implicitly an action) that is not $\epsilon$-optimal.
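The domain of Figure 1 is small enough to write down directly, which makes the $\frac{1}{D}$ gap between the two root actions easy to check. The Python sketch below encodes the transitions as described in the text. The state encoding (`("s", i)` and `("G", i)`), the function names, and the use of undiscounted returns are assumptions made for this illustration.

```python
def chain_step(state, action, D):
    """Deterministic model of the Figure 1 chain; returns (next_state, reward)."""
    kind, i = state
    if kind != "s" or i == D:
        raise ValueError("no actions are available in a goal state")
    if action == "a1":
        # a1 moves s_i -> s_{i+1}; only the final transition into s_D pays 1
        return ("s", i + 1), (1.0 if i == D - 1 else 0.0)
    if action == "a2":
        # a2 moves s_i -> G_{i+1} with reward (D - (i + 1)) / D
        return ("G", i + 1), (D - (i + 1)) / D
    raise ValueError(action)


def optimal_return(state, D):
    """Best undiscounted return obtainable from a state."""
    kind, i = state
    if kind == "G" or i == D:
        return 0.0
    outcomes = [chain_step(state, a, D) for a in ("a1", "a2")]
    return max(r + optimal_return(nxt, D) for nxt, r in outcomes)


def root_action_values(D):
    """Value of each action at s_0 when the agent acts optimally afterwards."""
    values = {}
    for a in ("a1", "a2"):
        nxt, r = chain_step(("s", 0), a, D)
        values[a] = r + optimal_return(nxt, D)
    return values
```

For any $D$, the value of $a_1$ at the root is 1 and the value of $a_2$ is $\frac{D-1}{D}$, so a planner that stops before reaching $s_D$ sees only the tempting immediate reward of $a_2$.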
We note that this result is slightly different from the regret bound of Coquelin and Munos (2007), which is not directly applicable here. Still, the empirical success of UCT in many real-world domains makes the use of a top-down search technique appealing. In the next section, we introduce a more conservative top-down sample-based planner.
Forward Search Sparse Sampling
Our new algorithm, Forward Search Sparse Sampling (FSSS), employs a UCT-like search strategy without sacrificing the guarantees of SS. Recall that $Q^d(s,a)$ is an estimate of the depth $d$ value for state $s$ and action $a$. We introduce upper and lower bounds $U^d(s)$ and $L^d(s)$ for states and $U^d(s,a)$ and $L^d(s,a)$ for state-action pairs.
Like UCT, it proceeds in a series of top-down trials (Algorithm 3), each of which begins with the current state $s$ and depth $H$ and proceeds down the tree to improve the estimate of the actions at the root. Like SS, it limits its branching factor to $C$. Ultimately, it computes precisely the same value as SS, given the same samples. However, it benefits from a kind of pruning to reduce the amount of computation needed in many cases.
Footnote 1: Similar results can be obtained if ties are broken randomly.
Algorithm 3 FSSS(s, d)
  if $d = 1$ (leaf) then
    $L^d(s,a) = U^d(s,a) = R(s,a), \forall a$
    $L^d(s) = U^d(s) = \max_a R(s,a)$
  else
    if $n_{sd} = 0$ then
      for each $a \in A$ do
        $L^d(s,a) = V_{\min}$
        $U^d(s,a) = V_{\max}$
        for $C$ times do
          $s' \sim T(s,a,\cdot)$
          $L^{d-1}(s') = V_{\min}$
          $U^{d-1}(s') = V_{\max}$
          $K^d(s,a) = K^d(s,a) \cup \{s'\}$
    $a^* = \arg\max_a U^d(s,a)$
    $s^* = \arg\max_{s' \in K^d(s,a^*)} (U^{d-1}(s') - L^{d-1}(s'))$
    FSSS($s^*$, $d - 1$)
    $n_{sd} = n_{sd} + 1$
    $L^d(s,a^*) = R(s,a^*) + \gamma \sum_{s' \in K^d(s,a^*)} L^{d-1}(s') / C$
    $U^d(s,a^*) = R(s,a^*) + \gamma \sum_{s' \in K^d(s,a^*)} U^{d-1}(s') / C$
    $L^d(s) = \max_a L^d(s,a)$
    $U^d(s) = \max_a U^d(s,a)$
When $L^H(s,a^*) \ge \max_{a \ne a^*} U^H(s,a)$ for $a^* = \arg\max_a U^H(s,a)$, no more trials are needed and $a^*$ is the best action at the root. The following propositions show that, unlike UCT, FSSS solves the planning problem in accordance with Definition 2.
Proposition 2. On termination, the action chosen by FSSS is the same as that chosen by SS.
Using the definition of $Q^d(s,a)$ in SS, it is straightforward to show that $L^d(s,a)$ and $U^d(s,a)$ are lower and upper bounds on its value. If the termination condition is achieved, these bounds indicate that $a^*$ is the best.
Proposition 3. The total number of trials of FSSS before termination is bounded by the number of leaves in the tree.
Note that each trial ends at a leaf node. We say a node $s$ at depth $d$ is closed if its upper and lower bounds match, $L^d(s) = U^d(s)$. A leaf is closed the first time it is visited by a trial. We argue that every trial closes a leaf.
If the search is not complete, the root must not be closed. Now, inductively, assume the trial has reached a node $s$ and depth $d$ that is not closed. For the selected action $a^*$, it follows that $L^d(s,a^*) \ne U^d(s,a^*)$ (or $s$ must be closed). That means there must be some $s' \in K^d(s,a^*)$ such that $s'$ at $d - 1$ is not closed, otherwise the upper and lower bound averages would be identical. FSSS chooses the $s'$ with the widest bound. Since each step of the trial leads to a node that is not closed, it must terminate at a leaf that had not already been visited. Once the leaf is visited, it is closed and cannot be visited again.
Another property of FSSS is that it implements a version of classical search-tree pruning. There are several kinds of pruning suggested for SS (Kearns, Mansour, and Ng 2002), but they all boil down to: don't do evaluations in parts of the tree where the value (even if it is maximized) cannot exceed the value of some other action (even if it is minimized). Our choice of taking the action with the maximum upper bound achieves this behavior. One advantage FSSS has over classical pruning is that our approach can interrupt the evaluation of one part of the tree if partial information indicates it is no longer very promising.
[Figure 2 plots, Paint-Polish world. Left panel: average reward versus number of objects. Right panel ("Computation"): time (seconds) per trial versus number of objects. Legend: UCT, SS, FSSS, VI.]
Figure 2: Planners in Paint-Polish world with increasing objects (40 runs). The optimal policy's average reward (VI) decreases linearly with the number of objects. Note VI and SS become intractable, as seen in the computation time plot.
As written, a trial takes time $O(H(|A| + C))$ because of the various maximizations and averages. With more careful data structures, such as heaps, it can be brought down to $O(H(\log|A| + \log C))$ per trial. Also, if better bounds are available for nodes of the tree, say from an admissible shaping function, $L$ and $U$ can be initialized accordingly.
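The FSSS trial and its termination test can be sketched in Python as follows. This is an illustrative reading of Algorithm 3 under stated assumptions rather than a reference implementation: the class and method names and the `model(s, a, rng)` signature are hypothetical, nodes are keyed by (state, depth) so repeated states share bounds, and `setdefault` is used so that a child already tightened by an earlier trial is not reset when it is sampled again.

```python
import random


class FSSS:
    """Sketch of Forward Search Sparse Sampling over (state, depth) nodes."""

    def __init__(self, model, reward, actions, C, gamma, vmin, vmax, rng):
        self.model, self.reward, self.actions = model, reward, actions
        self.C, self.gamma, self.rng = C, gamma, rng
        self.vmin, self.vmax = vmin, vmax
        self.L, self.U = {}, {}    # (s, d) -> bounds on the state value
        self.La, self.Ua = {}, {}  # (s, d, a) -> bounds on the action value
        self.K = {}                # (s, d, a) -> list of C sampled next states
        self.expanded = set()      # nodes whose children have been sampled

    def _expand(self, s, d):
        self.expanded.add((s, d))
        for a in self.actions:
            self.La[(s, d, a)], self.Ua[(s, d, a)] = self.vmin, self.vmax
            self.K[(s, d, a)] = [self.model(s, a, self.rng) for _ in range(self.C)]
            for s2 in self.K[(s, d, a)]:
                self.L.setdefault((s2, d - 1), self.vmin)
                self.U.setdefault((s2, d - 1), self.vmax)

    def trial(self, s, d):
        if d == 1:  # leaf: bounds collapse to the immediate reward
            for a in self.actions:
                self.La[(s, d, a)] = self.Ua[(s, d, a)] = self.reward(s, a)
            best = max(self.reward(s, a) for a in self.actions)
            self.L[(s, d)] = self.U[(s, d)] = best
            return
        if (s, d) not in self.expanded:
            self._expand(s, d)
        a = max(self.actions, key=lambda b: self.Ua[(s, d, b)])
        kids = self.K[(s, d, a)]
        widest = max(kids, key=lambda x: self.U[(x, d - 1)] - self.L[(x, d - 1)])
        self.trial(widest, d - 1)
        r = self.reward(s, a)
        self.La[(s, d, a)] = r + self.gamma * sum(self.L[(x, d - 1)] for x in kids) / self.C
        self.Ua[(s, d, a)] = r + self.gamma * sum(self.U[(x, d - 1)] for x in kids) / self.C
        self.L[(s, d)] = max(self.La[(s, d, b)] for b in self.actions)
        self.U[(s, d)] = max(self.Ua[(s, d, b)] for b in self.actions)

    def plan(self, s, H, max_trials=10000):
        """Run trials until the best action's lower bound beats all rivals."""
        best = None
        for _ in range(max_trials):
            self.trial(s, H)
            best = max(self.actions, key=lambda b: self.Ua[(s, H, b)])
            rivals = [self.Ua[(s, H, b)] for b in self.actions if b != best]
            if not rivals or self.La[(s, H, best)] >= max(rivals):
                break
        return best
```

The bounds `vmin` and `vmax` must bracket every reachable value; with rewards in $[0, R_{\max}]$, the values 0 and $R_{\max} / (1 - \gamma)$ are one safe choice. The `max_trials` cap is only a guard for the sketch.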
Figure 2 shows the three sample-based planners dis-
cussed above performing in the “Paint-Polish” relational
MDP (Stochastic STRIPS) from Walsh et al. (2009). The
domain consists of |O| objects that need to be painted, pol-
ished, and finished, but the stochastic actions sometimes
damage the objects and need to be repeated. Like most rela-
tional domains |S| is exponential in |O| and here |A| grows
linearly in |O|. With increasing |O|, VI quickly becomes in-
tractable and SS falters soon after because of its exhaustive
search. But, UCT and FSSS can still provide passable poli-
cies with low computational overhead. FSSS’s plans also
remain slightly more consistent than UCT's, staying closer to the linearly decreasing expected reward of $\pi^*$ for increasing $|O|$. For both of those planners, 2000 rollouts were used to plan at each step.
Learning with Sample-Based Planners
We now complete the connection of sample-based planners with the KWIK-Rmax algorithm by proving that the PAC-MDP conditions are still satisfied by our modified KWIK-Rmax algorithm. Note this integration requires the planner to see an optimistic interpretation of the learned model. The needed modification for an algorithm like Value Iteration would be to replace any unknown transitions ($\bot$) with a high-valued “Rmax” state. In our sample-based algorithms, we will use $\frac{R_{\max}}{1-\gamma}$ as the value of any $(s,d,a)$ triple in the tree where the learner makes such an unknown prediction. Note the new algorithm no longer has to explicitly build $M'$, and can instead use $KL$ directly as a generative model for $P$.
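The optimism modification can be illustrated with a small Python sketch in which the planner queries the learner directly and substitutes $\frac{R_{\max}}{1-\gamma}$ whenever the learner answers $\bot$. The function `optimistic_q` and the `predict(s, a)` interface (returning either a sampled `(next_state, reward)` pair or `BOTTOM`) are hypothetical names for this illustration, and the evaluation is a plain sparse-sampling recursion rather than any specific planner from the paper.

```python
BOTTOM = None  # stands in for the learner's "I don't know" answer


def optimistic_q(predict, actions, s, d, C, gamma, rmax):
    """Depth-d action values at s, with unknown (s, a) pairs valued at Rmax/(1-gamma)."""
    vmax = rmax / (1.0 - gamma)
    q = {}
    for a in actions:
        samples = [predict(s, a) for _ in range(C)]
        if any(x is BOTTOM for x in samples):
            q[a] = vmax  # optimistic value for an unknown triple
        elif d == 1:
            q[a] = sum(r for _, r in samples) / C
        else:
            total = 0.0
            for s_next, r in samples:
                q_next = optimistic_q(predict, actions, s_next, d - 1, C, gamma, rmax)
                total += r + gamma * max(q_next.values())
            q[a] = total / C
    return q
```

Because unknown pairs look maximally valuable, a planner built this way is drawn toward them, which is what drives exploration without the model $M'$ ever being constructed explicitly.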
Thus modified, the algorithm has the following property.
Theorem 1. The KWIK-Rmax algorithm (Algorithm 1) using a sample-based planner $P$ that satisfies Definition 2 and querying $KL$ as a generative model with the “optimism” modification described above is PAC-MDP (Definition 1) and has computation that grows only with $|M|$, not $|S|$.
Proof Sketch of Theorem 1
The full proof is similar to the original KWIK-Rmax
proof (Li 2009), so we describe only the lemmas that must
be adapted due to the use of sample-based planners.
The crux of the proof is showing that KWIK-Rmax with sample-based planners that satisfy Definition 2 satisfies the three sufficient conditions for an algorithm to be PAC-MDP: optimism, accuracy, and a bounded number of “surprises”. An optimistic model is one for which the estimated value function in a given state is greater than the optimal value function in the real MDP. A model is accurate when the estimated value function is close enough to the value of the current policy in the known part of the MDP. Since our new algorithm does not explicitly build this MDP (and instead connects $KL$ and $P$ directly), these two conditions are changed so that at any timestep the estimated value of the optimal stochastic policy in the current estimated MDP (from $KL$) is optimistic and accurate. A surprise is a discovery event that changes the learned MDP (in $KL$).
The following lemma states that the algorithm's estimated value function is optimistic for any time step $t$.
Lemma 1. With probability at least $1 - \delta$, $V^{\pi_t}_t(s) \ge V^*(s) - \epsilon$ for all $t$ and $(s,a)$, where $\pi_t$ is the policy returned by the planner.
The proof is identical to the original proof of Li (2009) (see Lemma 34) with the extra observation that the planner computes an implicit $V^{\pi_t}_t$ function of a stochastic policy.
Now, turning to the accuracy criterion, we can use a variation of the Simulation Lemma (cf. Lemma 12 of Strehl, Li, and Littman (2009)) that applies to stochastic policies, and bounds the difference between the value functions of a policy in two MDPs that are similar in terms of transitions and rewards. The intuition behind the proof of this new version is that the stationary stochastic policy $\pi$ in MDPs $M_1$ and $M_2$ induces two Markov chains $M'_1$ and $M'_2$ with transitions $T'_1(s,s') = \sum_a \pi(s,a) T_1(s,a,s')$ and rewards $R'_1(s) = \sum_a \pi(s,a) R_1(s,a)$ (analogously for $T'_2$, $R'_2$). By standard techniques, we can show these transition and reward functions are close.
Because the two Markov chains are $\epsilon$-close, they have $\epsilon$-close value functions and thus, the value functions of $\pi$ in MDPs $M_1$ and $M_2$ are bounded by the same difference as between the optimal value functions in MDPs $M'_1$ and $M'_2$. According to the standard simulation lemma, the difference is $\frac{\epsilon_R + \gamma V_{\max} \epsilon_T}{1-\gamma}$. From there, the following lemma bounds the accuracy of the policy computed by the planner:
Lemma 2. With probability at least $1 - \delta$, $V^{\pi_t}_t(s_t) - V^{\pi_t}_{M_K}(s_t) \le \epsilon$, where $\pi_t$ is the policy returned by the planner, and $M_K$ is the known MDP.