
# Policy Transfer in Apprenticeship Learning

Authors: Abdeslam Boularias and Brahim Chaib-draa
DAMAS Laboratory, www.damas.ift.ulaval.ca
{boularias;chaib}@damas.ift.ulaval.ca
1. Overview
We cast the problem of Apprenticeship Learning (Imitation Learning) as a classification problem.
We use a modified version of the k-nearest neighbors method.
The distance between two vertices is the distance between the graphs defined around these vertices.
The distance between two graphs is the largest error of a homomorphism between the two graphs.
2. Markov Decision Process (MDP)
A Markov Decision Process (MDP) is defined by:
S: a finite set of states.
A: a finite set of actions.
T: a transition function, where T(s, a, s′) is the probability of ending up in state s′ after taking action a in state s.
R: a reward function, where R(s, a) is the immediate reward that the agent receives for executing action a in state s.
γ ∈ (0, 1]: a discount factor.
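The tuple above can be sketched as a small data structure; this is a minimal illustration with an invented two-state MDP, not a structure from the poster.

```python
# A minimal MDP container matching the definition above; the names and
# the two-state example are illustrative assumptions, not from the poster.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    states: list      # S: a finite set of states
    actions: list     # A: a finite set of actions
    T: Callable       # T(s, a, s2) -> probability of ending up in s2
    R: Callable       # R(s, a) -> immediate reward
    gamma: float      # discount factor in (0, 1]

# A two-state example: action "go" moves state 0 -> 1 with probability 0.8.
mdp = MDP(
    states=[0, 1],
    actions=["go"],
    T=lambda s, a, s2: {(0, 1): 0.8, (0, 0): 0.2, (1, 1): 1.0}.get((s, s2), 0.0),
    R=lambda s, a: 1.0 if s == 1 else 0.0,
    gamma=0.9,
)
# Transition probabilities out of each (state, action) pair must sum to 1.
assert abs(sum(mdp.T(0, "go", s2) for s2 in mdp.states) - 1.0) < 1e-9
```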
3. Policies
A policy π is a function that maps every state into a distribution over the actions:
π : S × A → [0, 1]
π(s, a) = Pr(a_t = a | s_t = s)
The value of a policy π is the expected sum of the discounted rewards that an agent receives by following this policy:
V(π) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | π ]
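The value V(π) can be computed by iterating the Bellman equation until it converges; the tiny two-state MDP and the policy below are invented for illustration.

```python
# Policy evaluation on a made-up two-state MDP: iterate
# V(s) <- sum_a pi(s,a) [ R(s,a) + gamma * sum_s2 T(s,a,s2) V(s2) ]
# until convergence.  All names and numbers are illustrative assumptions.
states = [0, 1]
actions = ["stay", "go"]
gamma = 0.9
T = {(0, "go", 1): 1.0, (0, "stay", 0): 1.0,
     (1, "go", 1): 1.0, (1, "stay", 1): 1.0}
R = {(1, "go"): 1.0, (1, "stay"): 1.0}          # reward only in state 1
pi = {0: {"go": 1.0}, 1: {"stay": 1.0}}         # a deterministic policy

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: sum(pi[s].get(a, 0.0)
                * (R.get((s, a), 0.0)
                   + gamma * sum(T.get((s, a, s2), 0.0) * V[s2]
                                 for s2 in states))
                for a in actions)
         for s in states}
# State 1 collects reward 1 forever: V[1] -> 1 / (1 - gamma) = 10.
```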
Solving an MDP consists of finding an optimal policy, i.e. a policy of maximal value.
4. Apprenticeship Learning
Specifying a reward function by hand is not easy in most practical problems [Abbeel & Ng, 2004].
It is often easier to demonstrate examples of a desired behavior than to define a reward function.
In apprenticeship learning, we assume that the reward function is unknown.
There are two agents involved in apprenticeship learning:
1. An expert agent demonstrating an optimal policy π_E for some states.
2. An apprentice agent trying to learn a generalized policy π_A by observing the expert.
5. Problem of Policy Transfer
Problem: how to generalize the expert's policy to states that have not been encountered during the demonstration?
Previous works have attempted to solve this problem by representing the states as vectors of features, and classifying the states accordingly.
Inverse reinforcement learning algorithms learn a reward function from the demonstration of the expert policy, and use it to find a generalized policy [Abbeel & Ng, 2004].
These algorithms assume that the reward function can be expressed by considering only the features of the states.
However, the reward function may depend on the topology of the graph rather than the features of the states.
6. MDP Homomorphism [Ravindran, 2004]
A homomorphism from MDP M to MDP M′ is a surjective function f that maps every state in M to a state of M′ such that:
T′(f(s_t), a, s′_{t+1}) = Σ_{s_{t+1} ∈ S : f(s_{t+1}) = s′_{t+1}} T(s_t, a, s_{t+1})
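The condition above says that the image MDP's transition probability must equal the total probability that T assigns to the preimage of each target state. A direct check of this condition can be sketched as follows; the two small MDPs and the map f are invented for illustration.

```python
# A sketch checking the homomorphism condition above: for every state s,
# action a, and image state y, T'(f(s), a, y) must equal the summed
# probability of reaching any state that f maps to y.  The two MDPs and
# the map f below are illustrative assumptions, not from the poster.
S, S2 = [0, 1, 2, 3], ["x", "y"]
A = ["a"]
f = {0: "x", 1: "x", 2: "y", 3: "y"}            # surjective state map
T = {(0, "a", 2): 0.5, (0, "a", 3): 0.5, (1, "a", 2): 1.0,
     (2, "a", 2): 1.0, (3, "a", 3): 1.0}
T2 = {("x", "a", "y"): 1.0, ("y", "a", "y"): 1.0}

def is_homomorphism(S, S2, A, T, T2, f, tol=1e-9):
    for s in S:
        for a in A:
            for y in S2:
                lhs = T2.get((f[s], a, y), 0.0)
                rhs = sum(T.get((s, a, t), 0.0) for t in S if f[t] == y)
                if abs(lhs - rhs) > tol:
                    return False
    return True

assert is_homomorphism(S, S2, A, T, T2, f)
```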
Example: [figure: a four-state graph mapped onto a three-state graph] A vertex in the second graph is the image of the vertices in the first graph that have the same color.
7. Soft MDP Homomorphism [Sorg & Singh, 2009]
A soft homomorphism is a function f that maps every state in M to a distribution over the states of M′ such that:
Σ_{s′_t ∈ S′} f(s_t, s′_t) T′(s′_t, a, s′_{t+1}) = Σ_{s_{t+1} ∈ S} T(s_t, a, s_{t+1}) f(s_{t+1}, s′_{t+1})
Finding a soft homomorphism can be cast as a linear program.
Definition. Two states are locally similar if there is a soft homomorphism from the MDP defined by the neighbors (within a given distance d) of the first state to the MDP defined by the neighbors of the second.
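Given a candidate stochastic map f(s, s′), the soft-homomorphism condition above can be verified directly; note the poster instead *finds* f by solving a linear program, while this sketch only checks a given map on invented toy MDPs.

```python
# A sketch verifying the soft-homomorphism condition above for a given
# stochastic map f(s, s'): for every (s, a, s'_{t+1}), the probability of
# "map then transition in M'" must equal "transition in M then map".
# The tiny MDPs and the map f are illustrative assumptions.
S, S2, A = [0, 1], ["x"], ["a"]
T = {(0, "a", 0): 0.3, (0, "a", 1): 0.7, (1, "a", 1): 1.0}
T2 = {("x", "a", "x"): 1.0}
f = {(0, "x"): 1.0, (1, "x"): 1.0}    # every state maps fully onto "x"

def is_soft_homomorphism(S, S2, A, T, T2, f, tol=1e-9):
    for s in S:
        for a in A:
            for y in S2:
                lhs = sum(f[(s, x)] * T2.get((x, a, y), 0.0) for x in S2)
                rhs = sum(T.get((s, a, t), 0.0) * f[(t, y)] for t in S)
                if abs(lhs - rhs) > tol:
                    return False
    return True

assert is_soft_homomorphism(S, S2, A, T, T2, f)
```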
8. Racetrack Example
[figure: a demonstration of the expert policy on the racetrack, with examples of similar and dissimilar graphs]
There are two possible speeds in each direction of the vertical and horizontal axes, in addition to the zero speed on each axis.
Actions: accelerate or decelerate along each axis, or do nothing.
Actions succeed with probability 0.9 at low speeds and only 0.5 at high speeds.
The cost of going off-road is 5 and the reward for reaching the finish line is 200.
9. Algorithm
1. If the expert action for state S is known, return the expert action π_E(S).
2. Otherwise, initialize the neighboring distance k to 1.
3. If there is a neighbor locally similar to S with a known expert action, run a vote among the neighbors that are locally similar to S, and return the winning action as the policy.
4. Otherwise, set k ← k + 1 and go back to step 3.
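The steps above can be sketched as follows; here `locally_similar` stands in for the poster's soft-local-homomorphism test and is stubbed as a plain distance on integer states, purely for illustration.

```python
# A sketch of the transfer algorithm above.  `locally_similar` is an
# assumed stand-in for the graph-based soft-homomorphism test; the
# expert policy and states are invented toy data.
def locally_similar(s1, s2, k):
    return abs(s1 - s2) <= k          # stand-in for the real similarity test

def transfer_policy(s, expert_policy, states, max_k=10):
    if s in expert_policy:            # expert action known: return it
        return expert_policy[s]
    for k in range(1, max_k + 1):     # grow the neighboring distance
        voters = [expert_policy[n] for n in states
                  if n in expert_policy and locally_similar(s, n, k)]
        if voters:                    # vote among locally similar neighbors
            return max(set(voters), key=voters.count)
    return None                       # no locally similar labeled neighbor

expert = {0: "left", 1: "left", 5: "right", 6: "right"}
print(transfer_policy(4, expert, range(10)))   # prints "right"
```

With `s = 4`, the first round (`k = 1`) already finds the labeled neighbor 5, so the vote returns "right" without growing the neighborhood further.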
10. Results of the Racetrack Simulation
[figures: average reward per step, average number of steps, and average number of hit obstacles per step, each plotted against the number of trajectories in the demonstration (10 to 100), comparing Soft Local Homomorphism, Expert, and k-NN]
11. Conclusion and Future Work
Policy transfer by soft local homomorphisms is well-suited for problems where the rewards depend on the topology of the graph.
Using homomorphisms leads to a significant improvement in the quality of the policies learned by imitation.
This approach involves solving O(|S|²) linear programs, though the number of variables is bounded by the maximal distance.
There are no guarantees about the optimality of the solution.
As future work, we plan to use random walk kernels as a measure of similarity between graphs, and to establish theoretical guarantees about the optimality of the solution.
References
[Abbeel & Ng, 2004] Abbeel, P., & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. Proceedings of the Twenty-first International Conference on Machine Learning (ICML'04) (pp. 1–8).
[Ravindran, 2004] Ravindran, B. (2004). An Algebraic Approach to Abstraction in Reinforcement Learning. Doctoral dissertation, University of Massachusetts, Amherst, MA.
[Wolfe & Barto, 2006] Wolfe, A., & Barto, A. (2006). Decision Tree Methods for Finding Reusable MDP Homomorphisms. Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI'06) (pp. 530–535).
[Sorg & Singh, 2009] Sorg, J., & Singh, S. (2009). Transfer via Soft Homomorphisms. Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent Systems (AAMAS'09) (pp. 741–748).