Policy Transfer in Apprenticeship Learning
Abdeslam Boularias and Brahim Chaib-draa
Laval University, Canada
1. Overview
We cast the problem of Apprenticeship Learning (Imitation
Learning) as a classification problem.
We use a modified version of the k-nearest neighbors method.
The distance between two vertices is the distance between the
graphs defined around these vertices.
The distance between two graphs is the largest error of a homomorphism between the two graphs.
2. Markov Decision Process (MDP)
A Markov Decision Process (MDP) is defined by:
S: a finite set of states.
A: a finite set of actions.
T: a transition function, where T(s, a, s') is the probability of ending up in state s' after taking action a in state s.
R: a reward function, where R(s, a) is the immediate reward that the agent receives for executing action a in state s.
γ ∈ (0, 1]: a discount factor.
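As a concrete illustration, an MDP of this form can be written down directly as arrays. The tiny three-state, two-action chain below is a made-up example (not the racetrack domain from this poster), just to fix the layout of T, R, and γ.

```python
import numpy as np

# A made-up three-state, two-action MDP; T[a][s, s'] is the probability
# of ending up in state s' after taking action a in state s.
T = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.0, 0.9, 0.1],
     [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0],   # action 1
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]],
])
R = np.array([[0.0, 0.0],   # R[s, a]: immediate reward
              [0.0, 0.0],
              [1.0, 1.0]])
gamma = 0.95                # discount factor in (0, 1]

# Every row of every T[a] must be a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
```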
3. Policies
A policy π is a function that maps every state to a distribution over the actions:
π : S × A → [0, 1]
π(s, a) = Pr(a_t = a | s_t = s)
The value of a policy π is the expected discounted sum of the rewards that an agent receives by following this policy:
V(π) = E[ Σ_t γ^t R(s_t, a_t) | π ]
Solving an MDP consists of finding an optimal policy, i.e., one that maximizes V(π).
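For a finite MDP, V(π) can be computed exactly by solving a linear system. The sketch below assumes the array layout T[a][s, s'] and R[s, a]; the names are illustrative, not the authors' code.

```python
import numpy as np

def policy_value(T, R, gamma, pi):
    """Evaluate a stochastic policy pi (pi[s, a]) exactly by solving
    (I - gamma * P_pi) v = r_pi, where P_pi and r_pi are the transition
    matrix and reward vector induced by pi."""
    n_states = R.shape[0]
    P_pi = np.einsum('sa,ast->st', pi, T)  # P_pi[s, s'] = sum_a pi[s, a] T[a, s, s']
    r_pi = (pi * R).sum(axis=1)            # expected immediate reward in each state
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Two-state chain: state 0 yields reward 1 and moves to the absorbing
# state 1, which yields nothing.
T = np.array([[[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[1.0], [0.0]])
pi = np.array([[1.0], [1.0]])
v = policy_value(T, R, 0.5, pi)
```

Here v comes out as [1.0, 0.0]: state 0 collects its reward once and then sits in the worthless absorbing state.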
4. Apprenticeship Learning
Specifying a reward function by hand is not easy in most practical problems [Abbeel & Ng, 2004].
It is often easier to demonstrate examples of a desired behavior than to define a reward function.
In apprenticeship learning, we assume that the reward function
is unknown.
Two agents are involved in apprenticeship learning:
1. An expert agent demonstrating an optimal policy πE for some unknown reward function.
2. An apprentice agent trying to learn a generalized policy πA by observing the expert.
5. Problem of Policy Transfer
Problem: how can the expert's policy be generalized to states that have not been encountered during the demonstration?
Previous works have attempted to solve this problem by representing the states as vectors of features, and classifying the states accordingly.
Inverse reinforcement learning algorithms learn a reward function from the demonstration of the expert policy, and use it to find a generalized policy [Abbeel & Ng, 2004].
These algorithms assume that the reward function can be expressed by considering only the features of the states.
However, the reward function may depend on the topology of
the graph rather than the features of the states.
6. MDP Homomorphism [Ravindran, 2004]
A homomorphism from MDP M to MDP M' is a surjective function f that maps every state of M to a state of M' such that:
T'(f(s_t), a, s'_{t+1}) = Σ_{s_{t+1} : f(s_{t+1}) = s'_{t+1}} T(s_t, a, s_{t+1})
A vertex in the second graph is the image of the vertices in the first graph that have the same color.
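For intuition, the aggregation condition of an MDP homomorphism can be checked mechanically for a candidate map. The function below is a sketch under assumed array conventions (f[s] gives the image of state s); it is not code from [Ravindran, 2004].

```python
import numpy as np

def is_homomorphism(f, T, Tp, tol=1e-9):
    """Check that the surjection f satisfies the aggregation condition:
    Tp[a][f[s], s'] must equal the total probability that M moves from s
    to any state whose image is s'."""
    A, n, _ = T.shape
    m = Tp.shape[1]
    for a in range(A):
        for s in range(n):
            for sp in range(m):
                mass = sum(T[a][s, t] for t in range(n) if f[t] == sp)
                if abs(Tp[a][f[s], sp] - mass) > tol:
                    return False
    return True

# Four states collapsed to two by f; both halves behave identically,
# so f is a homomorphism onto the two-state "swap" MDP.
T = np.array([[[0.0, 0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0, 0.0],
               [0.5, 0.5, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.0]]])
Tp = np.array([[[0.0, 1.0],
                [1.0, 0.0]]])
f = [0, 0, 1, 1]
```

With this f, `is_homomorphism(f, T, Tp)` holds, while an asymmetric grouping such as [0, 0, 0, 1] breaks the condition.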
7. Soft MDP Homomorphism [Sorg & Singh, 2009]
A soft homomorphism is a function f that maps every state of M to a distribution over the states of M' such that:
Σ_{s'_t} f(s_t, s'_t) T'(s'_t, a, s'_{t+1}) = Σ_{s_{t+1}} T(s_t, a, s_{t+1}) f(s_{t+1}, s'_{t+1})
Finding a soft homomorphism can be cast as a linear program.
Definition. Two states are locally similar if there is a soft homomorphism from the MDP defined by the neighbors (within a given distance d) of the first state to the MDP defined by the neighbors of the second.
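The linear program can be sketched with off-the-shelf tools. The formulation below (minimize the largest violation ε of the soft-homomorphism condition over a row-stochastic map F, written in matrix form as F Tp[a] ≈ T[a] F) is our reading of the construction, not the authors' implementation, and uses scipy.

```python
import numpy as np
from scipy.optimize import linprog

def soft_homomorphism_error(T, Tp):
    """Smallest epsilon for which some row-stochastic matrix F of shape
    (n, m) satisfies |F @ Tp[a] - T[a] @ F| <= epsilon entrywise for
    every action a.  T has shape (A, n, n), Tp has shape (A, m, m).
    An epsilon near 0 means a soft homomorphism from M to M' exists."""
    A, n, _ = T.shape
    m = Tp.shape[1]
    nv = n * m + 1                        # variables: vec(F), then epsilon

    c = np.zeros(nv)
    c[-1] = 1.0                           # objective: minimize epsilon

    A_eq = np.zeros((n, nv))              # each row of F sums to 1
    for s in range(n):
        A_eq[s, s * m:(s + 1) * m] = 1.0
    b_eq = np.ones(n)

    rows = []                             # |(F Tp[a] - T[a] F)[s, t']| <= eps
    for a in range(A):
        for s in range(n):
            for tp in range(m):
                row = np.zeros(nv)
                for sp in range(m):
                    row[s * m + sp] += Tp[a][sp, tp]   # (F Tp[a])[s, t']
                for t in range(n):
                    row[t * m + tp] -= T[a][s, t]      # (T[a] F)[s, t']
                row[-1] = -1.0
                rows.append(row)                       #  diff - eps <= 0
                neg = -row
                neg[-1] = -1.0
                rows.append(neg)                       # -diff - eps <= 0
    res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * nv)
    return res.fun
```

As a sanity check, mapping an MDP to an exact copy of itself gives a (numerically) zero error, since F can be the identity.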
8. Racetrack Example
[Figures: a demonstration of the expert policy; examples of similar and dissimilar graphs.]
There are two possible speeds in each direction along the vertical and horizontal axes, in addition to zero speed on each axis.
Actions: accelerate or decelerate along each axis, or do nothing.
Actions succeed with probability 0.9 at low speeds and only 0.5 at high speeds.
The cost of going off-road is 5, and the reward for reaching the finish line is 200.
9. Algorithm
Given a state S:
1. If the expert action for state S is known, return the expert action πE(S).
2. Otherwise, initialize the distance k to 1.
3. If there is a neighbor of S within distance k that is locally similar to S and has a known expert action, run a vote among the neighbors that are locally similar to S and return the policy πE(S, A) ∝ votes(A).
4. Otherwise, increase k and repeat from step 3.
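In code, this procedure reads as a k-nearest-neighbors classifier whose metric is local similarity. The sketch below assumes hypothetical interfaces — `expert_action` (the demonstrated state-to-action map), `neighbors(s, d)`, and `locally_similar(s, t)` — none of which come from the poster, and it returns the majority action rather than the full vote distribution.

```python
from collections import Counter

def transfer_action(s, expert_action, neighbors, locally_similar, max_d=5):
    """Return the expert's action for s if it was demonstrated; otherwise
    vote among demonstrated, locally similar states at growing distances."""
    if s in expert_action:
        return expert_action[s]
    for d in range(1, max_d + 1):
        votes = Counter(expert_action[t]
                        for t in neighbors(s, d)
                        if t in expert_action and locally_similar(s, t))
        if votes:
            return votes.most_common(1)[0][0]
    return None  # no locally similar demonstrated state was found

# Toy line graph: states 0..3, with demonstrated actions for 0..2.
expert = {0: 'left', 1: 'left', 2: 'right'}
nbrs = lambda s, d: [t for t in range(4) if t != s and abs(t - s) <= d]
similar = lambda s, t: True   # stand-in for the soft-homomorphism test
```

State 3 was never demonstrated, so its action is borrowed from its nearest similar labeled neighbor, state 2.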
10. Results of the Racetrack Simulation
10 20 30 40 50 60 70 80 90 100
Average reward per step
Number of trajectories in the demonstration
Soft Local Homomorphism
10 20 30 40 50 60 70 80 90 100
Average number of steps
Number of trajectories in the demonstration
Soft Local Homomorphism
10 20 30 40 50 60 70 80 90 100
Average number of hitted obstacles per step
Number of trajectories in the demonstration
Soft Local Homomorphism
11. Conclusion and Future Work
Policy transfer by soft local homomorphisms is well-suited for problems where the rewards depend on the topology of the state graph.
Using homomorphisms leads to a significant improvement in
the quality of the policies learned by imitation.
This approach involves solving O(|S|²) linear programs, though the number of variables in each program is bounded by the size of the neighborhood within the maximal distance d.
There are no guarantees about the optimality of the solution.
As future work, we plan to use random walk kernels as a measure of similarity between graphs, and to establish theoretical guarantees on the optimality of the solution.
[Abbeel & Ng, 2004] Abbeel, P., & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. Proceedings of the Twenty-First International Conference on Machine Learning (ICML'04) (pp. 1–8).
[Ravindran, 2004] Ravindran, B. (2004). An Algebraic Approach to Abstraction in Reinforcement Learning. Doctoral dissertation, University of Massachusetts, Amherst, MA.
[Wolfe & Barto, 2006] Wolfe, A., & Barto, A. (2006). Decision Tree Methods for Finding Reusable MDP Homomorphisms. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI'06) (pp. 530–535).
[Sorg & Singh, 2009] Sorg, J., & Singh, S. (2009). Transfer via Soft Homomorphisms. Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent Systems (AAMAS'09) (pp. 741–748).