
Towards Knowledge Transfer in Deep Reinforcement Learning

Ruben Glatt, Felipe Leno da Silva, and Anna Helena Reali Costa*
Escola Politécnica da Universidade de São Paulo, Brazil
{ruben.glatt,f.leno,anna.reali}@usp.br
Abstract—Driven by recent developments in the area of Ar-
tificial Intelligence research, a promising new technology for
building intelligent agents has evolved. The technology is termed
Deep Reinforcement Learning (DRL) and combines the classic
field of Reinforcement Learning (RL) with the representational
power of modern Deep Learning approaches. DRL enables
solutions for difficult and high-dimensional tasks, such as Atari
game playing, for which previously proposed RL methods were
inadequate. However, these new solution approaches still take a
long time to learn how to actuate in such domains and have so far
mainly been researched for single-task scenarios. The ability to
generalize gathered knowledge and transfer it to another task has
been researched for classical RL, but remains an open problem
for the DRL domain. Consequently, in this article we evaluate
under which conditions the application of Transfer Learning
(TL) to the DRL domain improves the learning of a new task.
Our results indicate that TL can greatly accelerate DRL when
transferring knowledge from similar tasks, and that the similarity
between tasks plays a key role in the success or failure of
knowledge transfer.
I. INTRODUCTION
A technique that has been gaining a lot of publicity lately
is Deep Learning (DL), which has driven many of the recent
Machine Learning (ML) developments by profiting from the
comeback of neural networks (NN) in Artificial Intelligence
(AI) research. DL makes it possible to learn abstract representations of
high-dimensional input data and can be used to improve many
existing ML techniques [1]. DL architectures have in a short time
become one of the most powerful tools in AI, beating long-standing
records not just in one, but in many domains of
ML, for example in object recognition, hand-written digit
recognition, speech recognition, and recommender systems.
Another technique that can benefit from DL is the well-
researched area of Reinforcement Learning (RL) [2]. In RL, an
agent explores the space of possible strategies or actions in a
given environment, receives feedback (reward or cost) on the
outcome of the choices made, and deduces a behavior policy
from its observations. As shown in Figure 1, an agent can
interact with its environment by performing an action on each
discrete time-step. The environment then answers by updating
the current state to a follow-up state and by giving a reward
to the agent, indicating the value of performing an action in
*We are grateful for the support from CAPES, CNPq (grant 311608/2014-0),
and the São Paulo Research Foundation (FAPESP), grant 2015/16310-4.
The HPC resources for the computation are provided by the Superintendency
of Information at the University of São Paulo. We also thank Google for the
support due to the Google Research Awards 2015 for Latin-America.
a concrete state. By performing various actions in a trial-and-error
manner, a sequence of states s, actions a, follow-up states s'
and rewards r is generated and can be stored in episodes as
tuples ⟨s, a, s', r⟩. The goal of the agent is to determine a
policy that maps a state to an action and maximizes the
accumulated reward over the lifetime of the agent.
This form of decision-making can be modeled as a Markov
Decision Process (MDP) [4]. The core of the MDP is the
Markov Property (MP), which is given if future states of the
process depend only upon the present state and the action
we take in this state, but not on how this state was reached.
Formally, an MDP can be described as a tuple ⟨S, A, T, R⟩.
In this tuple, S is a finite set of all possible states s, A is a
finite set of all possible actions a, T is the transition function,
which gives the probability of reaching a follow-up state s'
given a state s and an action a, and R is the reward function,
which gives the reward r the agent receives when reaching
a follow-up state s' from state s after executing action a.
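To make this notation concrete, the following minimal Python sketch (a hypothetical toy example, not taken from the paper) encodes a tiny two-state MDP as the tuple ⟨S, A, T, R⟩ and collects experience tuples ⟨s, a, s', r⟩ by interacting with it:

import random

# Toy MDP <S, A, T, R> with two states and two actions (illustrative only).
S = ["s0", "s1"]
A = ["stay", "move"]
# T[(s, a)] is a list of (next_state, probability) pairs.
T = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "move"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 1.0)],
}
# R[(s, a, s')] is the reward for reaching s' from s after action a.
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0)
     for (s, a), outs in T.items() for s2, _ in outs}

def step(s, a):
    """Sample a follow-up state s' from T and return (s', r)."""
    states, probs = zip(*T[(s, a)])
    s2 = random.choices(states, weights=probs)[0]
    return s2, R[(s, a, s2)]

# Collect one short episode of experience tuples <s, a, s', r>.
episode, s = [], "s0"
for t in range(10):
    a = random.choice(A)          # placeholder policy: uniformly random actions
    s2, r = step(s, a)
    episode.append((s, a, s2, r))
    s = s2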
RL already achieves excellent results in a variety of domains
from the board-game domain [5] to autonomous helicopter
flight [6]. Recent work on combining the power of DL techniques
with RL (Deep Reinforcement Learning, DRL) has
led to more powerful intelligent agents, which are now able to
solve problems with high-dimensional input data, like images,
with reasonable computational effort, as demonstrated in the
Atari game playing domain by [7].
Fig. 1. Principal sketch of Reinforcement Learning in the Atari game playing
domain for single task learning [3].
Although this evolution led to excellent results for single
task learning, it still does not provide significant improvements
in multi task learning scenarios. Multi task learning can be
considered a challenge for Transfer Learning (TL), which
supports the ability to generalize gathered knowledge in one
or several source tasks and transfer it to other tasks, offering
the advantage of not having to learn every task from scratch
because it relies on abstractions of past experiences [8]. Taylor
and Stone [9] name three main developments for the latest
interest in the topic: (1) RL techniques have achieved very
notable successes and outperformed other ML techniques for
a variety of difficult tasks in agent theory, (2) other classical
ML techniques have matured enough to assist with TL, and
(3) the initial results show that this combination can be very
effective and has a positive effect on various aspects of RL.
They also conclude that there is considerable room for more
work in the area, a fact that is still true today.
Given those developments, our long-term ambition is to
find ways which enable successful transfer of knowledge in
DRL. In this article we present a novel empirical evaluation
to understand the outcomes from Transfer Learning in DRL
domains. In our experimental scenario, a new target task
is presented to an agent that has access to the knowledge
acquired in already solved source tasks, and the effects of the
previous knowledge on the learning process are evaluated. Our
experimental results indicate that the transfer of knowledge can
be either beneficial to the agent, neutral, or even hamper the
learning, depending on the similarity between the source and
target tasks.
The remainder of this article is organized as follows: In
Section II we outline the underlying principles of DRL and
related work. In Section III we describe the TL approach, its
challenges and examples from the literature. In Section IV
we introduce our proposal. In Section V we present our
experiments and discuss the results. The article closes with
Section VI, where we specify possible next research steps.
II. BACKGROUND AND RELATED WORK
The methods to solve RL problems can be divided into
three main groups: critic-only, actor-only, and actor-critic
methods. In this context we are concerned with critic-only
methods, which in general use a Temporal Difference (TD)
approach to solve the RL problem. TD methods are related
to Monte Carlo methods [10], because they rely on sampling
from the environment for the learning task, and to Dynamic
Programming [11], because they approximate future results
from past experiences. A popular example of TD methods is
the Q-Learning algorithm [12], where the optimal action-value
function Q*(s, a) is estimated by a function approximator.
An advantage of this method is that it is model-free, which
means that it can directly learn the policy without learning the
transition and reward functions of the MDP explicitly. Another
advantage is that it can be trained off-policy, following a
strategy that ensures adequate behaviour to balance exploration
and exploitation while exploring the state space. The function
approximator is formulated as a Bellman equation [13], which
was shown to converge to an optimal solution if the function
is updated iteratively indefinitely, Q_i → Q* as i → ∞. The
optimal Q-value reflects the expected sum of the immediate
reward r and the maximal discounted reward for future actions,
assuming optimal behaviour in the future,

Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') | s, a ].   (1)
The policy π describes a mapping from a state to an action,
and the optimal policy π* is then defined by choosing the
action a which maximizes Q*(s, a) for a given state s:

π*(s) = argmax_{a ∈ A(s)} Q*(s, a).   (2)
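As an illustration of how equations (1) and (2) are used in practice, the following sketch shows a minimal tabular Q-Learning loop with an ε-greedy behaviour strategy. It assumes the toy MDP (S, A, step) sketched in Section I, and the hyperparameters gamma, alpha and epsilon are illustrative values, not settings used in this paper:

import random
from collections import defaultdict

# Minimal tabular Q-Learning sketch (illustrative; assumes step(s, a), S and A from the toy MDP above).
gamma, alpha, epsilon = 0.99, 0.1, 0.1
Q = defaultdict(float)                       # Q[(s, a)], initialized to 0

def epsilon_greedy(s):
    # Off-policy behaviour strategy: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

for episode_idx in range(1000):
    s = "s0"
    for t in range(50):
        a = epsilon_greedy(s)
        s2, r = step(s, a)
        # Temporal Difference update towards r + gamma * max_a' Q(s', a')   (cf. Eq. 1)
        target = r + gamma * max(Q[(s2, a2)] for a2 in A)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Greedy policy pi*(s) = argmax_a Q(s, a)   (cf. Eq. 2)
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}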
Although it is not a new idea to use a NN as function
approximator for RL problems, as shown for example in
Fitted Q-Learning [14], advances in algorithms for DL have
brought upon a new wave of successful applications. The
break-through work for this new approach was first published
in 2013 as a workshop presentation and later refined in an
extensive journal article for the Atari game playing domain
[7]. The authors use a combination of convolutional and fully-
connected layers to approximate the possible Q-values for a
given state; this architecture is named Deep Q-Network (DQN). The results
demonstrate that a single architecture can successfully learn
control policies in a variety of different tasks without using
prior knowledge.
The algorithm is described in Algorithm 1, where the
states are sequences of images x and actions a,
s_t = x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t, and φ(s) is a preprocessing
function, which applies some steps to reduce the input dimensionality
and stacks the m (here 4) most recent frames to produce the
input for the DQN, so that the requirement of full observability
for MDPs is not violated. The Q-function is then described as
Q(φ, a; θ), where θ represents the weights of the network.
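A minimal sketch of such a preprocessing function φ is given below (our own illustration; the exact cropping and rescaling steps of [7] are not reproduced here). It converts each frame to grayscale, downsamples it, and stacks the m most recent frames:

from collections import deque
import numpy as np

class Preprocessor:
    """Illustrative phi(s): convert to grayscale, downsample, and stack the m latest frames."""
    def __init__(self, m=4):
        self.frames = deque(maxlen=m)

    def reset(self, frame):
        # Fill the stack with the first frame so the input shape is valid from the start.
        small = self._shrink(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(small)
        return np.stack(self.frames, axis=0)             # shape: (m, 105, 80)

    def __call__(self, frame):
        self.frames.append(self._shrink(frame))
        return np.stack(self.frames, axis=0)

    @staticmethod
    def _shrink(frame):
        gray = frame.mean(axis=2)                        # (210, 160, 3) -> (210, 160)
        return gray[::2, ::2].astype(np.float32) / 255.0 # naive 2x downsampling to (105, 80)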
The approach differs from standard Q-Learning mainly in
two ways to make it suitable for training large NNs without
diverging. The first difference is the use of experience replay,
a form of batch learning, where the agent stores its experience
at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a replay memory
D_t = {e_1, ..., e_t}. During training the agent then draws random
experiences off-policy from D_t to improve data efficiency,
break up correlations and eliminate unwanted feedback loops.
The second difference is the introduction of a target DQN
Q̂(φ, a; θ⁻), which is updated with the weights of the
training DQN Q(φ, a; θ) only every C time-steps. This adds
a delay between updating the parameters and the effect on the
trained network and makes the algorithm more stable against
oscillations and divergence.
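The two stabilization mechanisms can be sketched as follows (a hypothetical minimal implementation, not the authors' code): a bounded replay memory D from which random minibatches are drawn, and a periodic copy of the training weights θ into the target weights θ⁻ (the helper assumes PyTorch-style networks):

import random
from collections import deque

class ReplayMemory:
    """Bounded replay memory D_t = {e_1, ..., e_t} of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""
    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks up correlations between consecutive transitions.
        return random.sample(self.memory, batch_size)

def sync_target(train_net, target_net):
    # Every C time-steps the target DQN is overwritten with the training DQN's weights.
    target_net.load_state_dict(train_net.state_dict())   # assumes PyTorch nn.Module networks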
The weights of the network can be trained by optimizing
the loss function of the network

L_i(θ_i) = E[ ( r + γ max_{a'} Q̂(s', a'; θ_i⁻) − Q(s, a; θ_i) )² ].   (3)
Algorithm 1 Deep Q-Learning with experience replay
1: Initialize replay memory D
2: Initialize training action-value function Q(φ, a; θ) with random weights θ
3: Initialize target action-value function Q̂(φ, a; θ⁻) with weights θ⁻ = θ
4: repeat (for each episode)
5:   initialize s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
6:   repeat (for each time-step t in episode)
7:     Choose action a_t in φ_t following an ε-greedy strategy on Q(φ_t, a; θ)
8:     Observe new image x_{t+1} and reward r_t
9:     Set state s_{t+1} ← s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
10:    Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
11:    Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
12:    repeat (for each transition j in minibatch)
13:      if episode terminates at step j + 1 then
14:        set y_j ← r_j
15:      else
16:        set y_j ← r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)
17:      end if
18:      Perform gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
19:    until no more elements in minibatch
20:    Every C steps set Q̂ ← Q
21:    Set t ← t + 1 and s_t ← s_{t+1}
22:  until episode ends
23: until no more episodes
Differentiating the loss function with respect to the weights
then leads to the gradient, which will be used for the gradient
descent:

∇_{θ_i} L_i(θ_i) = E[ ( r + γ max_{a'} Q̂(s', a'; θ_i⁻) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ].   (4)
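The following sketch shows how the loss (3) and its gradient (4) could be computed with automatic differentiation. It assumes PyTorch, a small illustrative network architecture, and minibatches prepared as tensors from the replay memory sketched above; it is not the exact configuration used in [7] or in our experiments:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    """Small convolutional Q-network over m stacked 105x80 frames (illustrative architecture)."""
    def __init__(self, n_actions, m=4):
        super().__init__()
        self.conv1 = nn.Conv2d(m, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 11 * 8, 256)   # 32 x 11 x 8 feature map for 105x80 inputs
        self.out = nn.Linear(256, n_actions)     # one Q-value per action

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.out(x)

def dqn_loss(train_net, target_net, batch, gamma=0.99):
    """Loss L_i(theta_i) of Eq. (3); calling backward() on it yields the gradient of Eq. (4)."""
    s, a, r, s_next, done = batch                           # tensors built from sampled transitions
    q = train_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():                                   # target term uses the frozen theta^-
        q_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q, y)

# Typical update step: optimizer.zero_grad(); dqn_loss(...).backward(); optimizer.step()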
In this context the trained DQN can be seen as a kind of
end-to-end RL approach, where the agent can learn a policy
directly from its input data without having to find a suitable
representation manually first. This is especially interesting,
because it offers the opportunity to efficiently work with
high dimensional input data and opens up a lot of possible
applications, which were too expensive to solve before.
Other researchers already picked up the idea and started
to work with the DQN basic structure to improve results
and learning speed. The authors of [15] introduce an exten-
sion of the DQN by adding a Long Short Term Memory
(LSTM) in the form of an additional recurrent layer after
the convolution layers to improve performance on partially
observed states. Another extension of the DQN was proposed
by [16] to reduce overestimation of action values and improve
performance on a number of games by decoupling the action
selection from the action evaluation. Another publication in
this domain integrates an actor-critic structure with a Deep
Deterministic Policy Gradient technique and shows that it can
also learn competitive policies for low dimensional input data
[17]. Introducing Gorila (General Reinforcement Learning
Architecture), there have also been massive improvements
on the computational architecture of the DQN to allow for
distributed computation with parallel actors, shared experience
replay, and distributed NN, leading to better results and speed-
ups by an order of magnitude [18]. The role and importance
of the experience replay has been researched in [19], where
a framework for prioritizing experience is proposed, so as
to replay important transitions more frequently, and therefore
learn more efficiently.
III. TRANSFER LEARNING FOR DEEP Q-NETWORKS
In ML many approaches have achieved good solutions for
single task learning. A remaining problem is the generalization
of these approaches to allow faster and more efficient multi
task learning. The idea of using accumulated knowledge for
this kind of problem is inspired by the human learning ability,
which works in a similar way. TL emerged from the need to solve
this problem, and many researchers have focused on expanding
ML algorithms with the ability to transfer knowledge, experience
or skills to allow the learning of follow-up tasks with
very few examples. An example of a promising approach is
the introduction of an option framework to support knowledge
transfer, which provides methods for RL agents to build new
high-level skills [20]. Another method is proposed in [21],
which shows that building stochastic abstract policies that
generalize over past experiences is effective for transferring
knowledge represented as a stochastic abstract policy.
Despite many positive results, there are still some
challenges that need further research. One of those is how to
handle the problem of negative transfer, when the transferred
knowledge decreases the expected result instead of improving
it. Another one is the determination of when a task or a
domain is suitable for transfer and what causes this limitation.
Knowing about the similarity degree between two tasks could
also provide information about how much training would be
necessary to achieve acceptable results in a target task. A
final example of the open challenges is the ability to separate
data that is necessary for a transfer from data that would just
provide an irrelevant bias. With regard to a TL-for-RL setup, it is
not clear how to organize the knowledge transfer across tasks
with very different reward functions.
Although there has been much work in the DRL domain
lately, the focus seems to be heavily on single task learning. An
earlier work, unrelated to DQNs, argues that Deep Convolu-
tional NNs are particularly well suited for knowledge transfer
and envisions creating a net that could learn new concepts
throughout its lifetime [22]. One of the rare published works
concerned with bridging the gap between single and multi task
learning for DQNs introduces an Actor-Mimic method, which
trains a general Single Policy Network (SPN) for a variety of
distinct tasks using the guidance of several expert networks.
The SPN then generalizes well for new tasks, even without
expert guidance [23].
IV. PROPOSAL
Apart from the aforementioned work, a thorough literature
review has so far found little research that aims at
combining the advantages of knowledge transfer with DQNs,
although this represents a great opportunity to advance the state
of the art towards general-purpose AI agents. Since each of the
techniques involved provides successful methods to deal with
different aspects of learning in AI, but also has shortcomings
in certain areas, a combination of these approaches may be
able to solve harder problems and improve results on existing
ones.
We find that not much discussion has been dedicated to
analyzing under which situations TL is useful for DQNs and what
consequences arise for the learning agent when TL is blindly
applied. Our proposal in this article is to train a DQN in a
applied. Our proposal in this article is to train a DQN in a
source task and reuse the trained DQN for the initialization
of a new (target) task. The new task is then expected to be
learned faster because of the reused knowledge. However, as
discussed in Section III, Transfer Learning can hamper the
learning process if the tasks are not similar.
Note also that DQNs provide both state and policy abstrac-
tion, and it is not easy to separate these functionalities from
a trained network. Hence, when transferring a DQN from one
task to another we are in fact transferring both the state and
policy abstractions.
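In terms of implementation, this transfer amounts to initializing the target-task network with the weights learned in the source task instead of random weights. A minimal sketch, assuming PyTorch networks with identical architectures (as in our setting with all 18 outputs) and an illustrative checkpoint file name:

import copy
import torch

def initialize_from_source(target_task_net, source_checkpoint_path):
    """Initialize a target-task DQN with the weights of a DQN trained on a source task.

    Because the full network is copied, both the learned state abstraction (convolutional
    layers) and the policy abstraction (fully-connected layers) are transferred together.
    """
    source_weights = torch.load(source_checkpoint_path)   # e.g. "dqn_source_task.pt" (illustrative name)
    target_task_net.load_state_dict(source_weights)       # requires identical architectures
    return target_task_net

def make_target_network(train_net):
    # The target network Q_hat starts as a copy of the (transferred) training network.
    return copy.deepcopy(train_net)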
We experimentally investigate these assumptions under dif-
ferent scenarios to evaluate when knowledge transfer makes
sense and when the learner is hampered by its previous
knowledge. We also explore the role of similarity between
tasks for the transfer of knowledge in DQNs.
V. EXPERIMENTS
In the Atari game playing domain the agent controls the
actions while playing a selected game with the goal of
maximizing the game score. For each game the agent can
only use the actions that are available for the specific game,
which represent a subset of A = {'NOOP', 'FIRE', 'UP',
'RIGHT', 'LEFT', 'DOWN', 'UPRIGHT', 'UPLEFT', 'DOWNRIGHT',
'DOWNLEFT', 'UPFIRE', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE',
'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE'}.
To keep in line with the nomenclature in RL, we refer to one game
(which finishes after the loss of all lives) as one episode and to the
score per game step as the reward per game step. The state space
in such games is generally huge, which is the reason a DQN is used
as a function approximator instead of saving all different states in a Q-table.
Our experiments intend to evaluate the possible effects of
TL on DRL. We divided our experiments into two phases: (i)
DRL optimizer definition; and (ii) TL evaluation. While in
the former we evaluate different DRL training algorithms and
select the best one for the second phase, in the latter we
evaluate the performance of TL applied to DRL. We limited
our experiments to the games shown in Fig. 2, mainly to
provide a controlled scenario and ease the analysis of our
results.
Fig. 2. The games for which DQNs were trained: (a) Atlantis, (b) Boxing,
and (c) Breakout.
The first phase of our experiments intended to verify if
the choice of optimizer and the number of output nodes
would have an impact on the training of the DQN. Therefore
we compared two state-of-the-art optimizers: RMSProp [24] and
ADAM [25]. We selected the game Breakout as the benchmark
game for this phase, because we achieved the most stable
results with this game during earlier training runs. As in [7],
the DQN can be trained using only the actions that matter
for the specific game (6 for Breakout). However, another
option is to use all the 18 possible combinations of the Atari
controller. Therefore, we compared 100 epochs of learning
with RMSProp both with 6 outputs and 18 outputs, and ADAM
with 18 outputs.
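For reference, the two optimizer settings compared here could be configured as follows (a sketch assuming PyTorch and the illustrative DQN class from Section II; the hyperparameter values are illustrative defaults, not necessarily those used in our runs):

import torch

net = DQN(n_actions=18)   # 18 outputs cover the full Atari action set; 6 would suffice for Breakout

# RMSProp [24] as commonly configured for DQN training (illustrative hyperparameters).
rmsprop = torch.optim.RMSprop(net.parameters(), lr=2.5e-4, alpha=0.95, eps=0.01)

# ADAM [25] with its default moment estimates (illustrative learning rate).
adam = torch.optim.Adam(net.parameters(), lr=1e-4)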
The results are shown in Figure 3. Our graphs show the
rewards per epoch as a running average over 5 epochs. It
is immediately visible that all different configurations have
a similar behaviour. The distinction between the number of
output nodes does not have an impact on the results of the
game and also did not influence the total training time significantly.
The ADAM optimizer learns better at the very beginning
of the training, rising faster in average game score and
losing fewer lives early on. Later, for Breakout between epochs 10
and 15, RMSProp catches up and remains the algorithm with
the higher average score and fewer lost lives for the rest of
the 100 epochs. As the RMSProp optimizer achieved the overall
best results in average reward per episode, we chose it for
the second phase of our experiments. As the difference in
learning speed with different numbers of outputs is small, we
performed the training with all possible outputs in the following
experiments.
Fig. 3. Comparison between different settings for a DQN trained for the
game Breakout.
The second phase of our experiments simulates the following
situation: a (source) task is given to a learning agent
that has no background knowledge. After learning an effective
policy, a new (target) task is presented to the agent. Even
though it has no information about the new task, the previous
knowledge remains accessible to the agent, which has no reason
to neglect its knowledge base.
In order to evaluate the efficiency of TL in different
situations, we first trained a DQN for the Breakout game
from scratch, and then compared the learning results under the
following situations:
1) TL from similar tasks: TL is expected to present better
results when the source and target tasks are very similar.
In order to simulate this situation, we perform a new
training phase for Breakout, initializing the DQN with a
previously trained DQN also on Breakout.
2) TL from neutral tasks: While in Atlantis an optimal policy
can be achieved by only using the fire actions, they are
mostly useless in Breakout (but they do not lead to terrible
situations either). We here evaluate TL when the source
task is different from the target task, but the optimal
actuation in the source task is not expected to be much
worse than a random policy. After training a DQN for the
game Atlantis, we use it to train a new DQN for Breakout.
3) TL from different tasks: TL is reported to result in negative
transfer when applied carelessly. We here evaluate the
learning performance when starting with a very bad
policy. The game Boxing offers a very different game-
play than Breakout, and also has very different outcomes
when using the same actions. We here train a DQN in
Boxing and also use it to train a new DQN for Breakout.
Figure 4 shows the results for the first situation. The achieved
rewards per episode in the first epochs are much greater
when starting with the transferred DQN, while the number
of episodes per epoch has already decreased to a good level,
meaning that fewer lives are lost while playing the game.
These results show the potential of TL when transferring
knowledge across similar tasks. Figure 5 depicts the results
for the second situation. The number of episodes and the average
reward per episode are similar in both situations, which means
that the transferred knowledge did not help with the DQN
training but also did not lead to worse results than the random
initialization. Finally, Figure 6 presents the results for the
third situation. In this case it is clear that the learned DQN
suffers from negative transfer. The use of a very unfit policy
when starting the learning process greatly hampered the DQN
optimization, as shown by the achieved average reward
per episode, which is much worse compared to the
random initialization. The influence was so dominant that the
agent was unable to improve its performance until the end of
the 50th epoch.
Fig. 4. Comparison between random initialization of Breakout and initialization
with the best performing network for a previously trained DQN for Breakout.
Fig. 5. Comparison between random initialization of Breakout and initialization
with the best performing network for a previously trained DQN for Atlantis.
Fig. 6. Comparison between random initialization of Breakout and initialization
with the best performing network for a previously trained DQN for Boxing.
Our results show that the concern about negative transfer is
also valid for TL in DRL algorithms. Even though TL is very
beneficial to the learning process when the tasks are similar,
and neutral when the two domains are not very different, with
very dissimilar domains negative transfer forces the learner
to take a long time to overcome the initial bad
actuation. This also means that TL cannot be blindly applied
to DRL, and the similarity of the source and target tasks must
be evaluated before transferring knowledge.
VI. CONCLUSION AND FUTURE WORK
In this article, we evaluated the applicability of TL to DRL
through the evaluation of improvements in the learning process
when transferring knowledge from past tasks with different
degrees of similarity to the target task.
Our results show that the initialization of the DQN plays
a far more important role than the choice of the gradient
descent optimization algorithm of the network. The
results also reinforce the importance of being able to start
the learning of a new task with experiences from previously
learned tasks. When transferring knowledge from similar tasks,
TL achieved a greatly accelerated learning process, reaching
results closer to the optimal actuation from the beginning of
the training. However, when applied to unrelated tasks,
negative transfer makes the training performance much worse
than with random initialization and very difficult to recover from.
These outcomes show that many of the concerns raised
by researchers when applying TL in classical RL are also
valid for DRL. Being able to find which of the previously
learned tasks are similar to the target task (and possibly
defining the degree of similarity) is directly correlated with the
success or failure when applying TL.
A promising idea, proposed in [26], is Policy
Reuse to leverage past knowledge. When facing a new task,
the agent finds similar tasks in a policy library and uses them
to accelerate learning, or extends the library. In the domain of
Atari game playing, such a library could lead to a collection
of dedicated core policies for different genres of games, such
as jump-and-run, platform, or racing games.
However, defining a "similarity degree" between tasks is
not a trivial undertaking. While we defined the similarity of
the learned tasks here through subjective impressions, this
is not an appropriate procedure for the general case,
as we do not have a complete understanding of how the
neural networks generalize tasks and whether they can find counter-intuitive
similarities between tasks. An objective similarity
determination is another challenging and unresolved research
task in itself.
Another important research question that is still open is the
definition of the best way to generalize and transfer learning
across tasks. The concept of skill transfer [27] (also referred to
as options, macro-actions or compact policies) is promising for
DRL. These skills are generalizations of extended sequences
of actions to achieve a sub-task and are defined by an option
policy, an initiation set and a termination condition. However,
extracting partial policies from DQNs is still an open problem.
Transferred to the Atari game playing domain, skills could
provide a way to solve more abstract sub-tasks like destroying
an enemy, walking through a door or dodging bullets, or any
task that occurs repeatedly in a variety of games.
In conclusion, TL has shown a great potential to accelerate
learning in DRL tasks, but there are still many aspects to
be understood before we can formulate a comprehensive
framework for knowledge reuse across DQNs.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[2] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning.
Cambridge, MA, USA: MIT Press, 1998.
[3] D. Silver, “Lecture Notes in advanced topics in Machine Learning:
COMPGI13 (Reinforcement Learning),” 2015, University College London,
Computer Science Department.
[4] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dy-
namic Programming. Hoboken, NJ, USA: John Wiley & Sons, 2014.
[5] G. Tesauro, “Temporal Difference Learning and TD-Gammon,” Communications
of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
[6] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger,
and E. Liang, “Autonomous inverted helicopter flight via Reinforcement
Learning,” in Experimental Robotics IX. Springer, 2006, pp. 363–372.
[7] V. Mnih, D. Silver, A. A. Rusu, M. Riedmiller et al., “Human-level
control through Deep Reinforcement Learning,” Nature, vol. 518, no.
7540, pp. 529–533, 2015.
[8] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE
Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp.
1345–1359, 2010.
[9] M. E. Taylor and P. Stone, “Transfer Learning for Reinforcement
Learning domains: A Survey,” Journal of Machine Learning Research
(JMLR), vol. 10, pp. 1633–1685, 2009.
[10] W. K. Hastings, “Monte Carlo Sampling methods using Markov Chains
and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, 1970.
[11] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena
Scientific Belmont, MA, 1995, vol. 1, no. 2.
[12] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8,
no. 3-4, pp. 279–292, 1992.
[13] R. Bellman, “On the theory of Dynamic Programming,” Proc. National
Academy of Sciences, vol. 38, no. 8, pp. 716–719, 1952.
[14] M. Riedmiller, “Neural Fitted Q Iteration – first experiences with a
data efficient Neural Reinforcement Learning method,” in Proc. 16th
European Conference on Machine Learning (ECML), 2005, pp. 317–
328.
[15] M. Hausknecht and P. Stone, “Deep Recurrent Q-learning for Partially
Observable MDPs,” in 2015 AAAI Fall Symposium Series, 2015.
[16] H. van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning
with Double Q-learning,” arXiv preprint arXiv:1509.06461, 2015.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Sil-
ver, and D. Wierstra, “Continuous control with Deep Reinforcement
Learning,” arXiv preprint arXiv:1509.02971, 2015.
[18] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria,
V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen et al.,
“Massively parallel methods for Deep Reinforcement Learning,” arXiv
preprint arXiv:1507.04296, 2015.
[19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” arXiv preprint arXiv:1511.05952, 2015.
[20] G. Konidaris, I. Scheidwasser, and A. G. Barto, “Transfer in Rein-
forcement Learning via shared features,” Journal of Machine Learning
Research (JMLR), vol. 13, no. 1, pp. 1333–1371, 2012.
[21] M. L. Koga, V. Freire, and A. H. Costa, “Stochastic Abstract Policies:
Generalizing knowledge to improve Reinforcement Learning,” IEEE
Transactions on Cybernetics, vol. 45, no. 1, pp. 77–88, 2015.
[22] S. Gutstein, O. Fuentes, and E. Freudenthal, “Knowledge transfer in
Deep Convolutional Neural Nets,” International Journal on Artificial
Intelligence Tools (IJAIT), vol. 17, no. 03, pp. 555–567, 2008.
[23] E. Parisotto, L. J. Ba, and R. Salakhutdinov, “Actor-Mimic: Deep
Multitask and Transfer Reinforcement Learning,” Computing Research
Repository (CoRR), vol. abs/1511.06342, 2015. [Online]. Available:
http://arxiv.org/abs/1511.06342
[24] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio, “RMSProp and
equilibrated adaptive learning rates for non-convex optimization,” arXiv
preprint arXiv:1502.04390, 2015.
[25] D. Kingma and J. Ba, “ADAM: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[26] F. Fernández and M. Veloso, “Probabilistic Policy Reuse in a Reinforcement
Learning agent,” in Proc. 5th Autonomous Agents and Multiagent
Systems (AAMAS-06), 2006, pp. 720–727.
[27] G. Konidaris and A. G. Barto, “Building Portable Options: Skill Transfer
in Reinforcement Learning,” in Proc. 20th International Joint Confer-
ence on Artificial Intelligence (IJCAI), vol. 7, 2007, pp. 895–900.
Article
Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN) exhibits similar performance on standard Atari 2600 MDPs but better performance on equivalent partially observed domains featuring flickering game screens. Results indicate that given the same length of history, recurrency allows partial information to be integrated through time and is superior to alternatives such as stacking a history of frames in the network's input layer. We additionally show that when trained with partial observations, DRQN's performance at evaluation time scales as a function of observability. Similarly, when trained with full observations and evaluated with partial observations, DRQN's performance degrades more gracefully than that of DQN. We therefore conclude that when dealing with partially observed domains, the use of recurrency confers tangible benefits.