Paralleled Multi-Environments Shaping Algorithm for Complex
Multi-step Task
Cong Maa,d,∗, Zhizhong Lib,∗, Dahua Linc, Jiangshe Zhanga,∗∗
aSchool of Mathematics and Statistics, Xi’an Jiaotong University, No.28, Xianning West Road, Xi’an, Shaanxi,
P.R. China
bSenseTime, No 12, Science Park East Avenue, HKSTP, Shatin, Hong Kong, P.R. China
cDepartment of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, P.R. China
dDepartment of Mathematics, KU Leuven, Celestijnenlaan 200B, Leuven 3001, Belgium
Abstract
Because of sparse rewards and the sequential nature of the task, complex multi-step tasks are a major challenge in reinforcement learning: the agent must carry out several consecutive steps to complete the whole task without any intermediate reward. Reward shaping and curriculum learning are often used to address this challenge, but reward shaping is prone to sub-optimal policies and curriculum learning easily suffers from catastrophic forgetting. In this paper, we propose a novel algorithm called the Paralleled Multi-Environment Shaping (PMES) algorithm, in which several sub-environments are built from human knowledge to make the agent aware of the importance of intermediate steps, each sub-environment corresponding to one key intermediate step. Specifically, the learning agent is trained under these paralleled environments, including the original environment and several sub-environments, by the synchronous advantage actor-critic algorithm, and PMES adjusts the reward function through an adaptive reward shaping mechanism. In this way, the PMES algorithm incorporates human experience through multiple different environments rather than only shaping the reward function; it combines the benefits of reward shaping and curriculum learning while avoiding their drawbacks. Extensive experiments on the mini-game ‘Build Marines’ of the StarCraft II environment show that our algorithm is more effective than reward shaping, curriculum learning, and the PLAID algorithm, and comes close to the level of a human Grandmaster. Compared with the best existing algorithm, it requires less time and less computing resource to reach a good result.
Keywords: Reinforcement Learning, Multi-step task, Paralleled Multiple Environments, Adaptive
Reward Shaping
∗The two authors contributed equally
∗∗Corresponding author
Email address: jszhang@xjtu.edu.cn (Jiangshe Zhang)
Preprint submitted to Journal of Neurocomputing December 11, 2019
1. Introduction
In recent years, reinforcement learning (RL), and especially deep reinforcement learning (DRL), which combines RL with deep learning, has received widespread attention in many fields such as robot control [1, 2], autonomous driving [3, 4], game playing [5, 6], and optimal control problems [7, 8]. In RL, the agent learns and improves its performance through constant trial and error, and finds the optimal behaviour by maximizing the return function. However, due to the huge state space, the huge action space, and the sparse reward function, instructing the agent to accomplish a complex multi-step task the way a human does has long been a major challenge.
In a multi-step task, it is difficult to make the agent aware of the importance of the intermediate steps in the absence of corresponding intermediate rewards, while such tasks are very easy for human beings thanks to prior knowledge or experience. Thus, how to transfer knowledge into the algorithm efficiently is very important, and various approaches have been proposed for this problem. Reward shaping algorithms [9–14] manipulate the reward signals from the environment to make the agent learn faster. Ng et al. [9] proved that potential-based shaping preserves the optimal policy, and Devlin [15] further generalized potential-based shaping to be time-dependent, called dynamic shaping. RUDDER [16] was proposed to solve the problem of delayed reward by decomposing the original reward into a new MDP that is return-equivalent. But shaping the reward function properly is very difficult, and the agent is prone to getting stuck at a sub-optimal policy [17]. Curriculum learning algorithms [18–23] can also be used to solve multi-step tasks. They divide the whole task into several increasingly difficult sub-tasks in series and train the agent in a meaningful order, because solving simple tasks first helps the agent solve more complex tasks later. As a result, the training complexity of the original task is reduced by the knowledge gained from the previous auxiliary tasks. Narvekar [18] proposed designing a sequence of source tasks for the agent to improve the performance and the learning speed of the RL algorithm. Svetlik [20] further presented a method to generate the curriculum automatically based on task descriptors. However, this line of work is still in its infancy and the application of these methods is quite restricted. For example, reverse curriculum learning [19] requires that the environment can be reversely evolved back to ‘earlier’ states, which is not feasible for some environments. Moreover, curriculum learning can suffer from the catastrophic forgetting problem, i.e., knowledge learned previously might be erased while new knowledge is being learned. This problem can be addressed by distillation, which creates one single controller that implements the tasks of a set of experts. For instance, Parisotto [24] proposed adopting a supervised regression objective to learn all expert policies, called distillation. Further, Berseth [25] proposed the PLAID algorithm, which combines transfer learning and policy distillation to learn multiple skills. But it is very difficult to find the best policy based on a mixture of experts. Thus, studying a novel RL algorithm for complex multi-step tasks is particularly important.
In RL there is a large body of research on multi-task learning [26–33], whose goal is to accomplish all tasks together rather than focusing on one of them. However, multi-step task learning is very different from multi-task learning: it puts more emphasis on the sequential relationship between the multiple tasks, that is, the sub-tasks must be implemented in order. Moreover, the reward is very sparse, as the agent gets a reward only after completing the whole task. Thus, we divide the whole task into several sub-tasks according to the steps and train them simultaneously. In previous work there are several paralleled RL algorithms [34–36], in which the agent interacts with multiple copies of a single environment simultaneously to speed up training, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm [34] and the Advantage Actor-Critic (A2C) algorithm. Inspired by them, we can train the agent under multiple different environments to learn different tasks simultaneously, as long as their state spaces and action spaces are exactly the same.
In this paper, we propose a novel algorithm, named the Paralleled Multi-Environment Shaping (PMES) algorithm, to solve these complex multi-step tasks. Considering the lack of rewards for intermediate steps, we introduce the corresponding rewards by building several sub-environments, each of which corresponds to one key intermediate step. Meanwhile, building on the paralleled RL algorithms, we train the agent under multiple paralleled, different environments, including the original environment and several sub-environments. In this way, the agent can assign these rewards automatically and achieve the final target.
In recent years, several classical complex environments for testing RL algorithms have emerged, notably the SC2LE environment [37–42]. It is an environment for the real-time strategy (RTS) game StarCraft II, released by DeepMind and Blizzard [43], and contains seven mini-games aimed at different elements of playing StarCraft II. One of the mini-games, Build Marines, is much more difficult than the others because it involves multiple decisions that each depend on the preceding steps. Thus, in this paper, we use this mini-game as the platform to evaluate the performance of the PMES algorithm. The experimental results on StarCraft II demonstrate that the proposed algorithm yields good performance and obtains a higher score than the other algorithms with faster convergence.
The rest of the paper is organized as follows. Section 2 introduces preliminary concepts and sets up notation. Section 3 presents the PMES algorithm. Section 4 compares PMES with other algorithms qualitatively and quantitatively. Section 5 concludes the paper.
2. Preliminaries
A reinforcement learning task is often modeled as a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, S_0, \mathcal{A}, T, \gamma, R)$, where $\mathcal{S}$ is the state space, $S_0$ is the distribution of initial states, $\mathcal{A}$ is the action space, $T = \{P_{sa}(\cdot) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ is the set of transition probabilities, $\gamma \in (0, 1]$ is the discount factor, and $R(s, a, s')$ is a stochastic reward for moving from state $s$ to state $s'$ by taking action $a$. During training, at any time $t$ the agent interacts with the environment by choosing an action $a_t$, receives the reward $R_t$, and then transits to the next state $s_{t+1}$. Actions are sampled according to its current policy $\pi: \mathcal{S} \to P(\mathcal{A})$, where $P(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$. The discounted return is defined as

$$G_t = \sum_{i=t}^{T-1} \gamma^{i-t} R_i, \qquad (1)$$

and the goal is to maximize the expected return from the start distribution $J = \mathbb{E}_{s_0 \sim S_0}[G_0 \mid s_0]$. The action-value function is the expected return of taking action $a_t$ in state $s_t$,

$$Q^{\pi}(s_t, a_t) := \mathbb{E}[G_t \mid s_t, a_t], \qquad (2)$$

and the optimal policy satisfies the optimal Bellman equation,

$$Q^*(s, a) = \mathbb{E}_{s' \sim P_{sa}}\Big[ R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \Big]. \qquad (3)$$

In policy-based model-free methods, the policy $\pi_\theta$ is parameterized by $\theta$ and is optimized by gradient ascent, where the gradient can be estimated by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \Big]. \qquad (4)$$

Here $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_T, a_T\}$ is a trajectory sampled by following the policy. The above gradient estimate suffers from high variance, which can be reduced by replacing $G_t$ in Equation (4) with the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, where

$$V(s_t) = \mathbb{E}_{a_t}[Q(s_t, a_t)] \qquad (5)$$

is the value function of state $s_t$. An estimate of the advantage function is given by

$$\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_\omega(s_{t+k}) - V_\omega(s_t), \qquad (6)$$

where $V_\omega$ is the value network with parameter $\omega$, and $t + k$ is bounded by $T$. So

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \Big]. \qquad (7)$$

In the A2C algorithm, the value network shares its parameters with the policy network. Besides, the entropy $H(\pi_\theta(s_t))$ of the policy $\pi$ is added to the objective function to encourage exploration. The objective function of the A2C algorithm for some sampled trajectories $\tau$ is

$$\mathbb{E}_{\tau \sim \pi(\tau)}\Big[ \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) + c_1 \big(R_t - V_\theta(s_t)\big)^2 + c_2 H(\pi_\theta(s_t)) \Big], \qquad (8)$$

where $c_1, c_2$ are the coefficients controlling the value loss and the entropy bonus, respectively, and $R_t$ is the expected cumulative discounted reward at time $t$.
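To make Equations (6) and (8) concrete, the following minimal NumPy sketch computes the bootstrapped advantage estimates and the three terms of the A2C objective for one sampled rollout. It is only an illustration of the formulas above, not the actual SC2LE agent; the arrays `rewards`, `values`, `log_probs`, and `entropies`, and the bootstrap value `v_last`, are assumed to come from a rollout of the shared policy/value network.

```python
import numpy as np

def a2c_objective_terms(rewards, values, v_last, log_probs, entropies,
                        gamma=0.99, c1=0.25, c2=9e-5):
    """Illustrative computation of Eq. (6) and Eq. (8) for one rollout.

    rewards, values, log_probs, entropies: 1-D arrays of length k
    v_last: bootstrap value V(s_{t+k}) for the state after the rollout
    """
    k = len(rewards)
    # Bootstrapped discounted returns: sum of discounted rewards plus
    # the discounted value of the state that follows the rollout.
    returns = np.zeros(k)
    running = v_last
    for t in reversed(range(k)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Advantage estimate of Eq. (6): returns minus the value baseline.
    advantages = returns - values
    # The three terms of the A2C objective in Eq. (8).
    policy_term = np.mean(log_probs * advantages)
    value_term = c1 * np.mean((returns - values) ** 2)
    entropy_term = c2 * np.mean(entropies)
    return policy_term, value_term, entropy_term
```

In a typical implementation the value error is minimized while the policy and entropy terms are maximized; the sketch only evaluates the individual terms of Equation (8).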
[Figure 1 diagram: the mini-map and screen-map inputs are processed by convolutional layers (conv1, conv2) and the non-spatial input by a fully connected layer; shared fully connected features (fc1 + fc2) produce the value estimate, a spatial policy (via conv3), and a non-spatial policy, from which the action is sampled. The agent exchanges states and actions with the main and auxiliary environments.]
Figure 1: Architecture of PMES for StarCraft II. The agent interacts with multiple environments, including main environments and auxiliary environments. The overall structure is based on A2C. The input of the network contains spatial and
non-spatial features. Spatial features include screen-maps and mini-maps which abstract away from the RGB images of the
game when playing. Non-spatial features contain information such as the number of minerals collected. The output actions
consist of game commands and their spatial coordinates.
3. Paralleled Multi-Environment Shaping Algorithm
The core of the Paralleled Multi-Environment Shaping (PMES) algorithm is building auxiliary environments to assist the agent in learning the original main environment. For simplicity, we assume the main task and the auxiliary tasks only differ in their initial states and reward functions. Specifically, we denote the original task in the main environment as an MDP

$$\mathcal{M}^{(0)} = \big(\mathcal{S}, S_0^{(0)}, \mathcal{A}, T, \gamma, R^{(0)}\big), \qquad (9)$$

and the $N$ kinds of auxiliary tasks as

$$\mathcal{M}^{(n)} = \big(\mathcal{S}, S_0^{(n)}, \mathcal{A}, T, \gamma, R^{(n)}\big), \quad n = 1, \ldots, N, \qquad (10)$$

where $\mathcal{S}$ is the state space, and $S_0^{(0)}$ and $S_0^{(n)}$ are the initial state distributions of the main environment and the $n$-th auxiliary environment, respectively. Because we use the synchronous, batched deep RL algorithm A2C for training, a single agent interacts with all environments in parallel. From the perspective of the learning agent, the overall mixed environment can start at different initial states and also provides different reward signals. Strictly speaking, the process is a partially observed MDP (POMDP) because it is unknown which exact environment the agent is currently interacting with. However, the purpose of a POMDP is to find the optimal actions for each possible belief over the world states, which deviates from our ultimate goal of optimizing the main task.
The overall architecture of PMES is shown in Figure 1. It follows the settings of the A2C agent in [43, 44]. In the implementation, the policy network and the value network share parameters. One important aspect of the PMES architecture is that a single learning agent interacts with multiple paralleled environments simultaneously. We exploit this feature to enable the adaptive shaping mechanism through the sub-environments.
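The following sketch shows one way this mixed-environment interaction could be organized: one agent steps a batch of environments, some of which are copies of the main environment and some of which are auxiliary environments, and the batched transitions feed a single synchronous A2C update. It is a simplified illustration under the assumption of a gym-style `reset`/`step` interface; the names `agent` and `collect_mixed_rollout` are placeholders, not part of PySC2 or of the authors' released code.

```python
def collect_mixed_rollout(agent, envs, rollout_len=16):
    """Step one agent through a mix of main and auxiliary environments.

    `envs` is a list of environment objects (e.g. several copies of the main
    environment plus copies of SubEnv-A/B/C).  Each environment is assumed
    to expose reset() -> state and step(action) -> (state, reward, done).
    All environments share the same state and action spaces, so the agent
    cannot tell which environment produced a given transition.
    """
    states = [env.reset() for env in envs]
    batch = []  # (state, action, reward, next_state, done) tuples
    for _ in range(rollout_len):
        actions = [agent.act(s) for s in states]          # one shared policy
        next_states = []
        for env, s, a in zip(envs, states, actions):
            s2, r, done = env.step(a)
            batch.append((s, a, r, s2, done))
            next_states.append(env.reset() if done else s2)
        states = next_states
    return batch  # passed to a synchronous A2C update, as in Equation (8)
```

Because every transition enters the same update, the rewards perceived by the agent are effectively the mixture described by Equation (13) below.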
3.1. A Special Case
Suppose the count of each type of environment is $d^{(n)}$, $n = 0, \ldots, N$, where index $0$ is reserved for the main environment. Then the total number of environments is $\sum_n d^{(n)}$, and the fraction of interactions the agent has with each type of environment is $d^{(n)} / \sum_n d^{(n)}$. Further, assume that the current policy of the agent is $\pi$, and denote the probability of visiting state $s$ in environment $\mathcal{M}^{(n)}$ as $p_\pi^{(n)}(s)$. Then state $s$ is accessed from the different environments with probabilities $\rho_\pi^{(n)}(s)$, where

$$\rho_\pi^{(n)}(s) \propto \frac{d^{(n)}}{\sum_n d^{(n)}} \cdot p_\pi^{(n)}(s), \quad n = 0, \ldots, N. \qquad (11)$$

Therefore, the distribution over source environments for a tuple $(s, a, s')$ can be obtained from

$$\rho_\pi^{(n)}(s, a, s') \propto \rho_\pi^{(n)}(s) \cdot \pi(a \mid s) \cdot P_{sa}(s'). \qquad (12)$$

Since the policy $\pi$ and the transition probability $P_{sa}$ are the same in all environments, we have $\rho_\pi^{(n)}(s, a, s') = \rho_\pi^{(n)}(s)$. Thus the overall reward function $R_\pi$ (which depends on the policy $\pi$) is

$$R_\pi(s, a, s') = \sum_n \rho_\pi^{(n)}(s, a, s')\, R^{(n)}(s, a, s') = \sum_n \rho_\pi^{(n)}(s)\, R^{(n)}(s, a, s'). \qquad (13)$$

In the special case of ergodic MDPs and potential-based shaping rewards [9], the optimal policy is preserved.
Proposition 1. Suppose the state space $\mathcal{S}$ and the action space $\mathcal{A}$ of the main MDP $\mathcal{M}^{(0)}$ in Equation (9) are finite, and the auxiliary MDPs are constructed as in Equation (10). If $\mathcal{M}^{(0)}$ is ergodic and the auxiliary rewards are potential-based shaping rewards

$$R^{(n)}(s, a, s') = R^{(0)}(s, a, s') + \gamma \phi^{(n)}(s') - \phi^{(n)}(s), \qquad (14)$$

where $n = 1, \ldots, N$ and $\phi^{(n)}$ is the potential function for the $n$-th environment, then the optimal policy $\pi^*$ of the PMES algorithm is also an optimal policy of the original main environment $\mathcal{M}^{(0)}$, i.e., PMES has policy invariance.
Proof. Our proof follows the practice of the reward shaping literature [9, 11], with the necessary modifications. For notational simplicity, we set $\phi^{(0)} \equiv 0$. For the optimal policy $\pi^*$, the corresponding Q-function $Q^*$ would satisfy the optimal Bellman equation

$$\begin{aligned}
Q^*(s, a) &= \mathbb{E}_{s' \sim P_{sa}}\Big[ R_{\pi^*}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \Big] \\
&= \mathbb{E}\Big[ \sum_n \rho_{\pi^*}^{(n)}(s)\, R^{(n)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \Big] \\
&= \mathbb{E}\Big[ R^{(0)}(s, a, s') + \sum_n \rho_{\pi^*}^{(n)}(s) \big( \gamma \phi^{(n)}(s') - \phi^{(n)}(s) \big) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \Big]. \qquad (15)
\end{aligned}$$

One property of ergodic MDPs is that they have unique stationary state distributions. Since the auxiliary MDPs only differ in initial states and reward functions, their induced Markov random processes are the same, and thus have the same stationary distribution as the main environment, i.e.,

$$p_\pi^{(n)}(s) = p_\pi^{(0)}(s), \quad n = 1, \ldots, N, \ \forall s \in \mathcal{S}. \qquad (16)$$

Together with Equation (11), we have

$$\rho_\pi^{(n)}(s) = \rho_\pi^{(n)}(s'), \quad n = 0, \ldots, N, \ \forall s, s' \in \mathcal{S}. \qquad (17)$$

Using the above equation, we can rearrange Equation (15) as

$$Q^*(s, a) + \sum_n \rho_{\pi^*}^{(n)}(s)\, \phi^{(n)}(s) = \mathbb{E}_{s' \sim P_{sa}}\Big[ R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') + \gamma \sum_n \rho_{\pi^*}^{(n)}(s')\, \phi^{(n)}(s') \Big]. \qquad (18)$$

Set

$$\hat{Q}^*(s, a) := Q^*(s, a) + \sum_n \rho_{\pi^*}^{(n)}(s)\, \phi^{(n)}(s), \qquad (19)$$

then we have the optimal Bellman equation for the original MDP $\mathcal{M}^{(0)}$,

$$\hat{Q}^*(s, a) = \mathbb{E}_{s' \sim P_{sa}}\Big[ R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} \hat{Q}^*(s', a') \Big]. \qquad (20)$$

The optimal policy $\pi^*$ satisfies

$$\pi^*(s) \in \operatorname*{argmax}_{a \in \mathcal{A}} Q^*(s, a) = \operatorname*{argmax}_{a \in \mathcal{A}} \Big[ Q^*(s, a) + \sum_n \rho_{\pi^*}^{(n)}(s)\, \phi^{(n)}(s) \Big] = \operatorname*{argmax}_{a \in \mathcal{A}} \hat{Q}^*(s, a). \qquad (21)$$

So the optimal policy for the paralleled, mixed environments is also an optimal policy for the original main environment. The opposite direction, that every optimal policy of the main environment is also optimal for the mixed environment, can be argued by reversing the above discussion.
From the proposition, under the specified conditions the optimal solution does not depend on the distribution of initial states or on the shaping rewards. However, these two factors do have an impact on learning efficiency. At the starting phase of the training process, the agent is able to explore multiple critical parts of the MDP simultaneously thanks to the diverse arrangement of initial states. Thus, the auxiliary environments provide a curriculum, where each one starts at and focuses on a different phase of the original problem at the early training stage. Different from traditional shaping methods, the PMES algorithm provides an autonomous, adaptive way of shaping. The perceived reward function is shown in Equation (13). In the early exploration stage, unlike the equilibrium state of Equation (17), the ratios $\rho_\pi^{(n)}(s)$ for different states $s$ have not converged and might differ. For example, for a state $s$ near the initial state of the first auxiliary environment $\mathcal{M}^{(1)}$, the ratio $\rho_\pi^{(1)}(s)$ might be close to 1, and the first auxiliary environment contributes most of the visits to state $s$. Then the agent's current perceived reward for the tuple $(s, a, s')$, for action $a$ and successor state $s'$, is influenced mostly by the reward function $R^{(1)}(s, a, s')$ of auxiliary environment $\mathcal{M}^{(1)}$. As training progresses, the main environment reaches state $s$ more frequently, and the perceived reward for $(s, a, s')$ becomes influenced more by the original reward structure of the main environment $\mathcal{M}^{(0)}$ than by the auxiliary environment $\mathcal{M}^{(1)}$.
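As a toy numerical illustration of this adaptive effect (not taken from the paper's experiments), suppose a barrack-building transition $(s, a, s')$ is rewarded $+1$ in an auxiliary environment and $0$ in the main environment. Early in training, when the auxiliary environment accounts for, say, $90\%$ of the visits to $s$, Equation (13) gives a perceived reward of $0.9 \times 1 + 0.1 \times 0 = 0.9$; later, when the main environment produces $70\%$ of the visits, the perceived reward drops to $0.3 \times 1 + 0.7 \times 0 = 0.3$. The short sketch below simply evaluates this mixture for assumed visit ratios.

```python
def perceived_reward(visit_counts, env_rewards):
    """Mixture reward of Equation (13) for one transition (s, a, s').

    visit_counts[n]: how often environment n generated visits to state s
    env_rewards[n]:  reward R^(n)(s, a, s') defined by environment n
    """
    total = sum(visit_counts)
    return sum(c / total * r for c, r in zip(visit_counts, env_rewards))

# Early training: the auxiliary environment dominates visits to s.
print(perceived_reward([1, 9], [0.0, 1.0]))   # 0.9
# Later: the main environment reaches s on its own most of the time.
print(perceived_reward([7, 3], [0.0, 1.0]))   # 0.3
```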
3.2. Application in the mini-game ‘Build Marines’ of StarCraft II
A screenshot of StarCraft II is shown in Figure 2. The difficulties of the mini-game Build Marines lie in the huge exploration space, the long chain of action dependencies, and the sparse and delayed reward. All of them make it extremely hard for the agent to build even the first marine. This work utilizes the domain knowledge that the building process can be divided into three simpler sub-tasks: building supply depots, building barracks, and building marines. We encode them into three smaller, tailored environments, each of which corresponds to one key intermediate step. Each new auxiliary environment starts at different initial states and has different rewards; their state space S and transition probabilities T between states are kept the same as in the original MDP. In order to satisfy ergodicity, we introduce the destroying action as the inverse of the building action in these auxiliary environments. The detailed configurations are listed in Table 1.
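For concreteness, the reward settings of Table 1 could be encoded as a small configuration structure like the one below. This is only an illustrative sketch of how the sub-environment definitions might be organized in code; the dictionary layout and field names are our own and are not part of PySC2 or the authors' implementation.

```python
# Reward triples are (build, destroy, attack), matching Table 1; entries
# not listed carry no shaping reward.  Initial-state ranges follow Table 1:
# a pair (a, b) is drawn uniformly at random between a and b.
SUB_ENV_CONFIGS = {
    "OrigEnv": {
        "init": {"mineral": 50, "depot": 0, "barrack": 0},
        "reward": {"marine": (+1, 0, 0)},
    },
    "SubEnv-A": {
        "init": {"mineral": (50, 200), "depot": 0, "barrack": 0},
        "reward": {"depot": (+1, -1, 0)},
    },
    "SubEnv-B": {
        "init": {"mineral": (50, 200), "depot": (5, 20), "barrack": 0},
        "reward": {"barrack": (+1, -1, 0)},
    },
    "SubEnv-C": {
        "init": {"mineral": (50, 200), "depot": (5, 20), "barrack": (1, 10)},
        "reward": {
            "center": (0, 0, -1), "depot": (0, 0, -1),
            "barrack": (0, 0, -1), "marine": (+1, -1, -1),
        },
    },
}
```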
Figure 3(a) illustrates the initial state of SubEnv-A, where only some minerals are provided. This auxiliary environment rewards the agent +1 when one supply depot is built. SubEnv-B instructs the agent to build barracks when some supply depots are available; note that its initial state contains some pre-built supply depots, as shown in Figure 3(b). When barracks are present and minerals are adequate, the agent can build marines. Under SubEnv-C, the agent is given a +1 reward when one marine is successfully
[Figure 2 annotations: the labeled units are the Command Center, Barracks, Minerals, Supply Depots, Marines, and Farmers.]
Figure 2: A screenshot of mini-game Build Marines. Some common unit types are marked in the figure. The way the
agent plays the game resembles a human player: through issuing commands together with mouse clicks and drags on the
screen. The game can be divided into three consecutive sub-tasks: building supply depots, building barracks and building
marines.
built. To make the agent focus on this sub-task, the initial state of SubEnv-C is equipped with barracks and supply depots, as shown in Figure 3(c). Since Attack is a valid action for the agent, we observe that important facilities might be destroyed by the angry marines; the burning barrack in Figure 5(a) is such an example. To discourage these harmful actions, a penalty of −1 reward is added in the auxiliary environments when an attack or destroy event happens.
These sub-environments serve as simple curricula that teach the agent the basic knowledge of building a marine. Two facts are worth noting. First, the auxiliary environments only give rough instructions to bootstrap the agent, and they are far from covering all useful knowledge; for instance, the agent still needs to learn by itself how to manage minerals and in what order to build. Second, knowledge is conveyed indirectly through the design of initial states and rewards. There is no explicit demonstration of how to achieve each sub-task: the agent needs to learn how to build supply depots, barracks, and marines through its own interaction with the environments.
During the training of PMES, every auxiliary environment uses its own shaping reward to reward any successful accomplishment of the corresponding sub-task. As a result, the agent perceives an adaptively changing reward function. We use the reward for building a barrack as an example to illustrate this adaptive reward shaping mechanism. At first, almost all barrack-building events occur in SubEnv-B, so the overall reward for building a barrack is +1, and the agent in the main environment cannot yet build barracks spontaneously from its initial states. As training proceeds, suppose the agent becomes able to build barracks in the main environment; then the overall reward for building a barrack adaptively
Table 1: The settings of the sub-environments for the task Build Marines. The middle three columns give the configuration of the initial state, and the last four columns show the rewards for the actions Build/Destroy/Attack. Here [a, b] denotes a number randomly drawn in the range a to b. ‘Center’ is the Control center in the game.

                      Initial State                      Build/Destroy/Attack
            Mineral    Depot    Barrack    Center    Depot      Barrack    Marine
OrigEnv     50         0        0          —         —          —          +1/0/0
SubEnv-A    [50,200]   0        0          —         +1/-1/0    —          —
SubEnv-B    [50,200]   [5,20]   0          —         —          +1/-1/0    —
SubEnv-C    [50,200]   [5,20]   [1,10]     0/0/-1    0/0/-1     0/0/-1     +1/-1/-1
(a) Initial state of SubEnv-A (minerals only)   (b) Initial state of SubEnv-B (with supply depots)   (c) Initial state of SubEnv-C (with supply depots and barracks)
Figure 3: The initial states of auxiliary environments SubEnv-A, B and C. The initial state of SubEnv-A is similar to that
of the main environment. SubEnv-B starts with some randomly positioned supply depots, while SubEnv-C is in addition
initialized with some random barracks.
changes to a lower level. According to Equation (13), the overall reward for building a barrack becomes smaller than +1, because there is no reward for building a barrack in the main environment. We conclude that once the agent learns to accomplish a sub-task starting from the initial states of the main environment, the reward for this sub-task decreases. On the one hand, high rewards guide the agent to master sub-tasks quickly in the beginning; on the other hand, the adaptive reduction of rewards prevents the agent from being trapped by the rewards of the sub-tasks. This adaptive mechanism is demonstrated in Section 4.1.
4. Experiment
In this section, we compare the PMES algorithm with four commonly used baselines: the A2C [43, 44], Reward Shaping, Curriculum Learning, and PLAID [25] algorithms. All of them are implemented on top of the A2C algorithm with the objective function defined in Equation (8). The structure of the A2C model used by the PMES algorithm is illustrated in Figure 1. The input of the network includes two sources of information: spatial features and non-spatial features. The spatial features include information from the mini-map and the game screen, while the non-spatial features include additional information such as the remaining minerals or the available actions. The network outputs both values and actions. The actions are then passed to PySC2 [43], a Python library for the StarCraft II Learning Environment (SC2LE), which provides an interface for RL agents to interact with the game. We implement the algorithms using TensorFlow [45] and run them on GPUs. Due to the constraint of computation resources, the screen and mini-map resolutions are set to 16 × 16 pixels, and the total number of paralleled environments is fixed to 16 for all algorithms.
The environments used are listed in Table 1, including the original environment OrigEnv and the three auxiliary environments SubEnv-A, SubEnv-B, and SubEnv-C. Their designs have been discussed in Section 3.2. Below, we describe each algorithm.
A2C Algorithm. The vanilla A2C is implemented with 16 paralleled original environments. A detailed grid search is performed to find the best hyper-parameters, which are then adopted for all other compared algorithms, including the PMES algorithm. Specifically, we use the Adam optimizer with learning rate 1e-5 and (β1, β2) = (0.9, 0.999). In the loss of Equation (8), the value loss coefficient c1 is 0.25 and the entropy coefficient c2 is 9e-5. The discount factor γ is 0.99, and gradients are clipped at norm 1. We also trained the other six mini-games with the same hyper-parameters and observed that they achieve high scores, which ensures that the hyper-parameters do not overfit to one mini-game.
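The shared training configuration described above can be summarized in a single settings object, as sketched below. The field names are illustrative only; they collect the values reported in this section rather than reproduce the authors' code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hyper-parameters shared by all compared algorithms (Section 4)."""
    num_envs: int = 16              # total paralleled environments
    screen_resolution: int = 16     # 16 x 16 pixels for screen and mini-map
    learning_rate: float = 1e-5     # Adam
    adam_betas: tuple = (0.9, 0.999)
    value_loss_coef: float = 0.25   # c1 in Equation (8)
    entropy_coef: float = 9e-5      # c2 in Equation (8)
    gamma: float = 0.99             # discount factor
    max_grad_norm: float = 1.0      # gradient clipping
    total_updates: int = 300_000    # 300K updates per experiment

config = TrainConfig()
```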
Reward Shaping Algorithm. We alter the original environment by adding shaping reward functions and denote the resulting environment as ShapingEnv. Specifically, when a supply depot is built (destroyed), the agent is rewarded (deducted) m points; when a barrack is built (destroyed), the agent is rewarded (deducted) n points; and when a marine is created (killed), the agent is rewarded (deducted) p points. As in SubEnv-A, B, and C, the agent is deducted 1 point when the command center, supply depots, or barracks are attacked. The best configuration of m, n, p is found by grid search in the experiment, namely (m, n, p) = (1/9, 1/3, 1).
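A minimal sketch of this static shaping function is given below, assuming the environment reports per-step event counts; the event-dictionary interface and key names are our own simplification, not part of SC2LE. The marine term (p = 1) coincides with the original building reward, while the depot and barrack terms and the attack penalty form the added shaping.

```python
# Static rewards for ShapingEnv with the grid-searched values
# (m, n, p) = (1/9, 1/3, 1) and a -1 penalty for attacked structures.
M_DEPOT, N_BARRACK, P_MARINE = 1 / 9, 1 / 3, 1.0

def shaping_env_reward(events):
    """events: dict of per-step counts such as 'depot_built',
    'barrack_destroyed', 'marine_created', 'structure_attacked'."""
    r = 0.0
    r += M_DEPOT * (events.get("depot_built", 0) - events.get("depot_destroyed", 0))
    r += N_BARRACK * (events.get("barrack_built", 0) - events.get("barrack_destroyed", 0))
    r += P_MARINE * (events.get("marine_created", 0) - events.get("marine_killed", 0))
    r -= 1.0 * events.get("structure_attacked", 0)
    return r
```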
Curriculum Learning Algorithm. We use the three auxiliary environments to form a curriculum before training in the original environment. First, the agent is trained under SubEnv-A to learn the ‘build supply depots’ sub-task for 50K updates. Then, the model is trained under SubEnv-B to learn ‘build barracks’ for 50K updates. Next, the model is trained under SubEnv-C to learn ‘build marines’ for 50K updates. In each stage, the model is initialized with the weights from the previous stage. Lastly, the agent is trained under the original environment OrigEnv for 300K updates.
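The staged procedure amounts to the simple loop below, in which the weights of one stage initialize the next; `train` and `make_env` are placeholder helpers for whatever agent and environment constructors are used, not functions from the paper's code.

```python
# Curriculum schedule: (environment name, number of updates) per stage.
CURRICULUM = [
    ("SubEnv-A", 50_000),   # build supply depots
    ("SubEnv-B", 50_000),   # build barracks
    ("SubEnv-C", 50_000),   # build marines
    ("OrigEnv", 300_000),   # full task
]

def run_curriculum(make_env, train, initial_weights=None):
    """Train sequentially, carrying weights from one stage to the next."""
    weights = initial_weights
    for env_name, num_updates in CURRICULUM:
        env = make_env(env_name)
        weights = train(env, num_updates, init_weights=weights)
    return weights
```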
PLAID Algorithm. The PLAID algorithm improves over the curriculum learning algorithm by inserting knowledge ‘distillation’ steps. First, an agent is trained under SubEnv-A and learns the policy π_A. Then an agent is trained under SubEnv-B with weights initialized from π_A; the learned policy is denoted π_B. After that, a new agent is trained to distill the knowledge from the experts π_A and π_B using the DAGGER algorithm [25, 46]; the distilled policy is called π̂_B. Next, an agent is initialized with the parameters of π̂_B to learn the knowledge of SubEnv-C, and we get π_C. Another distillation process is then performed to distill knowledge from π̂_B and π_C, and we get π̂_C. Lastly, we train the agent under the original environment OrigEnv, starting from π̂_C, for 300K updates.
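The distillation steps can be sketched as a supervised imitation loss: a student policy is trained to match the action distributions of the expert policies on states visited under the student, in a DAGGER-like fashion. The cross-entropy formulation and the helper below are our simplified reading of the procedure, not the exact objective used by PLAID.

```python
import numpy as np

def distill_batch_loss(student_probs, expert_probs):
    """Cross-entropy between expert and student action distributions.

    student_probs, expert_probs: arrays of shape (batch, num_actions),
    rows are probability distributions over actions for the same states
    (states are collected by rolling out the student, as in DAGGER).
    """
    eps = 1e-8
    return -np.mean(np.sum(expert_probs * np.log(student_probs + eps), axis=1))

# Example: for states from SubEnv-A query expert pi_A, for SubEnv-B query
# pi_B, then minimize distill_batch_loss with respect to the student.
```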
PMES Algorithm. The proposed PMES algorithm is implemented with 16 paralleled environments containing the main and auxiliary environments, i.e., OrigEnv, SubEnv-A, SubEnv-B, and SubEnv-C. Since the number of copies of each environment influences the performance of the algorithm, the best ratio of sub-environments is found by grid search.
Figure 4: Comparison of the mean episodic rewards of A2C, Curriculum Learning, Reward Shaping, PLAID, and PMES. For Curriculum Learning and PLAID, only their last stages are drawn. The Score is the mean episodic reward for building marines in the original environment. Each training curve is the average of 5 runs, and the shaded area represents the 90% confidence interval.
4.1. Result Analysis
Figure 4 shows the training processes of the five algorithms, and Table 2 summarizes the maximum and average scores of the five algorithms and of existing work at the end of training, together with their scores at test time. Furthermore, a game screenshot for each algorithm is shown in Figure 5. We also provide recordings of the game replays for all algorithms in the supplementary video. In
(a) A2C   (b) Reward Shaping   (c) Curriculum Learning   (d) PLAID   (e) PMES
Figure 5: Replay scenes of the 5 algorithms. All snapshots are captured within the last minute of an episode (total duration is 15 min). Important units are labeled in the images. These screenshots illustrate how many marines each algorithm can build, and also intuitively reveal the reasons when an algorithm fails to build marines.
Table 2: Comparison of the scores of the PMES algorithm with the 4 baseline algorithms and with existing work. The score is the reward for building marines in the original environment. The max (mean) score is the maximum (average) reward over all paralleled main environments and all 5 runs. The train results show the performance near the end of training at 300K updates, and the test results are the statistics of running the agent in evaluation mode for 1024 × 5 episodes (1024 episodes per run). K, M, and B denote thousand, million, and billion, respectively.

Settings               Main Environment   Sub-Environments   Mean Reward (Train)   Max Reward (Train)   Mean Reward (Test)   Max Reward (Test)   Updates
A2C                    OrigEnv            —                  8                     49                   12.9                 44                  300K
Reward Shaping         ShapingEnv         —                  20                    120                  29.5                 126                 300K
Curriculum Learning    OrigEnv            A, B, C            27                    108                  48.6                 111                 300K
PLAID                  OrigEnv            A, B, C            57                    122                  51.7                 123                 300K
PMES                   OrigEnv            A, B, C            85                    129                  91.6                 134                 300K
Grandmaster [43]       OrigEnv            —                  —                     —                    —                    142                 —
FullyConv LSTM [43]    OrigEnv            —                  —                     —                    —                    62                  600M
Relational agent [42]  OrigEnv            —                  —                     —                    123                  —                   10B
these results, the PMES algorithm obtains the highest score, and it is also very efficient in that it reaches high scores after 100K updates. Compared to PMES, the maximum scores and sample efficiency of the other 4 algorithms are worse. From the results in Table 2, the PMES algorithm reaches 134 points at test time, which is the best among the five algorithms and very close to the level (142 points) of the StarCraft Grandmaster [43]. The mean scores of A2C, Reward Shaping, and Curriculum Learning are very low, which means that these algorithms do not learn how to build marines well. PLAID has a medium performance with a mean reward of 57. Although the mean score of the PMES algorithm does not exceed that of the Relational agent [42], it takes far fewer updates to reach a comparable level (300 thousand versus 10 billion). Therefore, we conclude that our algorithm is superior to the others.
In the following, we analyze the training process of each algorithm qualitatively according to how the scores change. Recalling Equations (1), (2), and (5), the value of a state directly reflects how much reward can be gathered from it in the future. Here, the agent gets a reward only when a task or sub-task is done, so the value of the initial state is an indicator of whether the agent has mastered the task. Next, we illustrate the adaptive mechanism of PMES through the changing values of the three types of initial states.
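The monitoring used throughout Figures 6 to 10 can be reproduced with a small helper like the one below: the current value network is evaluated on a batch of sampled initial states of SubEnv-A, B, and C and the results are averaged (704 samples per point, as in the Figure 6 caption). The `value_fn` and `sample_initial_states` callables are placeholders for whatever value head and state sampler an implementation provides; they are not functions from the paper's code.

```python
def monitor_initial_state_values(value_fn, sample_initial_states,
                                 env_names=("SubEnv-A", "SubEnv-B", "SubEnv-C"),
                                 num_samples=704):
    """Average V(s0) over sampled initial states of each sub-environment.

    value_fn(state) -> float: current value estimate of the shared network.
    sample_initial_states(env_name, n) -> list of n initial states.
    """
    averages = {}
    for name in env_names:
        states = sample_initial_states(name, num_samples)
        averages[name] = sum(value_fn(s) for s in states) / len(states)
    return averages  # e.g. {"SubEnv-A": ..., "SubEnv-B": ..., "SubEnv-C": ...}
```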
The Training Process of the A2C Algorithm. Figure 6 shows the training process of the A2C algorithm in detail. From Figure 6(a), the score is unstable during training and fluctuates around 12. Figure 6(b) shows the change in the values of the initial states of SubEnv-A, B, and C. The value of curve C gradually increases as training progresses, which indicates that the agent occasionally receives the reward for building marines. However, the value of curve A stays very low, which means the agent cannot reliably build marines from the initial state of the original environment. This is verified intuitively by the game screenshot in Figure 5(a): compared with the screenshot of PMES in Figure 5(e), few supply depots are built by the A2C agent. Thus, the reason for its failure is that the agent does not even learn the first step of building marines. Another interesting phenomenon in Figure 5(a) is that barracks might be destroyed by the randomly
(a) The changing of scores of A2C   (b) The changing of values of A2C
Figure 6: The training process of the A2C algorithm. Note that only the original environments are involved. (a) The Score is the reward for building marines. (b) The changing values of the initial states of SubEnv-A, B, and C. The values of the initial states (see Figure 3) of SubEnv-A, B, and C are monitored. Each data point is the average value of 704 sampled initial states in the same environment. To clearly illustrate the trend, we fit the curves using polynomials with a sliding window; the fits are denoted as the A (B, C) Value Trends.
built marines without attack punishment.
(a) The changing of scores and rewards of Reward Shaping   (b) The changing of values of Reward Shaping
Figure 7: The training process of the Reward Shaping algorithm. (a) The reward of the Reward Shaping algorithm has two components. Main Rewards is the reward for building marines; Shaping Rewards is the auxiliary reward, such as for building supply depots and barracks; All Rewards is the actual reward used in training, which equals the sum of the Main Rewards and the Shaping Rewards. (b) The changing values during the training process of the Reward Shaping algorithm. The meanings of the A, B, C curves are the same as in Figure 6.
The Training Process of the Reward Shaping Algorithm. From the screenshot in Figure 5(b), we observe that the agent devotes most resources to building supply depots, a situation which is also reflected in the evolution of the rewards in Figure 7(a). There are 3 curves: the Main Rewards curve is the reward for building marines, which is inherited from the original environment; the Shaping Rewards curve is the sum of the auxiliary rewards for actions such as building/destroying/attacking supply depots and barracks (excluding the action of building marines); and their sum, the All Rewards curve, is the actual reward used in training. The Shaping Rewards curve stabilizes around 5 during training, which is very large considering that building one supply depot only gains +1/9 reward. As a result, the agent does not have enough resources to build barracks and marines. In Figure 7(b), the values of curves A and B stay almost unchanged during training. Without the adaptive shaping mechanism, the agent is allured by the easy, small immediate reward of building supply depots and fails to explore for the long-term, large reward of building marines. Thus, the reasons for the poor performance of the Reward Shaping algorithm are its short-sighted strategy and bad allocation of resources.
The Training Process of the Curriculum Learning Algorithm. The training process is highly unstable. In 5 runs of the algorithm, the agent builds about 70 marines in only one run; in the other four experiments, the agent can only build 5 to 10 marines. This implies that the Curriculum Learning algorithm cannot stably instruct the agent to learn the steps of building marines in most circumstances. Figure 8(a) shows the training process of the Curriculum Learning algorithm at each stage. The process can be divided into 4 stages according to the training environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The scores represent the rewards under the corresponding sub-environments; for example, the score of the model under SubEnv-A is the sum of the rewards for building supply depots and the penalties for destroying them. From the figure, we can see that the score of each stage reaches a relatively high level, which indicates that the sub-tasks are easy to learn. But the score under OrigEnv is only around 5 points, which means that the learning in the early stages has little effect on building marines. Further, the values of the initial states of the three sub-environments in Figure 8(b) differ greatly, which implies that the agent quickly forgets under a new environment what it learned in the previous curriculum stage. In the screenshot shown in Figure 5(c), there is a very limited number of supply depots and barracks, which confirms the catastrophic forgetting problem: the agent forgets how to build supply depots and barracks.
The Training Process of the PLAID Algorithm. Different from Curriculum Learning, the PLAID algorithm contains 6 stages, denoted SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv, respectively. The new stages are the two ‘distillation’ processes. In Distill-AB, an agent is trained to learn knowledge from the two expert policies previously trained in stages SubEnv-A and SubEnv-B. After the Distill-AB process, the agent masters the knowledge of both SubEnv-A and SubEnv-B, as indicated by the reward curves shown in Figure 9(a). There are two reward curves in each distillation stage, corresponding to the performance of the agent under the two related sub-environments. At the end of Distill-BC, the agent remembers how to build barracks (indicated by the reward curve of SubEnv-B) and how to build marines (indicated by the reward curve of SubEnv-C). The distillation process mitigates the catastrophic forgetting problem of transfer learning, and the agent is able to build marines at the very beginning of stage OrigEnv. Also, in Figure 9(b), the C value is lower than the B value during Distill-AB, because the agent has not yet learned the knowledge of SubEnv-C at that point. In
(a) The changing of rewards of Curriculum Learning   (b) The changing of values of Curriculum Learning
Figure 8: The training process of the Curriculum Learning algorithm. (a) The rewards of Curriculum Learning. It contains 4 stages under different environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The Score of each stage is the reward under its corresponding environment. (b) The changing values during the training process of Curriculum Learning. The meanings of the A, B, C curves are the same as in Figure 6.
the Distill-BC stage, the C value exceeds the B value, which shows that the agent has learned how to build marines after being trained under SubEnv-C. In the end, the agent achieves a better score than the agent of the Curriculum Learning algorithm, so distillation is effective in retaining knowledge. The screenshot in Figure 5(d) offers an explanation: the building of marines is restricted by the number of barracks, so the agent has not learned how to allocate resources well. Further performance improvements for the PLAID algorithm might require more training updates. Unlike the PMES algorithm, however, PLAID has no adaptive reward mechanism and does not allocate resources efficiently.
(a) The changing of rewards of PLAID   (b) The changing of values of PLAID
Figure 9: The training process of the PLAID algorithm. (a) The rewards of the PLAID algorithm. It contains 6 stages under different environments: SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv. The Score of each stage is the reward under its corresponding environments. Note that the distillation stages involve two environments at the same time. (b) The changing values during the training of PLAID. The meanings of the A, B, C curves are the same as in Figure 6.
The Training Process of the PMES Algorithm. At the beginning of the training process shown in Figure 10, the values of A, B, and C all increase concurrently, which shows the efficiency of parallelism in learning the sub-tasks. Referring to Figure 10(a), after about 60K updates the agent has learned how to build marines in the main environment from scratch, so the overall perceived rewards for building supply depots and barracks are reduced by the adaptive shaping mechanism. As a result, the values of curves A and B decrease after 60K. In contrast, the value of curve C is mainly affected by the reward for building marines and stays relatively stable during training. In particular, comparing the changing values of the PMES algorithm in Figure 10 with those of the other algorithms in Figures 6(b), 7(b), and 8(b), we find that this adaptive reward shaping is unique to the PMES algorithm. In the screenshot shown in Figure 5(e), the proportion of supply depots and barracks is well balanced, and a lot of marines are built. This implies that the agent has learned to build marines step by step and can allocate resources reasonably based on the adaptive reward shaping mechanism. Therefore, the PMES algorithm clearly outperforms the other algorithms.
(a) The changing of rewards of PMES   (b) The changing of values of PMES
Figure 10: The training process of the PMES algorithm. (a) The rewards of the PMES algorithm. (b) The changing values during the training of the PMES algorithm. The meanings of the A, B, C curves are the same as in Figure 6.
4.2. Ablation Study
In this section, we demonstrate the importance of domain knowledge in the design of the auxiliary environments, and study the design choices in the PMES algorithm.
Importance of Domain Knowledge. Recall that in the design of the sub-environments, we not only consider the build action for buildings and marines, but also take account of the destroy and attack actions. To demonstrate the importance of this domain knowledge, we design two comparative experiments that gradually remove domain knowledge from the environments used in the PMES algorithm. The first one is called the PMES_RP model. It consists of the 3 sub-environments SubEnv-A, SubEnv-B, and SubEnv-C** together with the main environment OrigEnv, where SubEnv-C** removes the attack action from the SubEnv-C sub-environment. The other model, the PMES_R model, removes both the attack and destroy actions from the 3 sub-environments and the main environment; its constituent environments are SubEnv-A*, SubEnv-B*, SubEnv-C*, and OrigEnv. The settings of the new sub-environments are listed in Table 3. Their results are shown in Figure 11. Note that the mean score of the PMES_RP model is far less than the score of the original PMES model after removing the attack action. Further, the PMES_R model, which removes both the attack and destroy actions, gets a lower score than PMES_RP and also converges more slowly. Therefore, adding more domain knowledge into the model can greatly improve the performance of the algorithm.
Table 3: Configuration of the extra auxiliary environments. Their initial states are omitted since they are the same as in Table 1. The 3 numbers in each cell correspond to the rewards for the actions Build/Destroy/Attack. ‘Center’ is the Control center in the game.

Setting       Center    Depot     Barrack   Marine
SubEnv-A*     —         +1/0/0    —         —
SubEnv-B*     —         —         +1/0/0    —
SubEnv-C*     —         —         —         +1/0/0
SubEnv-C**    —         —         —         +1/-1/0
Table 4: Comparison of using one sub-environment (A, B, or C) in the PMES algorithm. Each setting is trained to 300K updates with 2 runs.

             Configuration                                    Rewards
             OrigEnv   SubEnv-A   SubEnv-B   SubEnv-C         Mean   Max
Setting A    10        6          0          0                5      10
Setting B    10        0          6          0                45     69
Setting C    10        0          0          6                32     65
The Sensitivity to the Number of Environments. In this work, the model is trained with 16 paralleled environments. We analyze the impact of the number of each kind of environment on the performance of the model, as shown in Figure 12. There are some differences between the results of the different settings, so the performance of the PMES algorithm is related to the allocation of the environments. Nonetheless, even under an imperfect configuration of environments, the performance of the PMES algorithm is better than that of all the compared baseline algorithms (note that the experiments in Figure 12 are trained for only 75K updates).
Figure 11: Three settings that encode gradually more knowledge into the sub-environment design. The PMES_R model rewards the building actions; the PMES_RP model additionally punishes destruction; the PMES model in addition discourages the attack actions. Note that the more knowledge is encoded, the better the performance. 5 runs are performed for each setting.
[Figure 12 bar chart: scores of Experiment-1, Experiment-2, and Experiment-3 for the environment configurations 8-2-4-2, 9-1-4-2, 9-2-3-2, 10-1-3-2, and 10-2-2-2.]
Figure 12: The effect of environment configurations. The four numbers of a configuration represent the numbers of the four environments (OrigEnv, SubEnv-A, SubEnv-B, and SubEnv-C). Each experiment is trained for 75K updates.
The Role of the Auxiliary Environments. In order to show the importance of each auxiliary environment, we single out SubEnv-A, SubEnv-B, and SubEnv-C respectively and use only one of them as the auxiliary environment. The results are shown in Table 4, where the hyper-parameter settings are the same as before. By comparing the results of the different settings, we conclude that introducing even part of the auxiliary environments alongside the main environment can effectively guide the agent to learn. Among them, SubEnv-B and SubEnv-C are more effective than SubEnv-A. This is reasonable, since SubEnv-A would misguide the agent into spending resources on building supply depots rather than building marines.
5. Conclusion
In this paper, we propose a novel deep reinforcement learning algorithm, the PMES algorithm, to solve complex multi-step tasks. It introduces auxiliary environments alongside the main environment to effectively convey domain knowledge to the learning agent. In the PMES algorithm, the agent interacts with different environments at the same time, which speeds up the training process. More importantly, the PMES algorithm has an adaptive shaping mechanism, where the overall reward function adjusts adaptively as training progresses. Not only can this adaptive reward shaping guide the agent to find a way to complete the task, but it also prevents the agent from being trapped in short-sighted, sub-optimal policies. Experimental results on the StarCraft II mini-game demonstrate the effectiveness of the proposed PMES algorithm, which is much more effective and efficient than the traditional A2C, Reward Shaping, Curriculum Learning, and PLAID algorithms.
Acknowledgment
We would like to thank the members of The Killers team and their instructors Lisen Mu and Jingchu Liu in DeeCamp 2018 for the insightful discussions. This work was supported by three grants from the National Natural Science Foundation of China (No. 61976174, No. 61877049, and No. 11671317), and in part by the China Scholarship Council (No. 201906280201).
Conflict of Interest
The authors declare that there is no conict of interest regarding the publication of this article.
References
[1] S. Li, L. Ding, H. Gao, C. Chen, Z. Liu, Z. Deng, Adaptive neural network tracking control-based
reinforcement learning for wheeled mobile robots with skidding and slipping, Neurocomputing 283
(2018) 20–30.
[2] F. Li, Q. Jiang, S. Zhang, M. Wei, R. Song, Robot skill acquisition in assembly process using deep
reinforcement learning, Neurocomputing 345 (2019) 92–102.
[3] M. Wulfmeier, D. Z. Wang, I. Posner, Watch this: Scalable cost-function learning for path planning
in urban environments, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), IEEE, 2016, pp. 2089–2095.
[4] A. E. Sallab, M. Abdou, E. Perot, S. Yogamani, Deep reinforcement learning framework for au-
tonomous driving, Electronic Imaging 2017 (2017) 70–76.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of go with deep neural
networks and tree search, nature 529 (2016) 484.
[6] H. Jiang, H. Zhang, K. Zhang, X. Cui, Data-driven adaptive dynamic programming schemes for
non-zero-sum games of unknown discrete-time nonlinear systems, Neurocomputing 275 (2018) 649–
658.
[7] H. Jiang, H. Zhang, X. Xie, J. Han, Neural-network-based learning algorithms for cooperative games
of discrete-time multi-player systems with control constraints via adaptive dynamic programming,
Neurocomputing 344 (2019) 13–19.
[8] Z. Wang, L. Liu, H. Zhang, G. Xiao, Fault-tolerant controller design for a class of nonlinear mimo
discrete-time systems via online reinforcement learning algorithm, IEEE Transactions on Systems,
Man, and Cybernetics: Systems 46 (2015) 611–622.
[9] A. Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: Theory and
application to reward shaping, in: ICML, volume 99, 1999, pp. 278–287.
[10] G. Konidaris, A. Barto, Autonomous shaping: Knowledge transfer in reinforcement learning, in:
Proceedings of the 23rd international conference on Machine learning, ACM, 2006, pp. 489–496.
[11] E. Wiewiora, G. W. Cottrell, C. Elkan, Principled methods for advising reinforcement learning
agents, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003,
pp. 792–799.
[12] A. Laud, G. DeJong, The influence of reward on the speed of reinforcement learning: An analysis
of shaping, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03),
2003, pp. 440–447.
[13] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess,
J. T. Springenberg, Learning by playing solving sparse reward tasks from scratch, in: Proceedings
of the 35th International Conference on Machine Learning, volume 80, 2018, pp. 4344–4353.
[14] B. Marthi, Automatic shaping and decomposition of reward functions, in: Proceedings of the 24th
International Conference on Machine learning, ACM, 2007, pp. 601–608.
[15] S. M. Devlin, D. Kudenko, Dynamic potential-based reward shaping, in: Proceedings of the 11th
International Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, 2012, pp.
433–440.
[16] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, S. Hochreiter,
Rudder: Return decomposition for delayed rewards, in: Advances in Neural Information Processing
Systems, 2019, pp. 13544–13555.
[17] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, P. Abbeel, Overcoming exploration in rein-
forcement learning with demonstrations, in: 2018 IEEE International Conference on Robotics and
Automation (ICRA), IEEE, 2018, pp. 6292–6299.
[18] S. Narvekar, J. Sinapov, M. Leonetti, P. Stone, Source task creation for curriculum learning, in:
Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems,
International Foundation for Autonomous Agents and Multiagent Systems, 2016, pp. 566–574.
[19] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, P. Abbeel, Reverse curriculum generation for
reinforcement learning, in: Conference on Robot Learning, 2017, pp. 482–495.
[20] M. Svetlik, M. Leonetti, J. Sinapov, R. Shah, N. Walker, P. Stone, Automatic curriculum graph
generation for reinforcement learning agents, in: The Conference on Artificial Intelligence, AAAI,
2017, 2017, pp. 2590–2596.
[21] W. M. Czarnecki, S. M. Jayakumar, M. Jaderberg, L. Hasenclever, Y. W. Teh, N. Heess, S. Osindero,
R. Pascanu, Mix & match agent curricula for reinforcement learning, in: International Conference
on Machine Learning, 2018, pp. 1095–1103.
[22] Z. Ren, D. Dong, H. Li, C. Chen, Self-paced prioritized curriculum learning with coverage penalty in
deep reinforcement learning, IEEE transactions on neural networks and learning systems 29 (2018)
2216–2226.
[23] N. Jiang, S. Jin, C. Zhang, Hierarchical automatic curriculum learning: Converting a sparse reward
navigation task into dense reward, Neurocomputing (2019).
[24] E. Parisotto, J. L. Ba, R. Salakhutdinov, Actor-mimic: Deep multitask and transfer reinforcement
learning, in: 4th International Conference on Learning Representations, 2016.
[25] G. Berseth, C. Xie, P. Cernek, M. Van de Panne, Progressive reinforcement learning with distillation
for multi-skilled motion control, in: 6th International Conference on Learning Representations, 2018.
[26] F. Tanaka, M. Yamamura, Multitask reinforcement learning on the distribution of mdps, in: Com-
putational Intelligence in Robotics and Automation, 2003. Proceedings. 2003 IEEE International
Symposium on, volume 3, IEEE, 2003, pp. 1108–1113.
[27] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: a hierarchical bayesian
approach, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp.
1015–1022.
[28] M. Snel, S. Whiteson, Multi-task evolutionary shaping without pre-specied representations, in:
Proceedings of the 12th annual conference on Genetic and evolutionary computation, ACM, 2010,
pp. 1031–1038.
[29] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: Advances in
Neural Information Processing Systems, 2014, pp. 819–827.
[30] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, R. Pascanu,
Distral: Robust multitask reinforcement learning, in: Advances in Neural Information Processing
Systems, 2017, pp. 4496–4506.
[31] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, K. Kavukcuoglu, Rein-
forcement learning with unsupervised auxiliary tasks, in: 5th International Conference on Learning
Representations, 2017.
[32] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin,
L. Sifre, K. Kavukcuoglu, et al., Learning to navigate in complex environments, in: 5th International
Conference on Learning Representations, 2017.
[33] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, N. Freitas, The intentional
unintentional agent: Learning to solve many continuous control tasks simultaneously, in: Conference
on Robot Learning, 2017, pp. 207–216.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu,
Asynchronous methods for deep reinforcement learning, in: International conference on machine
learning, 2016, pp. 1928–1937.
[35] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. G. Bellemare, R. Munos, The reactor: A fast
and sample-efficient actor-critic agent for reinforcement learning, in: International Conference on
Learning Representations, ICLR 2018, 2018.
[36] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley,
I. Dunning, et al., Impala: Scalable distributed deep-rl with importance weighted actor-learner
architectures, in: International Conference on Machine Learning, 2018, pp. 1406–1415.
[37] D. Lee, H. Tang, J. O. Zhang, H. Xu, T. Darrell, P. Abbeel, Modular architecture for starcraft
ii with deep reinforcement learning, in: Fourteenth Artificial Intelligence and Interactive Digital
Entertainment Conference, 2018.
[38] Z. Pang, R. Liu, Z. Meng, Y. Zhang, Y. Yu, T. Lu, On reinforcement learning for full-length game
of starcraft, in: The Conference on Artificial Intelligence, AAAI, 2019, pp. 4691–4698.
[39] T. Rashid, M. Samvelyan, C. S. Witt, G. Farquhar, J. Foerster, S. Whiteson, Qmix: Monotonic value
function factorisation for deep multi-agent reinforcement learning, in: International Conference on
Machine Learning, 2018, pp. 4292–4301.
[40] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lil-
licrap, E. Lockhart, et al., Deep reinforcement learning with relational inductive biases, in: 7th
International Conference on Learning Representations, 2019.
[41] J. Gehring, D. Ju, V. Mella, D. Gant, N. Usunier, G. Synnaeve, High-level strategy selection under
partial observability in starcraft: Brood war, in: NeurIPS Workshop on Reinforcement Learning
under Partial Observability, 2018.
[42] V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert,
T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, P. W.
Battaglia, Deep reinforcement learning with relational inductive biases, in: International Conference
on Learning Representations, ICLR 2019, 2019.
[43] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler,
J. Agapiou, J. Schrittwieser, et al., Starcraft ii: A new challenge for reinforcement learning, arXiv
preprint arXiv:1708.04782 (2017).
[44] R. Ring, T. Matiisen, Replicating deepmind starcraft ii reinforcement learning benchmark with
actor-critic methods (2018).
[45] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learn-460
ing on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from
tensorow.org.
[46] S. Ross, G. J. Gordon, J. A. Bagnell, No-regret reductions for imitation learning and structured
prediction, in: AISTATS, 2011.