
Paralleled Multi-Environments Shaping Algorithm for Complex

Multi-step Task

Cong Maa,d,∗, Zhizhong Lib,∗, Dahua Linc, Jiangshe Zhanga,∗∗

aSchool of Mathematics and Statistics, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an, Shaanxi, P.R. China

bSenseTime, No 12, Science Park East Avenue, HKSTP, Shatin, Hong Kong, P.R. China

cDepartment of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, P.R. China

dDepartment of Mathematics, KU Leuven, Celestijnenlaan 200B, Leuven 3001, Belgium

Abstract

The complex multi-step task is a major challenge in reinforcement learning because of its sparse reward and sequential structure: the agent must complete several consecutive steps without any intermediate reward. Reward shaping and curriculum learning are commonly used to address this challenge, but reward shaping is prone to sub-optimal policies and curriculum learning easily suffers from catastrophic forgetting. In this paper, we propose a novel algorithm called the Paralleled Multi-Environment Shaping (PMES) algorithm, in which several sub-environments are built from human knowledge to make the agent aware of the importance of the intermediate steps, each sub-environment corresponding to one key intermediate step. Specifically, the learning agent is trained under these paralleled multiple environments, including the original environment and several sub-environments, by the synchronous advantage actor-critic algorithm, and PMES adjusts the reward function through an adaptive reward shaping mechanism. In this way, PMES incorporates human experience through multiple different environments rather than by shaping the reward function alone, combining the benefits of reward shaping and curriculum learning while avoiding their drawbacks. Extensive experiments on the mini-game 'Build Marines' of the StarCraft II environment show that the proposed algorithm is more effective than reward shaping, curriculum learning, and the PLAID algorithm, coming close to the level of a human Grandmaster. Compared with the best existing algorithm, it also needs less time and fewer computing resources to reach a good result.

Keywords: Reinforcement Learning, Multi-step task, Paralleled Multiple Environments, Adaptive

Reward Shaping

∗The two authors contributed equally

∗∗Corresponding author

Email address: jszhang@xjtu.edu.cn (Jiangshe Zhang)

Preprint submitted to Journal of Neurocomputing December 11, 2019

1. Introduction

In recent years, reinforcement learning (RL), especially deep reinforcement learning (DRL), which combines RL with deep learning, has received widespread attention in many fields, such as robot control [1, 2], autonomous driving [3, 4], game playing [5, 6], and even optimal control problems [7, 8]. In RL, the agent learns and improves its performance through constant trial, finding the optimal path by maximizing the return function. However, due to the huge state space and action space and the sparse reward function, instructing the agent to accomplish a complex multi-step task as a human does has long been one of the biggest challenges.

In a multi-step task, it is difficult to make the agent aware of the importance of the intermediate steps in the absence of corresponding intermediate rewards, while such tasks are very easy for human beings thanks to prior knowledge or experience. Thus, transferring knowledge into the algorithm efficiently is very important, and various approaches have been proposed for this problem. Reward shaping algorithms [9–14] manipulate the reward signals from the environment to make the agent learn faster. Ng et al. [9] proved that potential-based shaping preserves the optimal policy, and Devlin et al. [15] further generalized potential-based shaping to be time-dependent, called dynamic shaping. RUDDER [16] was proposed to solve the problem of delayed reward by decomposing the original reward into a new, return-equivalent MDP. But shaping the reward function properly is very difficult, and the resulting policy is prone to getting stuck at a sub-optimum [17]. Curriculum learning algorithms [18–23] can also be used to solve multi-step tasks. They divide the whole task into a series of increasingly difficult sub-tasks and train the agent in a meaningful order, because solving simple tasks first helps the agent solve more complex tasks later. As a result, the training complexity of the original task is reduced by the knowledge from the previous auxiliary tasks. Narvekar et al. [18] proposed designing a sequence of source tasks for the agent to improve the performance and learning speed of an RL algorithm, and Svetlik et al. [20] further presented a method to generate the curriculum automatically from task descriptors. However, this line of work is still in its infancy and its applicability is quite restricted. For example, reverse curriculum learning [19] requires that the environment can be evolved backwards to 'earlier' states, which is not feasible in some environments. Moreover, curriculum learning can suffer from the catastrophic forgetting problem, i.e., knowledge learned previously might be erased while new knowledge is being learned. This problem can be mitigated by distillation, which creates one single controller that implements the tasks of a set of experts. For instance, Parisotto et al. [24] proposed adopting a supervised regression objective to learn all expert policies, called distillation, and Berseth et al. [25] proposed the PLAID algorithm, which combines transfer learning and policy distillation to learn multiple skills. But it is very difficult to find the best policy from a mixture of experts. Thus, a novel RL algorithm for complex multi-step tasks is particularly needed.


In RL there is a large body of research on multi-task learning [26–33], whose goal is to accomplish all tasks together rather than focusing on one of them. But multi-step task learning is very different from multi-task learning: it puts more emphasis on the sequential relationship between the sub-tasks, which must be carried out in order, and the reward is very sparse, as the agent receives it only after completing the whole task. We therefore divide the whole task into several sub-tasks according to its steps and train on them simultaneously. In previous work, there are several paralleled RL algorithms [34–36] in which the agent interacts with multiple copies of a single environment simultaneously to speed up training, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm [34] and the Advantage Actor-Critic (A2C) algorithm. Inspired by them, we can train the agent under multiple different environments to learn different tasks simultaneously, provided their state spaces and action spaces are exactly the same.

In this paper, we propose a novel algorithm, named the Paralleled Multi-Environment Shaping (PMES) algorithm, to solve such complex multi-step tasks. Considering the lack of rewards for intermediate steps, we introduce the corresponding rewards by building several sub-environments, each of which corresponds to one key intermediate step. Combining this with paralleled RL, we train the agent under multiple paralleled, different environments, including the original environment and several sub-environments. In this way, the agent can assign these rewards automatically and achieve the final target.

In recent years, several classical complex environments for testing RL algorithms have appeared, especially the SC2LE environment [37–42]. It is an environment for the real-time strategy (RTS) game StarCraft II, released by DeepMind and Blizzard [43], which contains seven mini-games aimed at different elements of playing StarCraft II. One of the mini-games, Build Marines, is much more difficult than the others because it involves multiple decisions, each dependent on previous work. Thus, in this paper, we use this mini-game as the platform to evaluate the performance of the PMES algorithm. The experimental results on StarCraft II demonstrate that the proposed algorithm performs well, achieving a higher score than the other algorithms with faster convergence.

The rest of the paper is organized as follows. Section 2 introduces preliminary concepts and sets up notation. Section 3 presents the PMES algorithm. Section 4 compares PMES with other algorithms qualitatively and quantitatively. Section 5 concludes the paper.

2. Preliminaries

A reinforcement learning task is often modeled as a Markov Decision Process (MDP) $M = (\mathcal{S}, S_0, \mathcal{A}, T, \gamma, R)$, where $\mathcal{S}$ is the state space, $S_0$ is the distribution of initial states, $\mathcal{A}$ is the action space, $T = \{P_{sa}(\cdot) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ is the set of transition probabilities, $\gamma \in (0,1]$ is the discount factor, and $R(s, a, s')$ is a stochastic reward for moving from state $s$ to state $s'$ by taking action $a$. During training, at any time $t$ the agent interacts with the environment by choosing an action $a_t$, receives the reward $R_t$, and then transits to the next state $s_{t+1}$. The actions are sampled according to its current policy $\pi: \mathcal{S} \to P(\mathcal{A})$, where $P(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$. The discounted return is defined as

$$G_t = \sum_{i=t}^{T-1} \gamma^{i-t} R_i, \qquad (1)$$

and the goal is to maximize the expected return from the start distribution, $J = \mathbb{E}_{s_0 \sim S_0}[G_0 \mid s_0]$. The action-value function is the expected return of taking action $a_t$ in state $s_t$,

$$Q^\pi(s_t, a_t) := \mathbb{E}[G_t \mid s_t, a_t], \qquad (2)$$

and the optimal policy satisfies the optimal Bellman equation,

$$Q^*(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[ R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]. \qquad (3)$$

In policy-based model-free methods, the policy $\pi_\theta$ is parameterized by $\theta$ and optimized by gradient ascent, where the gradient can be estimated by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]. \qquad (4)$$

Here $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_T, a_T\}$ is a trajectory sampled following the policy. The above gradient estimate suffers from high variance, which can be reduced by replacing $G_t$ in Equation (4) with the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, where

$$V(s_t) = \mathbb{E}_{a_t}[Q(s_t, a_t)] \qquad (5)$$

is the value function of state $s_t$. An estimate of the advantage function is given by

$$\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_\omega(s_{t+k}) - V_\omega(s_t), \qquad (6)$$

where $V_\omega$ is the value network with parameter $\omega$, and $t+k$ is bounded by $T$. So

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A(s_t, a_t) \right]. \qquad (7)$$

In the A2C algorithm, the value network shares parameters with the policy network. Besides, the entropy $H(\pi_\theta(s_t))$ of the policy is added to the objective function to encourage exploration. The objective function of the A2C algorithm for sampled trajectories $\tau$ is

$$\mathbb{E}_{\tau \sim \pi(\tau)} \left[ \log \pi_\theta(a_t \mid s_t) A(s_t, a_t) + c_1 \left( R_t - V_\theta(s_t) \right)^2 + c_2 H(\pi_\theta(s_t)) \right], \qquad (8)$$

where $c_1, c_2$ are coefficients controlling the value loss and the entropy, respectively, and $R_t$ is the expected cumulative discounted reward at time $t$.
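As a concrete illustration of Equations (6) and (8), the following minimal Python sketch computes the k-step advantage estimate and a per-step A2C loss for toy numbers. The function names and the sign convention (writing a single quantity to minimize rather than an objective to maximize) are our own choices, not the paper's implementation.

```python
def kstep_advantage(rewards, values, gamma=0.99):
    """k-step advantage estimate of Equation (6):
    sum_i gamma^i * r_{t+i} + gamma^k * V(s_{t+k}) - V(s_t).
    `rewards` holds r_t..r_{t+k-1}; `values` holds V(s_t)..V(s_{t+k})."""
    k = len(rewards)
    ret = sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** k * values[-1]
    return ret - values[0]

def a2c_loss(logp, advantage, ret, value, entropy, c1=0.25, c2=9e-5):
    """Per-step surrogate corresponding to Equation (8), written as a loss to
    minimize: policy-gradient term, value regression, and entropy bonus."""
    return -logp * advantage + c1 * (ret - value) ** 2 - c2 * entropy
```

For example, with `gamma = 1`, two rewards of 1, and zero value estimates, the advantage is simply 2.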


[Figure 1 diagram: the mini-map and screen-map inputs each pass through two convolutional layers (conv1, conv2), and the non-spatial input through a fully connected layer (fc1); the merged features (fc1+fc2) feed a value head, a spatial policy head via a third convolution (conv3), and a non-spatial policy head. The resulting actions go to the main and auxiliary environments, which return states.]

Figure 1: Architecture of PMES for StarCraft II. The agent interacts with multiple environments, including main environments and auxiliary environments. The overall structure is based on A2C. The input of the network contains spatial and non-spatial features. Spatial features include screen-maps and mini-maps, which abstract away from the RGB images of the game. Non-spatial features contain information such as the number of minerals collected. The output actions consist of game commands and their spatial coordinates.

3. Paralleled Multi-Environment Shaping Algorithm

The core of the Paralleled Multi-Environment Shaping (PMES) algorithm is building auxiliary environments to assist the agent in learning the original main environment. For simplicity, we assume the main task and the auxiliary tasks differ only in their initial states and reward functions. Specifically, we denote the original task in the main environment as an MDP

$$M^{(0)} = \left( \mathcal{S}, S^{(0)}_0, \mathcal{A}, T, \gamma, R^{(0)} \right), \qquad (9)$$

and the $N$ kinds of auxiliary tasks as

$$M^{(n)} = \left( \mathcal{S}, S^{(n)}_0, \mathcal{A}, T, \gamma, R^{(n)} \right), \quad n = 1, \ldots, N, \qquad (10)$$

where $\mathcal{S}$ is the state space, and $S^{(0)}_0$, $S^{(n)}_0$ are the initial state distributions of the main environment and the $n$-th auxiliary environment, respectively. Because we use the synchronous, batched deep RL algorithm A2C for training, only one agent interacts with all environments in parallel. From the perspective of the learning agent, the overall mixed environment can not only start at different initial states but also provide different reward signals. Strictly speaking, the process is a partially observed MDP (POMDP), because it is unknown which exact environment the agent is currently interacting with. However, the purpose of a POMDP is to find the optimal actions for each possible belief over the world states, which deviates from our ultimate goal of optimizing the main task.


Figure 1 shows the overall architecture of PMES. It follows the settings of the A2C agent in [43, 44]. In the implementation, the policy network and the value network share parameters. One important aspect of the PMES architecture is the interaction of one single learning agent with multiple paralleled environments simultaneously. We exploit this feature to enable the adaptive shaping mechanism through sub-environments.

3.1. A Special Case

Suppose the count of each type of environment is $d^{(n)}$, $n = 0, \ldots, N$, where index $0$ is reserved for the main environment. Then the total number of environments is $\sum_n d^{(n)}$, and the fraction of time the agent interacts with each type of environment is $d^{(n)} / \sum_n d^{(n)}$. Further, assume that the current policy of the agent is $\pi$, and denote the probability of visiting state $s$ in environment $M^{(n)}$ as $p^{(n)}_\pi(s)$. Then state $s$ is accessed from the different environments with probabilities $\rho^{(n)}_\pi$, where

$$\rho^{(n)}_\pi(s) \propto \frac{d^{(n)}}{\sum_n d^{(n)}} \cdot p^{(n)}_\pi(s), \quad n = 0, \ldots, N. \qquad (11)$$

Therefore, the distribution over source environments for a tuple $(s, a, s')$ can be obtained from

$$\rho^{(n)}_\pi(s, a, s') \propto \rho^{(n)}_\pi(s) \cdot \pi(a \mid s) \cdot P_{sa}(s'). \qquad (12)$$

Since the policy $\pi$ and the transition probabilities $P_{sa}$ are the same in all environments, we have $\rho^{(n)}_\pi(s, a, s') = \rho^{(n)}_\pi(s)$. Thus the overall reward function $R_\pi$ (which depends on the policy $\pi$) is

$$R_\pi(s, a, s') = \sum_n \rho^{(n)}_\pi(s, a, s') R^{(n)}(s, a, s') = \sum_n \rho^{(n)}_\pi(s) R^{(n)}(s, a, s'). \qquad (13)$$
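A minimal sketch of the mixing in Equations (11)–(13): if per-environment visit counts stand in for the (unnormalized) ratios $\rho^{(n)}_\pi(s)$, the perceived reward for a fixed transition $(s, a, s')$ is the count-weighted average of the per-environment rewards. The function name and the counting scheme are illustrative, not the paper's code.

```python
def perceived_reward(visit_counts, env_rewards):
    """Perceived reward of Equation (13) for one fixed (s, a, s').
    visit_counts[n] ~ rho^(n)_pi(s): unnormalized visit frequency of s from env n.
    env_rewards[n]  = R^(n)(s, a, s')."""
    total = sum(visit_counts)
    return sum(c / total * r for c, r in zip(visit_counts, env_rewards))

# Early in training, s is reached almost only from an auxiliary env paying +1,
# so the perceived reward is near 1; once the main env (reward 0) reaches s as
# often, the perceived reward adaptively drops to 0.5.
early = perceived_reward([1, 9], [0.0, 1.0])
late = perceived_reward([9, 9], [0.0, 1.0])
```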

In the special case of ergodic MDPs and potential-based shaping rewards [9], the optimal policy is

preserved.

Proposition 1. Suppose the state space $\mathcal{S}$ and action space $\mathcal{A}$ of the main MDP $M^{(0)}$ in Equation (9) are finite, and the auxiliary MDPs are constructed as in Equation (10). If $M^{(0)}$ is ergodic and the auxiliary rewards are potential-based shaping rewards

$$R^{(n)}(s, a, s') = R^{(0)}(s, a, s') + \gamma \phi^{(n)}(s') - \phi^{(n)}(s), \qquad (14)$$

where $n = 1, \ldots, N$ and $\phi^{(n)}$ is the potential function of the $n$-th environment, then the optimal policy $\pi^*$ of the PMES algorithm is also an optimal policy of the original main environment $M^{(0)}$, i.e., PMES has policy invariance.

Proof. Our proof follows the practice of the reward shaping literature [9, 11], with the necessary modifications. For notational simplicity, we set $\phi^{(0)} \equiv 0$. For the optimal policy $\pi^*$, the corresponding Q-function $Q^*$ satisfies the optimal Bellman equation

$$\begin{aligned} Q^*(s, a) &= \mathbb{E}_{s' \sim P_{sa}}\left[ R_{\pi^*}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right] \\ &= \mathbb{E}\left[ \sum_n \rho^{(n)}_{\pi^*}(s) R^{(n)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right] \\ &= \mathbb{E}\left[ R^{(0)}(s, a, s') + \sum_n \rho^{(n)}_{\pi^*}(s) \left( \gamma \phi^{(n)}(s') - \phi^{(n)}(s) \right) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]. \end{aligned} \qquad (15)$$

One property of ergodic MDPs is that they have unique stationary state distributions. Since the auxiliary MDPs differ only in their initial states and reward functions, their induced Markov random processes are the same, and thus they have the same stationary distribution as the main environment, i.e.,

$$p^{(n)}_\pi(s) = p^{(0)}_\pi(s), \quad n = 1, \ldots, N, \quad \forall s \in \mathcal{S}. \qquad (16)$$

Together with Equation (11), we have

$$\rho^{(n)}_\pi(s) = \rho^{(n)}_\pi(s'), \quad n = 0, \ldots, N, \quad \forall s, s' \in \mathcal{S}. \qquad (17)$$

Using the above equation, we can rearrange Equation (15):

$$Q^*(s, a) + \sum_n \rho^{(n)}_{\pi^*}(s) \phi^{(n)}(s) = \mathbb{E}_{s' \sim P_{sa}}\left[ R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') + \gamma \sum_n \rho^{(n)}_{\pi^*}(s') \phi^{(n)}(s') \right]. \qquad (18)$$

Setting

$$\hat{Q}^*(s, a) := Q^*(s, a) + \sum_n \rho^{(n)}_{\pi^*}(s) \phi^{(n)}(s), \qquad (19)$$

we obtain the optimal Bellman equation for the original MDP $M^{(0)}$,

$$\hat{Q}^*(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[ R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} \hat{Q}^*(s', a') \right]. \qquad (20)$$

The optimal policy $\pi^*$ satisfies

$$\pi^*(s) \in \operatorname*{argmax}_{a \in \mathcal{A}} Q^*(s, a) = \operatorname*{argmax}_{a \in \mathcal{A}} \left[ Q^*(s, a) + \sum_n \rho^{(n)}_{\pi^*}(s) \phi^{(n)}(s) \right] = \operatorname*{argmax}_{a \in \mathcal{A}} \hat{Q}^*(s, a), \qquad (21)$$

since the added term does not depend on $a$. So the optimal policy for the paralleled, mixed environments is also an optimal policy for the original main environment. The opposite direction, that every optimal policy of the main environment is also optimal for the mixed environment, can be argued by reversing the above discussion.
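The shaping construction of Equation (14) can be sketched directly; along any trajectory the potential terms telescope, so with $\gamma = 1$ the added return depends only on the endpoint potentials. The dictionary-based potential below is a made-up example.

```python
def shaped_reward(r0, s, s_next, phi, gamma=0.99):
    """Potential-based shaping of Equation (14):
    R^(n)(s, a, s') = R^(0)(s, a, s') + gamma * phi(s') - phi(s)."""
    return r0 + gamma * phi(s_next) - phi(s)

# Toy potential over three states; with gamma = 1 the shaping bonus over the
# trajectory 0 -> 1 -> 2 telescopes to phi(2) - phi(0) = 5.0.
phi = {0: 0.0, 1: 2.0, 2: 5.0}.__getitem__
bonus = sum(shaped_reward(0.0, s, s_next, phi, gamma=1.0)
            for s, s_next in [(0, 1), (1, 2)])
```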

From the proposition, under the specified conditions the optimal solution is not related to the distribution of initial states or to the shaping rewards. However, these two factors do affect learning efficiency. At the starting phase of the training process, the agent can explore multiple critical parts of the MDP simultaneously thanks to the diverse arrangement of initial states. Thus, the auxiliary environments provide a curriculum, in which each environment starts at, and focuses on, a different phase of the original problem during the early training stage. Different from traditional shaping methods, the PMES algorithm provides an autonomous, adaptive way of shaping. The perceived reward function is shown in Equation (13). In the early exploration stage, unlike the equilibrium state of Equation (17), the ratios $\rho^{(n)}_\pi(s)$ for different states $s$ have not yet converged and might differ. For example, for a state $s$ near the initial state of the first auxiliary environment $M^{(1)}$, the ratio $\rho^{(1)}_\pi(s)$ might be near $1$, and the first auxiliary environment contributes most of the visits to state $s$. The agent's currently perceived reward for the tuple $(s, a, s')$, with action $a$ and successor state $s'$, is then influenced mostly by the reward function $R^{(1)}(s, a, s')$ of the auxiliary environment $M^{(1)}$. As training progresses, the main environments reach state $s$ more frequently, and the perceived reward for $(s, a, s')$ becomes influenced more by the original reward structure of the main environment $M^{(0)}$ than by the auxiliary environment $M^{(1)}$.

3.2. Application in the mini-game 'Build Marines' of StarCraft II

Figure 2 shows a screenshot of StarCraft II. The difficulties of the mini-game Build Marines lie in the huge exploration space, the long chain of action dependencies, and the sparse and delayed reward, all of which make it extremely hard for the agent to build even the first marine. This work uses the domain knowledge that the building process can be divided into three simpler sub-tasks: building supply depots, building barracks, and building marines. We encode them into three smaller, tailored environments, each of which corresponds to one key intermediate step. Every auxiliary environment starts at different initial states and has different rewards, while their state spaces $\mathcal{S}$ and transition probabilities $T$ are kept the same as in the original MDP. In order to satisfy ergodicity, we introduce the destroying action as the inverse of the building action in these auxiliary environments. The detailed configurations are listed in Table 1.

Figure 3(a) illustrates the initial state of SubEnv-A, where only some minerals are provided. This auxiliary environment rewards the agent with +1 when one supply depot is built. SubEnv-B instructs the agent to build barracks when supply depots are present; note that its initial state contains some pre-built supply depots, as shown in Figure 3(b). When barracks are present and minerals are adequate, the agent can build marines. Under SubEnv-C, the agent is given a +1 reward when one marine is successfully built. To make the agent focus on this sub-task, the initial state of SubEnv-C is equipped with barracks and supply depots, as shown in Figure 3(c). Since Attack is a valid action for the agent, we observe that important facilities might be destroyed by the angry marines; the burning barrack in Figure 5(a) is such an example. To discourage those harmful actions, a penalty of -1 reward is added in the auxiliary environments whenever an attack or destroy event happens.

[Figure 2 screenshot with common unit types marked: Command Center, Barrack, Mineral, Supply Depot, Marines, and Farmers.]

Figure 2: A screenshot of the mini-game Build Marines. Some common unit types are marked in the figure. The agent plays the game much like a human player: by issuing commands together with mouse clicks and drags on the screen. The game can be divided into three consecutive sub-tasks: building supply depots, building barracks, and building marines.

These sub-environments serve as simple curricula that teach the agent the basics of building a marine. Two facts are notable. First, the auxiliary environments give only rough instructions to bootstrap the agent and are far from covering all useful knowledge; for instance, the agent still needs to learn by itself how to manage minerals and in what order to build. Second, knowledge is conveyed indirectly through the design of initial states and rewards. There is no explicit demonstration of how to achieve each sub-task: the agent must learn how to build supply depots, barracks, and marines through its interaction with the environments.

During the training of PMES, every auxiliary environment uses its reward shaping function to reward any successful accomplishment of the corresponding sub-task. As a result, the agent perceives an adaptively changing reward function. We use the reward for building a barrack as an example to illustrate this adaptive reward shaping mechanism. Almost all barrack-building events occur in SubEnv-B, and the overall reward for building a barrack is +1. At first, the agent in the main environment cannot build barracks spontaneously from the initial states. As training proceeds, once the agent becomes able to build barracks in the main environment, the overall reward for building a barrack adaptively

Table 1: The settings of the sub-environments for the task Build Marines. The middle 3 columns give the configuration of the initial states, and the last 4 columns show the rewards for the actions Build/Destroy/Attack. Here [a, b] denotes a number drawn at random in the range a to b. 'Center' is the Command center in the game.

            |       Initial State        |           Build/Destroy/Attack
            | Mineral   Depot   Barrack  |  Center   Depot    Barrack   Marine
  OrigEnv   | 50        0       0        |  —        —        —         +1/0/0
  SubEnv-A  | [50,200]  0       0        |  —        +1/-1/0  —         —
  SubEnv-B  | [50,200]  [5,20]  0        |  —        —        +1/-1/0   —
  SubEnv-C  | [50,200]  [5,20]  [1,10]   |  0/0/-1   0/0/-1   0/0/-1    +1/-1/-1
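The initial-state configurations of Table 1 can be captured in a small, hypothetical config structure; the dictionary keys and the sampler below are our own naming, but the numeric ranges mirror the table.

```python
import random

# Initial-state ranges from Table 1; (a, b) means a value drawn uniformly in [a, b].
SUB_ENVS = {
    "OrigEnv":  {"mineral": (50, 50),  "depot": (0, 0),  "barrack": (0, 0)},
    "SubEnv-A": {"mineral": (50, 200), "depot": (0, 0),  "barrack": (0, 0)},
    "SubEnv-B": {"mineral": (50, 200), "depot": (5, 20), "barrack": (0, 0)},
    "SubEnv-C": {"mineral": (50, 200), "depot": (5, 20), "barrack": (1, 10)},
}

def sample_initial_state(name, rng=random):
    """Draw one initial state for the named environment."""
    cfg = SUB_ENVS[name]
    return {unit: rng.randint(lo, hi) for unit, (lo, hi) in cfg.items()}
```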

[Figure 3 panels: (a) initial state of SubEnv-A, showing only minerals; (b) initial state of SubEnv-B, showing supply depots; (c) initial state of SubEnv-C, showing supply depots and barracks.]

Figure 3: The initial states of the auxiliary environments SubEnv-A, B, and C. The initial state of SubEnv-A is similar to that of the main environment. SubEnv-B starts with some randomly positioned supply depots, while SubEnv-C is in addition initialized with some random barracks.

changes to a lower level. According to Equation (13), the overall reward for building barracks becomes smaller than +1, because there is no reward for building a barrack in the main environment. We conclude that once the agent learns to accomplish a sub-task starting from the initial states of the main environment, the reward for that sub-task decreases. On the one hand, high rewards guide the agent to master tasks quickly in the beginning; on the other hand, the adaptive reduction of rewards prevents the agent from being trapped by the rewards of the sub-tasks. This adaptive mechanism is demonstrated in Section 4.1.

4. Experiment

In this section, we compare the PMES algorithm with four commonly used baselines: the A2C [43, 44], Reward Shaping, Curriculum Learning, and PLAID [25] algorithms. All of them are implemented on top of the A2C algorithm with the objective function defined in Equation (8). The structure of the A2C model used by the PMES algorithm is illustrated in Figure 1. The input of the network includes two sources of information: spatial features and non-spatial features. The spatial features include information from the mini-map and the game screen, while the non-spatial features include additional information such as the remaining minerals or the available actions. The network outputs both values and actions. The actions are then passed to PySC2 [43], a Python library for the StarCraft II Learning Environment (SC2LE), which provides an interface for RL agents to interact with the game. We implement the algorithms in TensorFlow [45] and run them on GPUs. Due to the constraint on computation resources, the screen and mini-map resolutions are set to 16 × 16 pixels, and the total number of paralleled environments is fixed at 16 for all algorithms.

The environments used are listed in Table 1: the original environment OrigEnv and the 3 auxiliary environments SubEnv-A, SubEnv-B, and SubEnv-C. Their designs were discussed in Section 3.2. Below, we describe each algorithm.

A2C Algorithm. The vanilla A2C is implemented with 16 paralleled original environments. A detailed grid search is performed to find the best hyper-parameters, which are then adopted for all other compared algorithms, including the PMES algorithm. Specifically, we use the Adam optimizer with learning rate 1e-5 and (β1, β2) = (0.9, 0.999). In the loss of Equation (8), the value loss coefficient c1 is 0.25 and the entropy coefficient c2 is 9e-5. The discount factor γ is 0.99, and gradients are clipped at norm 1. We also trained the other six mini-games with the same hyper-parameters and observed that they achieve high scores, which ensures that the hyper-parameters do not overfit to one mini-game.
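The reported hyper-parameters can be collected into a single config; the key names below are our own, while the values are those stated in the text.

```python
# Hyper-parameters of the A2C baseline, shared by all compared algorithms.
A2C_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-5,
    "adam_betas": (0.9, 0.999),
    "value_loss_coef": 0.25,    # c1 in Equation (8)
    "entropy_coef": 9e-5,       # c2 in Equation (8)
    "discount_gamma": 0.99,
    "max_grad_norm": 1.0,
    "num_parallel_envs": 16,
    "screen_resolution": (16, 16),
}
```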

Reward Shaping Algorithm. We alter the original environment by adding reward shaping functions and denote the resulting environment as ShapingEnv. Specifically, when a supply depot is built (destroyed), the agent is rewarded (deducted) m points; when a barrack is built (destroyed), the agent is rewarded (deducted) n points; and when a marine is created (killed), the agent is rewarded (deducted) p points. As in SubEnv-A, B, and C, the agent is deducted 1 point when the command center, supply depots, or barracks are attacked. The best configuration of (m, n, p) was found by grid search in the experiments, namely (m, n, p) = (1/9, 1/3, 1).
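The ShapingEnv reward signal described above can be sketched as an event-to-reward lookup; the event names are hypothetical, while the magnitudes are the grid-searched (m, n, p) = (1/9, 1/3, 1).

```python
def shaping_reward(event, m=1/9, n=1/3, p=1.0):
    """Reward of the hand-shaped environment ShapingEnv for one game event."""
    table = {
        "depot_built": m,     "depot_destroyed": -m,
        "barrack_built": n,   "barrack_destroyed": -n,
        "marine_created": p,  "marine_killed": -p,
        "structure_attacked": -1.0,  # command center, depots, or barracks hit
    }
    return table.get(event, 0.0)
```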

Curriculum Learning Algorithm. We use the three auxiliary environments to form a curriculum before training in the original environment. First, the agent is trained under SubEnv-A to learn the 'build supply depot' sub-task for 50K updates. Then, the model is trained under SubEnv-B to learn 'build barracks' for 50K updates, and next under SubEnv-C to learn 'build marines' for another 50K updates. In each stage, the model is initialized with the weights from the previous stage. Lastly, the agent is trained under the original environment OrigEnv for 300K updates.
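The four-stage schedule above can be written as a simple chained loop; `train_stage` is a stand-in for an actual A2C training run and is supplied by the caller.

```python
CURRICULUM = [("SubEnv-A", 50_000), ("SubEnv-B", 50_000),
              ("SubEnv-C", 50_000), ("OrigEnv", 300_000)]

def run_curriculum(train_stage, stages=CURRICULUM):
    """Run the stages in order, carrying weights from each stage to the next.
    `train_stage(env_name, num_updates, weights)` returns updated weights;
    `weights=None` means random initialization in the first stage."""
    weights = None
    for env_name, num_updates in stages:
        weights = train_stage(env_name, num_updates, weights)
    return weights
```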

PLAID Algorithm. The PLAID algorithm improves over curriculum learning by inserting knowledge 'distillation' steps. First, an agent is trained under SubEnv-A and learns the policy πA. Then, an agent is trained under SubEnv-B with weights initialized from πA; the learned policy is denoted πB. After that, a new agent is trained to distill the knowledge of the experts πA and πB using the DAGGER algorithm [25, 46]; the distilled policy is called π̂B. Next, an agent initialized with the parameters of π̂B learns SubEnv-C, yielding πC. Another distillation step then distills π̂B and πC into π̂C. Lastly, we train the agent under the original environment OrigEnv, starting from π̂C, for 300K updates.
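The distillation steps fit a student to expert policies by supervised regression; a common choice is the cross-entropy between the expert's and student's action distributions at visited states. The function below is a generic sketch of such a loss, not PLAID's exact objective.

```python
import math

def distill_loss(student_probs, expert_probs, eps=1e-12):
    """Cross-entropy H(expert, student) at one state: minimized when the
    student's action distribution matches the expert's."""
    return -sum(e * math.log(s + eps)
                for s, e in zip(student_probs, expert_probs))
```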

PMES Algorithm. The proposed PMES algorithm is implemented with 16 paralleled environments containing both the main and the auxiliary environments, i.e., OrigEnv, SubEnv-A, SubEnv-B, and SubEnv-C. Since the number of copies of each environment influences the performance of the algorithm, the best ratio of sub-environments was found by grid search.
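Allocating the 16 parallel workers across the four environment types can be sketched with largest-remainder rounding; the grid-searched ratio itself is not stated here, so the ratios in the examples are placeholders.

```python
def allocate_envs(ratios, total=16):
    """Split `total` parallel environments across types proportionally to
    `ratios`, handing leftovers to the largest fractional remainders."""
    s = sum(ratios)
    raw = [r / s * total for r in ratios]
    counts = [int(x) for x in raw]
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order:
        if sum(counts) == total:
            break
        counts[i] += 1
    return counts
```

For instance, with equal ratios for four environment types, `allocate_envs([1, 1, 1, 1])` splits the 16 workers evenly.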

Figure 4: Comparison of the mean episodic rewards of A2C, Curriculum Learning, Reward Shaping, PLAID, and PMES. For Curriculum Learning and PLAID, only their last stages are drawn. The Score is the mean episodic reward for building marines in the original environment. Each training curve is the average of 5 runs, and the shaded region represents the 90% confidence interval.

4.1. Result Analysis

Figure 4 shows the training processes of the five algorithms, and Table 2 summarizes the maximum and average scores of the five algorithms and of existing work at the end of training, together with their scores at test time. Furthermore, a game screenshot for each algorithm is shown in Figure 5. We also provide recordings of the game replays of all algorithms in the supplementary video. In


[Figure 5 panels, each with supply depots, barracks, and marines labeled: (a) A2C, (b) Reward Shaping, (c) Curriculum Learning, (d) PLAID, (e) PMES.]

Figure 5: The replay scenes of the 5 algorithms. All snapshots are captured in the last 1 minute of an episode (total duration is 15 min). Important units are labeled in the images. These screenshots illustrate how many marines each algorithm can build, and also intuitively reveal the reasons if they fail to build marines.


Table 2: Comparison of the scores of the PMES algorithm with 4 baseline algorithms and other existing work. The score is the reward for building marines in the original environment. The max (mean) score is the maximum (average) reward over all paralleled main environments and all 5 runs. The train results show the performance near the end of training at 300K updates, and the test results are statistics from running the agent in evaluation mode for 1024 × 5 episodes (1024 episodes per run). K, M, and B mean thousand, million, and billion, respectively.

  Settings              | Main Env   | Sub-Envs | Mean Reward (Train) | Max Reward (Train) | Mean Reward (Test) | Max Reward (Test) | Updates
  A2C                   | OrigEnv    | —        | 8                   | 49                 | 12.9               | 44                | 300K
  Reward Shaping        | ShapingEnv | —        | 20                  | 120                | 29.5               | 126               | 300K
  Curriculum Learning   | OrigEnv    | A, B, C  | 27                  | 108                | 48.6               | 111               | 300K
  PLAID                 | OrigEnv    | A, B, C  | 57                  | 122                | 51.7               | 123               | 300K
  PMES                  | OrigEnv    | A, B, C  | 85                  | 129                | 91.6               | 134               | 300K
  Grandmaster [43]      | OrigEnv    | —        | —                   | —                  | —                  | 142               | —
  FullyConv LSTM [43]   | OrigEnv    | —        | —                   | —                  | —                  | 62                | 600M
  Relational agent [42] | OrigEnv    | —        | —                   | —                  | 123                | —                 | 10B

the results, the PMES algorithm achieves the highest score, and it is also very efficient in that it reaches high scores after 100K updates. Compared to PMES, the maximum scores and sample efficiency of the other 4 algorithms are worse. From the results in Table 2, the PMES algorithm reaches 134 points at test time, the best among the five algorithms and very close to the level (142 points) of a StarCraft Grandmaster [43]. The mean scores of A2C, Reward Shaping, and Curriculum Learning are very low, which means these algorithms do not learn to build marines well. PLAID has a medium performance with a mean reward of 57. Although the mean score of the PMES algorithm does not exceed that of the Relational agent [42], it takes far fewer updates to reach a comparable level (300 thousand versus 10 billion). Therefore, we conclude that our algorithm is superior to the others.

In the following, we analyze the training process of each algorithm qualitatively according to the changing of the scores. Recalling Equations (1), (2) and (5), the value of a state directly reflects how many rewards the agent can gather in the future. Here the agent gets a reward when a task or subtask is done, so the value of the initial state is an indicator of whether the agent has mastered the task. Next, we illustrate the adaptive mechanism of PMES through the changing values of the three types of initial states.
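As a minimal illustration of why the initial-state value tracks task mastery, the sketch below estimates the value of an initial state as a Monte Carlo average of discounted returns over sampled episodes. The episode rewards and discount factor here are hypothetical stand-ins, not taken from the paper's setup, where each data point instead averages the critic's output over 704 sampled initial states.

```python
# Hedged sketch: estimate the value of an initial state as the average
# discounted return over sampled episodes. Episode reward sequences are
# made up for illustration.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def estimate_initial_state_value(episodes, gamma=0.99):
    """Monte Carlo estimate of V(s0) from a list of reward sequences."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# An agent that often finishes the subtask (+1 at step 5) has a higher
# initial-state value than one that never does.
succeeding = [[0, 0, 0, 0, 0, 1]] * 8 + [[0, 0, 0, 0, 0, 0]] * 2
failing = [[0, 0, 0, 0, 0, 0]] * 10
assert estimate_initial_state_value(succeeding) > estimate_initial_state_value(failing)
```

In this sense, a rising value of an initial state signals that the agent expects, and therefore has learned to collect, the corresponding reward.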

The Training Process of the A2C Algorithm. Figure 6 shows the training process of the A2C algorithm in detail. From Figure 6(a), the score is unstable during training and fluctuates around 12. Figure 6(b) shows the change of the values of the initial states of SubEnv-A, B, C. The value of the C curve gradually increases as the training progresses, which indicates that the agent occasionally receives the reward for building marines. However, the value of the A curve is very low, which means the agent cannot stably build marines from the initial state of the original environment. This is verified intuitively by the game screenshot in Figure 5(a): compared with the screenshot of PMES in Figure 5(e), few supply depots are built by the A2C agent. Thus, the reason for its failure is that the agent does not even learn the first step of building marines.

Another interesting phenomenon in Figure 5(a) is that barracks might be destroyed by the randomly built marines without attack punishment.

Figure 6: The training process of the A2C algorithm. Note that only the original environments are involved. (a) The score is the reward of building marines. (b) The change of the values of the initial states (see Figure 3) of SubEnv-A, B, C. Each data point is the average value over 704 sampled initial states in the same environment. To clearly illustrate the trends, we fit the curves with polynomials over a sliding window, denoted as A (B, C) Value Trend.

Figure 7: The training process of the Reward Shaping algorithm. (a) The reward of the Reward Shaping algorithm has two components: Main Rewards is the reward for building marines; Shaping Rewards is the auxiliary reward, such as for building supply depots and barracks; All Rewards is the actual reward used in training, which equals the sum of Main Rewards and Shaping Rewards. (b) The change of values during the training process of the Reward Shaping algorithm. The meanings of the A, B, C curves are the same as in Figure 6.

The Training Process of the Reward Shaping Algorithm. From the screenshot in Figure 5(b), we observe that the agent devotes most resources to building supply depots, a situation that is also reflected in the changing rewards in Figure 7(a). There are 3 curves: the main rewards curve is the reward for building marines, which is inherited from the original environment; the shaping rewards curve is the sum of auxiliary rewards for actions such as building/destroying/attacking supply depots and barracks (excluding the action of building marines); and their sum, the all rewards curve, is the actual reward used in training. The shaping reward of the Reward Shaping algorithm stabilizes around 5 during training, which is very large considering that building one supply depot only gains a +1/9 reward. As a result, the agent does not have enough resources to build barracks and marines. In Figure 7(b), the values of the A and B curves stay almost unchanged during training. Without the adaptive shaping mechanism, the agent is lured by the easy, small immediate reward of building supply depots, and fails to explore for the long-term, large reward of building marines. Thus, the reasons for the poor performance of the Reward Shaping algorithm are its short-sighted strategy and bad allocation of resources.
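As a minimal sketch (not the paper's implementation), static reward shaping simply adds fixed auxiliary bonuses to the environment reward; the event names and magnitudes below are hypothetical stand-ins for the building bonuses described above, and the example shows how small fixed bonuses can dominate when the main reward stays at zero:

```python
# Hedged sketch of static (non-adaptive) reward shaping: the training
# reward is the main reward plus fixed auxiliary bonuses. Event names
# and magnitudes are illustrative only.

SHAPING_BONUSES = {
    "build_supply_depot": 1.0 / 9.0,  # small immediate bonus
    "build_barracks": 1.0 / 9.0,
}

def shaped_reward(main_reward, events):
    """All Rewards = Main Rewards + Shaping Rewards (fixed bonuses)."""
    shaping = sum(SHAPING_BONUSES.get(e, 0.0) for e in events)
    return main_reward + shaping, shaping

total, shaping = shaped_reward(0.0, ["build_supply_depot"] * 3)
# Three depots already yield 3/9 of shaping reward with zero main reward,
# illustrating how the agent can be lured away from building marines.
assert abs(shaping - 3.0 / 9.0) < 1e-9
assert total == shaping
```

Because the bonuses never change, nothing in this scheme discourages the agent from farming the auxiliary rewards forever.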

The Training Process of the Curriculum Learning Algorithm. The training process is highly unstable. In 5 runs of the algorithm, the agent builds about 70 marines only once; in the other four runs, the agent builds only 5 to 10 marines. This implies that in most circumstances the Curriculum Learning algorithm cannot stably instruct the agent to learn the steps of building marines. Figure 8(a) shows the training process of the Curriculum Learning algorithm at each stage. The process is divided into 4 stages according to the training environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The scores represent the rewards under the corresponding sub-environments. For example, the score of the model under SubEnv-A is the sum of the rewards for building a supply depot and the penalties for destroying it. From the figure, we can see that the score of each stage can reach a relatively high level, which indicates that the sub-tasks are easy to learn. But the score under OrigEnv is only around 5 points, which means that learning in the early stages has little effect on building marines. Further, the values of the initial states of the three sub-environments in Figure 8(b) differ greatly, which implies that the agent quickly forgets in a new environment what it has learned in the previous curriculum. In the screenshot shown in Figure 5(c), there is a very limited number of supply depots and barracks, which confirms the catastrophic forgetting problem: the agent forgets how to build supply depots and barracks.
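The staged schedule above can be sketched as a plain sequential training loop. Everything in this sketch is hypothetical: the "policy" is just a dict of skill strengths, and each stage reinforces only its own skill while the others decay, which caricatures why skills learned in early stages can be overwritten when there is no rehearsal of old tasks.

```python
# Hedged toy sketch of a curriculum schedule: one policy is fine-tuned on
# a sequence of environments. Each stage strengthens its own skill and
# lets unused skills decay, mimicking catastrophic forgetting.

CURRICULUM = ["SubEnv-A", "SubEnv-B", "SubEnv-C", "OrigEnv"]
SKILL_OF = {"SubEnv-A": "depot", "SubEnv-B": "barracks",
            "SubEnv-C": "marine", "OrigEnv": "marine"}

def train_stage(policy, env_name, updates=3, decay=0.5):
    """Strengthen the stage's skill; other skills decay (no rehearsal)."""
    for _ in range(updates):
        for skill in policy:
            if skill == SKILL_OF[env_name]:
                policy[skill] += 1.0
            else:
                policy[skill] *= decay
    return policy

policy = {"depot": 0.0, "barracks": 0.0, "marine": 0.0}
for env_name in CURRICULUM:
    policy = train_stage(policy, env_name)

# By the end, the depot skill learned in stage A has mostly decayed away.
assert policy["depot"] < policy["marine"]
```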

The Training Process of the PLAID Algorithm. Different from Curriculum Learning, the PLAID algorithm contains 6 stages, denoted by SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv, respectively. The new stages are the two 'distillation' processes. In Distill-AB, an agent is trained to learn knowledge from two expert policies which were previously trained in stages SubEnv-A and SubEnv-B. After the process of Distill-AB, the agent masters the knowledge of both SubEnv-A and SubEnv-B, as indicated by the reward curves shown in Figure 9(a). There are two reward curves in the distillation stages, corresponding to the performance of the agent under the two related sub-environments. At the end of Distill-BC, the agent remembers how to build barracks (indicated by the reward curve of SubEnv-B) and how to build marines (indicated by the reward curve of SubEnv-C). The distillation process mitigates the catastrophic forgetting problem of transfer learning, and the agent is able to build marines at the very beginning of stage OrigEnv. Also, in Figure 9(b), the C value is lower than the B

value, because the agent has not learned the knowledge of SubEnv-C during the process of Distill-AB. In the Distill-BC stage, the C value exceeds the B value, which shows that the agent has learned how to build marines after being trained under SubEnv-C. In the end, the agent achieves a better score than the agent of the Curriculum Learning algorithm, so distillation is effective for retaining knowledge. The screenshot in Figure 5(d) suggests an explanation: the building of marines is restricted by the number of barracks. Thus the agent has not learned how to better allocate resources. Further performance improvements for the PLAID algorithm might require more training updates. However, unlike the PMES algorithm, PLAID lacks an adaptive reward mechanism and does not allocate resources efficiently.

Figure 8: The training process of the Curriculum Learning algorithm. (a) The rewards of Curriculum Learning. It contains 4 stages under different environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The score of each stage is the reward under its corresponding environment. (b) The change of values during the training process of Curriculum Learning. The meanings of the A, B, C curves are the same as in Figure 6.
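As a hedged, minimal sketch of the distillation idea (not the paper's or PLAID's actual implementation), a student policy can be fit to expert action distributions by minimizing the average cross-entropy over states collected from both experts' environments. The tabular policies and action sets below are hypothetical.

```python
import math

# Hedged sketch of policy distillation over discrete action distributions:
# the student minimizes cross-entropy to each expert on that expert's states.
# Policies are plain probability lists; real implementations use networks.

def cross_entropy(p_expert, p_student):
    """H(p_expert, p_student) = -sum_a p_expert(a) * log p_student(a)."""
    return -sum(pe * math.log(ps + 1e-12) for pe, ps in zip(p_expert, p_student))

def distill_loss(expert_batches, student):
    """Average cross-entropy of the student against all expert targets."""
    losses = [cross_entropy(target, student) for _, target in expert_batches]
    return sum(losses) / len(losses)

# Two hypothetical experts over 2 actions (e.g. 'build depot', 'build barracks').
batch = [("state_from_SubEnv-A", [0.9, 0.1]),
         ("state_from_SubEnv-B", [0.1, 0.9])]

# A student that balances both skills incurs a lower distillation loss than
# a student that only mimics one expert, so distillation retains both skills.
balanced = [0.5, 0.5]
one_sided = [0.9, 0.1]
assert distill_loss(batch, balanced) < distill_loss(batch, one_sided)
```

Minimizing this loss is what lets the distilled agent keep the skills of both experts instead of forgetting one of them.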

Figure 9: The training process of the PLAID algorithm. (a) The rewards of the PLAID algorithm. It contains 6 stages under different environments: SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv. The score of each stage is the reward under its corresponding environment. Note that the distillation stages involve two environments at the same time. (b) The change of values during the training of PLAID. The meanings of the A, B, C curves are the same as in Figure 6.

The Training Process of the PMES Algorithm. At the beginning of the training process shown in Figure 10, the values of A, B, and C all increase concurrently, which shows the efficiency of parallelism in learning the sub-tasks. Referring to Figure 10(a), after 60K updates the agent has learned from scratch how to build marines in the main environment, so the overall perceived rewards for building supply depots and barracks are reduced by the adaptive shaping mechanism. As a result, the values of curves A and B decrease after 60K. In contrast, the value of curve C is mainly affected by the rewards for building marines and remains relatively stable during training. In particular, comparing the changing values of the PMES algorithm in Figure 10 with those of the other algorithms in Figures 6(b), 7(b) and 8(b), we find that the adaptive reward shaping is unique to the PMES algorithm. In the screenshot shown in Figure 5(e), the proportion of supply depots and barracks is well balanced, and a lot of marines are being built. This implies that the agent has learned to build marines step by step, and can allocate resources reasonably based on the mechanism of adaptive reward shaping. Therefore, the PMES algorithm clearly outperforms the other algorithms.
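The adaptive effect described above can be sketched as follows; the exact adaptation rule of PMES is defined earlier in the paper, and the linear decay used here is a simplified, hypothetical stand-in. The idea shown is only that the weight on auxiliary rewards shrinks as the main-task reward grows, so shaping fades once the agent masters the task.

```python
# Hedged sketch of adaptive reward shaping: the auxiliary (shaping) reward
# is down-weighted as the running average of the main reward increases.
# The decay rule is illustrative, not the paper's exact mechanism.

class AdaptiveShaper:
    def __init__(self, momentum=0.9, scale=10.0):
        self.avg_main = 0.0      # running average of the main reward
        self.momentum = momentum
        self.scale = scale       # main-reward level at which shaping vanishes

    def weight(self):
        """Shaping weight in [0, 1], shrinking as the main task is mastered."""
        return max(0.0, 1.0 - self.avg_main / self.scale)

    def reward(self, main_reward, shaping_reward):
        self.avg_main = (self.momentum * self.avg_main
                         + (1 - self.momentum) * main_reward)
        return main_reward + self.weight() * shaping_reward

shaper = AdaptiveShaper()
early = shaper.reward(main_reward=0.0, shaping_reward=1.0)  # shaping dominates
for _ in range(100):
    shaper.reward(main_reward=10.0, shaping_reward=1.0)     # agent builds marines
late_weight = shaper.weight()
assert early > 0.9 and late_weight < 0.1
```

This mirrors the behavior of curves A and B in Figure 10(b): once the main reward is reliably collected, the perceived auxiliary rewards, and hence those values, decrease.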

Figure 10: The training process of the PMES algorithm. (a) The rewards of the PMES algorithm. (b) The change of values during the training of the PMES algorithm. The meanings of the A, B, C curves are the same as in Figure 6.

4.2. Ablation Study

In this section, we demonstrate the importance of domain knowledge in the design of the auxiliary environments, and study the design choices in the PMES algorithm.

Importance of Domain Knowledge. Recall that in the design of the sub-environments, we not only consider the build action for buildings and marines, but also take into account the destroy and attack actions. To demonstrate the importance of this domain knowledge, we design two comparative experiments that gradually remove domain knowledge from the environments used in the PMES algorithm. The first one is called the PMES_RP model. It consists of 3 sub-environments SubEnv-A, SubEnv-B, SubEnv-C** and the main environment OrigEnv, where SubEnv-C** removes the attack action from the SubEnv-C sub-environment. Another model, the PMES_R model, removes both the attack and destroy actions from the 3 sub-environments and the main environment; the resulting environments are SubEnv-A*, SubEnv-B*, SubEnv-C* and OrigEnv. The settings of the new sub-environments are listed in Table 3. Their results are shown in Figure 11. Note that after removing the attack action, the mean score of the PMES_RP model is far less than that of the original PMES model. Further, the PMES_R model, which removes both attack and destroy, gets a lower score than PMES_RP, and also converges more slowly. Therefore, adding more domain knowledge into the model can greatly improve the performance of the algorithm.

Table 3: Configuration of the extra auxiliary environments. Their initial states are omitted since they are the same as in Table 1. The 3 numbers in each cell correspond to the rewards for the actions Build/Destroy/Attack. 'Center' is the Control center in the game.

| Setting | Center | Depot | Barrack | Marine |
|---|---|---|---|---|
| SubEnv-A* | — | +1/0/0 | — | — |
| SubEnv-B* | — | — | +1/0/0 | — |
| SubEnv-C* | — | — | — | +1/0/0 |
| SubEnv-C** | — | — | — | +1/-1/0 |

Table 4: Comparison of using one sub-environment (A, B, or C) in the PMES algorithm. Each setting is trained for 300K updates with 2 runs. The Configuration columns give the number of parallel copies of each environment.

| Setting | OrigEnv | SubEnv-A | SubEnv-B | SubEnv-C | Mean Reward | Max Reward |
|---|---|---|---|---|---|---|
| Setting A | 10 | 6 | 0 | 0 | 5 | 10 |
| Setting B | 10 | 0 | 6 | 0 | 45 | 69 |
| Setting C | 10 | 0 | 0 | 6 | 32 | 65 |

The Sensitivity to the Number of Environments. In this work, the model is trained on 16 paralleled environments. We analyze the impact of the number of each type of environment on the performance of the model, as shown in Figure 12. There are some differences between the results of the different settings, so the performance of the PMES algorithm is related to the number of each type of environment. Nonetheless, even under an imperfect configuration of environments, the PMES algorithm performs better than all the compared baseline algorithms (note that the experiments in Figure 12 are trained for only 75K updates).

Figure 11: Three settings encode gradually more knowledge into the sub-environment design. The PMES_R model rewards the building actions; the PMES_RP model also punishes the destroying consequences; the PMES model in addition discourages the attack actions. Note that the more knowledge is encoded, the better the performance. 5 runs are performed for each setting.

Figure 12: The effect of environment configurations. The four numbers of a configuration represent the counts of the four environments (OrigEnv, SubEnv-A, SubEnv-B and SubEnv-C); the tested configurations are 8-2-4-2, 9-1-4-2, 9-2-3-2, 10-1-3-2, and 10-2-2-2, with 3 experiments each. Each experiment is trained for 75K updates.

The Role of the Auxiliary Environments. In order to show the importance of each auxiliary environment, we single out SubEnv-A, SubEnv-B, and SubEnv-C respectively, and use only one of them as the auxiliary environment. The results are shown in Table 4, where the hyper-parameter settings are the same as before. By comparing the results of the different settings, we conclude that introducing even part of the auxiliary environments alongside the main environment can effectively guide the agent to learn. Among them, SubEnv-B and SubEnv-C are more effective than SubEnv-A. This is reasonable, since SubEnv-A would misguide the agent into spending resources on building supply depots rather than building marines.

5. Conclusion

In this paper, we propose a novel deep reinforcement learning algorithm, called the PMES algorithm, to solve the problems of complex multi-step tasks. It introduces auxiliary environments alongside the main environment to effectively convey domain knowledge to the learning agent. In the PMES algorithm, the agent interacts with different environments at the same time, leading to a speedup of the training process. More importantly, the PMES algorithm has an adaptive shaping mechanism, in which the overall reward function adjusts adaptively as the training progresses. Not only can the mechanism of adaptive reward shaping guide the agent to find the way to complete the task, but it can also prevent the agent from being trapped in short-sighted, sub-optimal policies. Experimental results on the StarCraft II mini-game demonstrate the effectiveness of the proposed PMES algorithm, which is much more effective and efficient than the traditional A2C, Reward Shaping, Curriculum Learning, and PLAID algorithms.

Acknowledgment

We would like to thank the members of The Killers team and their instructors Lisen Mu and Jingchu Liu in DeeCamp 2018 for the insightful discussions. This work was supported by three grants from the National Natural Science Foundation of China (No. 61976174, No. 61877049, and No. 11671317), and in part by the program of the China Scholarship Council (No. 201906280201).

Conflict of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

[1] S. Li, L. Ding, H. Gao, C. Chen, Z. Liu, Z. Deng, Adaptive neural network tracking control-based reinforcement learning for wheeled mobile robots with skidding and slipping, Neurocomputing 283 (2018) 20–30.

[2] F. Li, Q. Jiang, S. Zhang, M. Wei, R. Song, Robot skill acquisition in assembly process using deep reinforcement learning, Neurocomputing 345 (2019) 92–102.

[3] M. Wulfmeier, D. Z. Wang, I. Posner, Watch this: Scalable cost-function learning for path planning in urban environments, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 2089–2095.

[4] A. E. Sallab, M. Abdou, E. Perot, S. Yogamani, Deep reinforcement learning framework for autonomous driving, Electronic Imaging 2017 (2017) 70–76.

[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484.

[6] H. Jiang, H. Zhang, K. Zhang, X. Cui, Data-driven adaptive dynamic programming schemes for non-zero-sum games of unknown discrete-time nonlinear systems, Neurocomputing 275 (2018) 649–658.

[7] H. Jiang, H. Zhang, X. Xie, J. Han, Neural-network-based learning algorithms for cooperative games of discrete-time multi-player systems with control constraints via adaptive dynamic programming, Neurocomputing 344 (2019) 13–19.

[8] Z. Wang, L. Liu, H. Zhang, G. Xiao, Fault-tolerant controller design for a class of nonlinear MIMO discrete-time systems via online reinforcement learning algorithm, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (2015) 611–622.

[9] A. Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in: ICML, volume 99, 1999, pp. 278–287.

[10] G. Konidaris, A. Barto, Autonomous shaping: Knowledge transfer in reinforcement learning, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 489–496.

[11] E. Wiewiora, G. W. Cottrell, C. Elkan, Principled methods for advising reinforcement learning agents, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 792–799.

[12] A. Laud, G. DeJong, The influence of reward on the speed of reinforcement learning: An analysis of shaping, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 440–447.

[13] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, J. T. Springenberg, Learning by playing - solving sparse reward tasks from scratch, in: Proceedings of the 35th International Conference on Machine Learning, volume 80, 2018, pp. 4344–4353.

[14] B. Marthi, Automatic shaping and decomposition of reward functions, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 601–608.

[15] S. M. Devlin, D. Kudenko, Dynamic potential-based reward shaping, in: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, 2012, pp. 433–440.

[16] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, S. Hochreiter, RUDDER: Return decomposition for delayed rewards, in: Advances in Neural Information Processing Systems, 2019, pp. 13544–13555.

[17] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, P. Abbeel, Overcoming exploration in reinforcement learning with demonstrations, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 6292–6299.

[18] S. Narvekar, J. Sinapov, M. Leonetti, P. Stone, Source task creation for curriculum learning, in: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, IFAAMAS, 2016, pp. 566–574.

[19] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, P. Abbeel, Reverse curriculum generation for reinforcement learning, in: Conference on Robot Learning, 2017, pp. 482–495.

[20] M. Svetlik, M. Leonetti, J. Sinapov, R. Shah, N. Walker, P. Stone, Automatic curriculum graph generation for reinforcement learning agents, in: The AAAI Conference on Artificial Intelligence, 2017, pp. 2590–2596.

[21] W. M. Czarnecki, S. M. Jayakumar, M. Jaderberg, L. Hasenclever, Y. W. Teh, N. Heess, S. Osindero, R. Pascanu, Mix & match - agent curricula for reinforcement learning, in: International Conference on Machine Learning, 2018, pp. 1095–1103.

[22] Z. Ren, D. Dong, H. Li, C. Chen, Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 2216–2226.

[23] N. Jiang, S. Jin, C. Zhang, Hierarchical automatic curriculum learning: Converting a sparse reward navigation task into dense reward, Neurocomputing (2019).

[24] E. Parisotto, J. L. Ba, R. Salakhutdinov, Actor-mimic: Deep multitask and transfer reinforcement learning, in: 4th International Conference on Learning Representations, 2016.

[25] G. Berseth, C. Xie, P. Cernek, M. Van de Panne, Progressive reinforcement learning with distillation for multi-skilled motion control, in: 6th International Conference on Learning Representations, 2018.

[26] F. Tanaka, M. Yamamura, Multitask reinforcement learning on the distribution of MDPs, in: Proceedings of the 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, volume 3, IEEE, 2003, pp. 1108–1113.

[27] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: a hierarchical Bayesian approach, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 1015–1022.

[28] M. Snel, S. Whiteson, Multi-task evolutionary shaping without pre-specified representations, in: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, ACM, 2010, pp. 1031–1038.

[29] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: Advances in Neural Information Processing Systems, 2014, pp. 819–827.

[30] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, R. Pascanu, Distral: Robust multitask reinforcement learning, in: Advances in Neural Information Processing Systems, 2017, pp. 4496–4506.

[31] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, K. Kavukcuoglu, Reinforcement learning with unsupervised auxiliary tasks, in: 5th International Conference on Learning Representations, 2017.

[32] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., Learning to navigate in complex environments, in: 5th International Conference on Learning Representations, 2017.

[33] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, N. Freitas, The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously, in: Conference on Robot Learning, 2017, pp. 207–216.

[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, 2016, pp. 1928–1937.

[35] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. G. Bellemare, R. Munos, The Reactor: A fast and sample-efficient actor-critic agent for reinforcement learning, in: International Conference on Learning Representations, ICLR 2018, 2018.

[36] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, in: International Conference on Machine Learning, 2018, pp. 1406–1415.

[37] D. Lee, H. Tang, J. O. Zhang, H. Xu, T. Darrell, P. Abbeel, Modular architecture for StarCraft II with deep reinforcement learning, in: Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.

[38] Z. Pang, R. Liu, Z. Meng, Y. Zhang, Y. Yu, T. Lu, On reinforcement learning for full-length game of StarCraft, in: The AAAI Conference on Artificial Intelligence, 2019, pp. 4691–4698.

[39] T. Rashid, M. Samvelyan, C. S. Witt, G. Farquhar, J. Foerster, S. Whiteson, QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, in: International Conference on Machine Learning, 2018, pp. 4292–4301.

[40] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al., Deep reinforcement learning with relational inductive biases, in: 7th International Conference on Learning Representations, 2019.

[41] J. Gehring, D. Ju, V. Mella, D. Gant, N. Usunier, G. Synnaeve, High-level strategy selection under partial observability in StarCraft: Brood War, in: NeurIPS Workshop on Reinforcement Learning under Partial Observability, 2018.

[42] V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert, T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, P. W. Battaglia, Deep reinforcement learning with relational inductive biases, in: International Conference on Learning Representations, ICLR 2019, 2019.

[43] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al., StarCraft II: A new challenge for reinforcement learning, arXiv preprint arXiv:1708.04782 (2017).

[44] R. Ring, T. Matiisen, Replicating DeepMind StarCraft II reinforcement learning benchmark with actor-critic methods, 2018.

[45] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.

[46] S. Ross, G. J. Gordon, J. A. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in: AISTATS, 2011.