Parallel Multi-Environment Shaping Algorithm for Complex Multi-step Task

Cong Ma^{a,d,†}, Zhizhong Li^{b,†}, Dahua Lin^{c}, Jiangshe Zhang^{a,∗∗}
a School of Mathematics and Statistics, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an, Shaanxi, P.R. China
b SenseTime, No. 12, Science Park East Avenue, HKSTP, Shatin, Hong Kong, P.R. China
c Department of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, P.R. China
d Department of Mathematics, KU Leuven, Celestijnenlaan 200B, Leuven 3001, Belgium
Abstract

Because of the sparse reward and the sequential nature of the task, the complex multi-step task poses a big challenge in reinforcement learning: the agent needs to carry out several consecutive steps to complete the whole task without any intermediate reward. Reward shaping and curriculum learning are often used to address this challenge, but reward shaping is prone to sub-optimal policies and curriculum learning easily suffers from the catastrophic forgetting problem. In this paper, we propose a novel algorithm called the Parallel Multi-Environment Shaping (PMES) algorithm, in which several sub-environments, each corresponding to a key intermediate step, are built based on human knowledge to make the agent aware of the importance of intermediate steps. Specifically, the learning agent is trained under these multiple parallel environments, including the original environment and several sub-environments, by the synchronous advantage actor-critic algorithm. The PMES algorithm also has an adaptive reward shaping mechanism that adjusts the reward function. In this way, PMES effectively incorporates human experience through multiple different environments rather than only shaping the reward function; it combines the benefits of reward shaping and curriculum learning while avoiding their drawbacks. Extensive experiments on the mini-game 'Build Marines' of the StarCraft II environment show that our proposed algorithm is more effective than the reward shaping, curriculum learning, and PLAID algorithms, and comes close to the level of a human Grandmaster. Moreover, compared with the best existing algorithm, it takes less time and fewer computing resources to reach a good result.
Keywords: Reinforcement Learning, Multi-step Task, Parallel Multiple Environments, Adaptive Reward Shaping
† These two authors contributed equally
∗∗Corresponding author
Email address: (Jiangshe Zhang)
Preprint submitted to Journal of Neurocomputing December 11, 2019
1. Introduction
In recent years, reinforcement learning (RL), especially deep reinforcement learning (DRL), which combines RL with deep learning, has received widespread attention in many fields, such as robot control [1, 2], autonomous driving [3, 4], game playing [5, 6], and even optimal control problems [7, 8]. In RL, the agent learns and improves its performance through constant trial and error, finding the optimal path by maximizing the return function. However, due to the huge state space, large action space, and sparse reward function, instructing the agent to accomplish a complex multi-step task the way a human does has long been a major challenge.
In a multi-step task, it is difficult to make the agent aware of the importance of the intermediate steps in the absence of corresponding intermediate rewards, while such tasks are very easy for a human being thanks to prior knowledge or experience. Thus, transferring this knowledge into the algorithm efficiently is very important. Various approaches have been proposed for this problem. Reward shaping algorithms [9–14] manipulate the reward signals from the environment to instruct the agent to learn faster. Ng et al. [9] proved that potential-based shaping preserves the optimal policy. Devlin [15] further generalized potential-based shaping to be time-dependent, called dynamic shaping. RUDDER [16] was proposed to solve the problem of delayed reward by decomposing the original reward into a new MDP that is return-equivalent. But shaping the reward function properly is very difficult, and the agent is prone to getting stuck at a sub-optimal policy [17]. Curriculum learning algorithms [18–23] can also be used to solve multi-step tasks. They divide the whole task into several increasingly difficult sub-tasks in series and train the agent in a meaningful order, because solving simple tasks first helps the agent solve more complex tasks later. As a result, the training complexity of the original task is reduced by the knowledge gained from the previous auxiliary tasks. Narvekar [18] proposed designing a sequence of source tasks for the agent to improve the performance and learning speed of the RL algorithm. Svetlik [20] further presented a method to generate the curriculum automatically from task descriptors. However, this line of work is still in its infancy and the application of these methods is quite restricted. For example, reverse curriculum learning [19] requires that the environment can be evolved back to 'earlier' states, which is not feasible for some environments. Moreover, curriculum learning can suffer from the catastrophic forgetting problem, i.e., knowledge learned previously might be erased while new knowledge is being learned. This problem can be addressed by distillation algorithms, which create a single controller that implements the tasks of a set of experts. For instance, Parisotto [24] proposed adopting a supervised regression objective to learn all expert policies, called distillation. Further, Berseth [25] proposed the PLAID algorithm, which combines transfer learning and policy distillation to learn multiple skills. But it is very difficult to find the best policy based on a mixture of experts. Thus, studying a novel RL algorithm for complex multi-step tasks is particularly important.
In RL, there is much research on multi-task learning [26–33], whose goal is to accomplish all tasks together rather than focusing on one of them. Multi-step task learning is very different from multi-task learning: it puts more emphasis on the sequential relationship among multiple tasks, that is, the sub-tasks must be carried out in order. Furthermore, the rewards are very sparse: the agent gets a reward only after completing the whole task. Thus, we divide the whole task into several sub-tasks according to its steps and train them simultaneously. In previous works, there are some parallel RL algorithms [34–36], in which the agent interacts with multiple copies of a single environment simultaneously to speed up training, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm [34] and the Advantage Actor-Critic (A2C) algorithm. Inspired by them, we can train the agent under multiple different environments to learn different tasks simultaneously, provided their state spaces and action spaces are exactly the same.
In this paper, we propose a novel algorithm, named the Parallel Multi-Environment Shaping (PMES) algorithm, to solve complex multi-step tasks. Considering the lack of rewards for intermediate steps, we introduce the corresponding rewards by building several sub-environments, each of which corresponds to a key intermediate step. Meanwhile, building on parallel RL algorithms, we train the agent under multiple different parallel environments, including the original environment and several sub-environments. In this way, the agent can assign these rewards automatically and achieve the final target.
In recent years, several challenging environments for testing RL algorithms have appeared, notably SC2LE [37–42]. It is an environment for the real-time strategy (RTS) game StarCraft II, released by DeepMind and Blizzard [43], which contains seven mini-games aimed at different elements of playing StarCraft II. One of the mini-games, Build Marines, is much more difficult than the others because it involves multiple decisions, each dependent on the previous ones. Thus, in this paper, we use this mini-game as the platform to evaluate the performance of the PMES algorithm. The experimental results on StarCraft II demonstrate that the proposed algorithm yields good performance, achieving a higher score than other algorithms with faster convergence.
The rest of the paper is organized as follows. Section 2 introduces preliminary concepts and sets up notation. Section 3 presents the PMES algorithm. Section 4 compares PMES with other algorithms qualitatively and quantitatively. Section 5 concludes the paper.
2. Preliminaries
A reinforcement learning task is often modeled as a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, S_0, \mathcal{A}, T, \gamma, R)$, where $\mathcal{S}$ is the state space, $S_0$ is the distribution of initial states, $\mathcal{A}$ is the action space, $T = \{P_{sa}(\cdot) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ is the set of transition probabilities, $\gamma \in (0, 1]$ is the discount factor, and $R(s, a, s')$ is a stochastic reward for moving from state $s$ to state $s'$ by taking action $a$. During training, at any time $t$ the agent interacts with the environment by choosing an action $a_t$, receives the reward $R_t$, and then transits to the next state $s_{t+1}$. The actions are sampled according to its current policy $\pi : \mathcal{S} \to P(\mathcal{A})$, where $P(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$. The discounted return is defined as
$$G_t = \sum_{k=0}^{T-t} \gamma^k R_{t+k}, \tag{1}$$
and the goal is to maximize the expected return from the start distribution, $J = \mathbb{E}_{s_0 \sim S_0}[G_0 \mid s_0]$. The action-value function is the expected return of taking action $a_t$ in state $s_t$,
$$Q^{\pi}(s_t, a_t) := \mathbb{E}[G_t \mid s_t, a_t], \tag{2}$$
and the optimal policy satisfies the optimal Bellman equation,
$$Q^*(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a')\right]. \tag{3}$$
In policy-based model-free methods, the policy $\pi_\theta$ is parameterized by $\theta$ and is optimized by gradient ascent, where the gradient can be estimated by
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]. \tag{4}$$
Here $\tau = \{s_0, a_0, s_1, a_1, \dots, s_T, a_T\}$ is a trajectory sampled following the policy. High variance is a problem of the above gradient estimator; it can be reduced by adopting the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ to replace $G_t$ in Equation (4), where
$$V(s_t) = \mathbb{E}_{a_t}[Q(s_t, a_t)] \tag{5}$$
is the value function of state $s_t$. An estimate of the advantage function is given by
$$A(s_t, a_t) \approx \sum_{i=0}^{k-1} \gamma^i R_{t+i} + \gamma^k V_\omega(s_{t+k}) - V_\omega(s_t), \tag{6}$$
where $V_\omega$ is the value network with parameter $\omega$, and $t + k$ is bounded by $T$. So
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]. \tag{7}$$
In the A2C algorithm, the value network shares parameters with the policy network. Besides, the entropy $H(\pi_\theta(s_t))$ of the policy $\pi$ is added to the objective function to encourage exploration. The loss of the A2C algorithm for some sampled trajectories $\tau$ is
$$\mathbb{E}_{\tau \sim \pi(\tau)}\left[\sum_t \left(-\log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) + c_1 \left(R_t - V_\theta(s_t)\right)^2 - c_2 H(\pi_\theta(s_t))\right)\right], \tag{8}$$
where $c_1, c_2$ are the coefficients controlling the value loss and the entropy, respectively, and $R_t$ is the expected cumulative discounted reward at time $t$.
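To make the structure of Equation (8) concrete, the following is a minimal numeric sketch (ours, not from the paper) of the per-step A2C loss; the input values are hypothetical, and the advantage is treated as a constant, as it would be when gradients are stopped through it:

```python
import math

def a2c_loss(log_prob, advantage, ret, value, entropy, c1=0.25, c2=9e-5):
    """One-step A2C loss in the spirit of Equation (8): policy-gradient
    term, plus c1-weighted value regression loss, minus c2-weighted
    entropy bonus. No gradient flows through the advantage here."""
    policy_loss = -log_prob * advantage      # push up log-prob of advantageous actions
    value_loss = c1 * (ret - value) ** 2     # regress V_theta towards the return R_t
    entropy_bonus = -c2 * entropy            # higher entropy lowers the loss
    return policy_loss + value_loss + entropy_bonus

# Hypothetical numbers for a single transition.
loss = a2c_loss(log_prob=math.log(0.5), advantage=2.0, ret=1.0, value=0.6, entropy=1.1)
```

The default coefficients c1 = 0.25 and c2 = 9e-5 are the values reported later in Section 4 for the experiments.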
Figure 1: Architecture of PMES for StarCraft II. The agent interacts with multiple environments, including main environments and auxiliary environments. The overall structure is based on A2C. The input of the network contains spatial and non-spatial features. Spatial features include screen maps and mini-maps, which abstract away from the RGB images of the game when playing. Non-spatial features contain information such as the number of minerals collected. The output actions consist of game commands and their spatial coordinates.
3. Parallel Multi-Environment Shaping Algorithm
The core of the Parallel Multi-Environment Shaping (PMES) algorithm is building auxiliary environments to assist the agent in learning in the original main environment. For simplicity, we assume the main task and the auxiliary tasks only differ in the initial states and the reward functions. Specifically, we denote the original task in the main environment as an MDP
$$\mathcal{M}^{(0)} = (\mathcal{S}, S_0^{(0)}, \mathcal{A}, T, \gamma, R^{(0)}), \tag{9}$$
and the $N$ kinds of auxiliary tasks as
$$\mathcal{M}^{(n)} = (\mathcal{S}, S_0^{(n)}, \mathcal{A}, T, \gamma, R^{(n)}), \quad n = 1, \dots, N, \tag{10}$$
where $\mathcal{S}$ is the state space, and $S_0^{(0)}$ and $S_0^{(n)}$ are the distributions of initial states in the main environment and the $n$-th auxiliary environment, respectively. Because we use the synchronous, batched deep RL algorithm A2C for training, only one agent interacts with all environments in parallel. From the perspective of the learning agent, the overall mixed environment not only can start at different initial states, but also provides different reward signals. Strictly speaking, the process is a partially observed MDP (POMDP) because it is unknown which exact environment the agent is currently interacting with. However, the purpose of a POMDP is to find the optimal actions for each possible belief over the world states, which deviates from our ultimate goal of optimizing the main task.
The overall architecture of PMES is shown in Figure 1. It follows the settings of the A2C agent in [43, 44]. In the implementation, the policy network and the value network share parameters. One important aspect of the PMES architecture is the interaction of one single learning agent with multiple parallel environments simultaneously. We exploit this feature to enable the adaptive shaping mechanism via sub-environments.
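The interaction pattern can be sketched as follows. This is our illustrative toy, not the paper's implementation: `ToyEnv` is a hypothetical stand-in for a PySC2 environment, and the point is only that one agent synchronously steps a batch of environments that share dynamics but differ in initial states and rewards:

```python
class ToyEnv:
    """Hypothetical environment: all instances share the same state/action
    space and transition dynamics, but may differ in initial state and
    reward function, as in Equations (9) and (10)."""
    def __init__(self, init_state, reward_fn):
        self.init_state, self.reward_fn = init_state, reward_fn
        self.state = init_state

    def step(self, action):
        next_state = self.state + action  # shared transition dynamics T
        reward = self.reward_fn(self.state, action, next_state)
        self.state = next_state
        return next_state, reward

def synchronous_step(envs, policy):
    """One A2C-style synchronous step: the single learning agent acts in
    every environment of the mixed batch and collects all transitions."""
    batch = []
    for env in envs:
        state = env.state
        action = policy(state)
        next_state, reward = env.step(action)
        batch.append((state, action, next_state, reward))
    return batch

# Toy 'sub-task done' signal: +1 reward when state 3 is reached.
reach = lambda s, a, s2: 1.0 if s2 >= 3 else 0.0
# One main environment plus three auxiliary ones starting closer to the goal.
envs = [ToyEnv(0, reach)] + [ToyEnv(2, reach) for _ in range(3)]

batch = synchronous_step(envs, policy=lambda s: 1)  # a fixed toy policy
```

The auxiliary environments, starting nearer the goal, yield reward signals immediately, while the main environment does not; the learner sees both in one batch.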
3.1. A Special Case
Suppose the count of each type of environment is $d^{(n)}$, $n = 0, \dots, N$, where index $0$ is reserved for the main environment. Then the total number of environments is $\sum_n d^{(n)}$, and the fraction of time the agent interacts with each type of environment is $d^{(n)} / \sum_m d^{(m)}$. Further, assume the current policy of the agent is $\pi$, and denote the probability of visiting state $s$ in environment $\mathcal{M}^{(n)}$ as $p_\pi^{(n)}(s)$. Then, state $s$ is accessed from the different environments with probabilities $\rho_\pi^{(n)}(s)$, where
$$\rho_\pi^{(n)}(s) = \frac{d^{(n)} p_\pi^{(n)}(s)}{\sum_m d^{(m)} p_\pi^{(m)}(s)}, \quad n = 0, \dots, N. \tag{11}$$
Similarly, the distribution of source environments for the transition tuple $(s, a, s')$ is
$$\rho_\pi^{(n)}(s, a, s') = \frac{d^{(n)} p_\pi^{(n)}(s, a, s')}{\sum_m d^{(m)} p_\pi^{(m)}(s, a, s')}. \tag{12}$$
Since the policy $\pi$ and the transition probability $P_{sa}$ are the same in all environments, we have $\rho_\pi^{(n)}(s, a, s') = \rho_\pi^{(n)}(s)$. Thus the overall reward function $R_\pi$ (which depends on the policy $\pi$) is
$$R_\pi(s, a, s') = \sum_{n} \rho_\pi^{(n)}(s, a, s')\, R^{(n)}(s, a, s') = \sum_{n} \rho_\pi^{(n)}(s)\, R^{(n)}(s, a, s'). \tag{13}$$
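Equation (13) can be illustrated with a small numeric sketch of ours: the reward the agent perceives for a transition is the visit-ratio-weighted average of the per-environment rewards. The ratios and rewards below are hypothetical:

```python
def perceived_reward(ratios, rewards):
    """Overall reward R_pi(s, a, s') from Equation (13): the average of the
    per-environment rewards R^(n)(s, a, s'), weighted by the probabilities
    rho^(n) that the transition came from environment n."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "the ratios rho^(n) must sum to one"
    return sum(rho * r for rho, r in zip(ratios, rewards))

# Hypothetical numbers for a transition rewarded +1 only in an auxiliary
# environment and not in the main one.
early = perceived_reward([0.05, 0.95], [0.0, 1.0])  # auxiliary dominates visits
late = perceived_reward([0.60, 0.40], [0.0, 1.0])   # main environment catches up
```

As the main environment's share of visits grows, the perceived reward for the same transition shrinks; this is the adaptive shaping effect discussed below.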
In the special case of ergodic MDPs and potential-based shaping rewards [9], the optimal policy is preserved:
Proposition 1. Suppose the state space $\mathcal{S}$ and action space $\mathcal{A}$ of the main MDP $\mathcal{M}^{(0)}$ in Equation (9) are finite, and the auxiliary MDPs are constructed as in Equation (10). Suppose further that $\mathcal{M}^{(0)}$ is ergodic, and the auxiliary rewards are potential-based shaping rewards
$$R^{(n)}(s, a, s') = R^{(0)}(s, a, s') + \gamma \varphi^{(n)}(s') - \varphi^{(n)}(s), \tag{14}$$
where $n = 1, \dots, N$ and $\varphi^{(n)}$ is the potential function for the $n$-th environment. Then the optimal policy $\pi^*$ of the PMES algorithm is also an optimal policy of the original main environment $\mathcal{M}^{(0)}$, i.e., PMES has policy invariance.
Proof. Our proof follows the practice of the reward shaping literature [9, 11], with the necessary modifications. For notational simplicity, we set $\varphi^{(0)} \equiv 0$. For the optimal policy $\pi^*$, the corresponding Q-function $Q^*$ satisfies the optimal Bellman equation
$$\begin{aligned} Q^*(s, a) &= \mathbb{E}_{s' \sim P_{sa}}\left[R_{\pi^*}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a')\right] \\ &= \mathbb{E}_{s' \sim P_{sa}}\left[\sum_n \rho_{\pi^*}^{(n)}(s)\, R^{(n)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a')\right] \\ &= \mathbb{E}_{s' \sim P_{sa}}\left[R^{(0)}(s, a, s') + \sum_{n=1}^{N} \rho_{\pi^*}^{(n)}(s)\left(\gamma \varphi^{(n)}(s') - \varphi^{(n)}(s)\right) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a')\right]. \end{aligned} \tag{15}$$
One property of ergodic MDPs is that they have unique stationary state distributions. Since the auxiliary MDPs only differ in initial states and reward functions, their induced Markov random processes are the same, and thus they have the same stationary distribution as the main environment, i.e.,
$$p_\pi^{(n)}(s) = p_\pi^{(0)}(s), \quad n = 1, \dots, N, \; s \in \mathcal{S}. \tag{16}$$
Together with Equation (11), we have
$$\rho_\pi^{(n)}(s) = \rho_\pi^{(n)}(s') = \frac{d^{(n)}}{\sum_m d^{(m)}} =: \rho^{(n)}, \quad n = 0, \dots, N, \; s, s' \in \mathcal{S}. \tag{17}$$
Using the above equation, we can rearrange Equation (15),
$$Q^*(s, a) + \sum_{n} \rho^{(n)} \varphi^{(n)}(s) = \mathbb{E}_{s' \sim P_{sa}}\left[R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}}\left(Q^*(s', a') + \sum_{n} \rho^{(n)} \varphi^{(n)}(s')\right)\right]. \tag{18}$$
Defining
$$\tilde{Q}(s, a) := Q^*(s, a) + \sum_{n} \rho^{(n)} \varphi^{(n)}(s), \tag{19}$$
we obtain the optimal Bellman equation for the original MDP $\mathcal{M}^{(0)}$,
$$\tilde{Q}(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[R^{(0)}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} \tilde{Q}(s', a')\right]. \tag{20}$$
The optimal policy $\pi^*$ satisfies
$$\pi^*(s) = \operatorname*{argmax}_{a \in \mathcal{A}} Q^*(s, a) = \operatorname*{argmax}_{a \in \mathcal{A}}\left(\tilde{Q}(s, a) - \sum_{n} \rho^{(n)} \varphi^{(n)}(s)\right) = \operatorname*{argmax}_{a \in \mathcal{A}} \tilde{Q}(s, a). \tag{21}$$
So the optimal policy for the parallel, mixed environments is also an optimal policy for the original main environment. The opposite direction, that every optimal policy of the main environment is also optimal for the mixed environment, can be argued by reversing the above discussion. ∎
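As a numerical sanity check of Proposition 1 (ours, not part of the paper), value iteration on a tiny two-state MDP confirms that adding a potential-based shaping reward, as in Equation (14), leaves the greedy optimal policy unchanged:

```python
# Tiny deterministic MDP used only as a sanity check: 2 states, 2 actions.
S, A, gamma = [0, 1], [0, 1], 0.9
T = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}          # next state
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.5}  # original reward
phi = {0: 3.0, 1: -2.0}                                   # arbitrary potential

def value_iteration(reward, iters=500):
    """Plain value iteration on the tabular Q-function."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(iters):
        Q = {(s, a): reward[(s, a)]
                     + gamma * max(Q[(T[(s, a)], b)] for b in A)
             for s in S for a in A}
    return Q

def greedy(Q):
    """Greedy policy extracted from a Q-function."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

# Potential-based shaping as in Equation (14):
# R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s).
R_shaped = {(s, a): R[(s, a)] + gamma * phi[T[(s, a)]] - phi[s]
            for s in S for a in A}

# The greedy optimal policy is unchanged by the shaping.
assert greedy(value_iteration(R)) == greedy(value_iteration(R_shaped))
```

In the fixed point, the shaped Q-function equals the original one minus $\varphi(s)$, a per-state constant, so the argmax over actions is identical, mirroring the argument in Equation (21).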
From the proposition, the optimal solution is not affected by the distribution of initial states or the shaping rewards under the specified conditions. However, these two factors do have an impact on learning efficiency. At the start of the training process, the agent is able to explore multiple critical parts of the MDP simultaneously thanks to the diverse arrangement of initial states. Thus, the auxiliary environments provide a curriculum, where each one starts at and focuses on a different phase of the original problem during the early training stage. Differently from traditional shaping methods, the PMES algorithm provides an autonomous, adaptive way of shaping. The perceived reward function is shown in Equation (13). In the early exploration stage, unlike the equilibrium state in Equation (17), the ratios $\rho_\pi^{(n)}(s)$ for different states $s$ have not converged and might differ. For example, for a state $s$ near the initial state of the first auxiliary environment $\mathcal{M}^{(1)}$, the ratio $\rho_\pi^{(1)}(s)$ might be near $1$, and the first auxiliary environment contributes most of the visits to state $s$. Then the agent's current perceived reward for the tuple $(s, a, s')$, for action $a$ and the consecutive state $s'$, is influenced mostly by the reward function $R^{(1)}(s, a, s')$ of the auxiliary environment $\mathcal{M}^{(1)}$. As training progresses, the main environments reach state $s$ more frequently. The perceived reward for $(s, a, s')$ will then be influenced more by the original reward structure of the main environment $\mathcal{M}^{(0)}$ than by the auxiliary environment $\mathcal{M}^{(1)}$.
3.2. Application in the mini-game 'Build Marines' of StarCraft II
A screenshot of StarCraft II is shown in Figure 2. The difficulties of the mini-game Build Marines lie in the huge exploration space, the long chain of action dependencies, and the sparse and delayed reward. All of these make it extremely hard for the agent to build even the first marine. This work utilizes the domain knowledge that the building process can be divided into three simpler sub-tasks: building supply depots, building barracks, and building marines. We then encode them into three smaller, tailored environments, each of which corresponds to one key intermediate step. Every new auxiliary environment starts at different initial states and has different rewards. Their state space $\mathcal{S}$ and transition probabilities $T$ between states are kept the same as in the original MDP. In order to satisfy ergodicity, we introduce the destroying action as the inverse of the building action in these auxiliary environments. The detailed configurations are listed in Table 1.
Figure 3(a) illustrates the initial state of SubEnv-A, where only some minerals are provided. This auxiliary environment rewards the agent +1 if one supply depot is built. SubEnv-B instructs the agent to build some barracks when there are some supply depots. Note that its initial state contains some pre-built supply depots, as shown in Figure 3(b). When barracks are present and minerals are adequate, the agent can build marines. Under SubEnv-C, the agent is given a +1 reward when one marine is successfully built. To make the agent focus on this sub-task, the initial state of SubEnv-C is equipped with barracks and supply depots, as shown in Figure 3(c). Since Attack is a valid action for the agent, we observe that important facilities might be destroyed by the angry marines. The burning barrack in Figure 5(a) is such an example. To discourage those harmful actions, a penalty of −1 reward is added in the auxiliary environments when an attack or destroy event happens.

Figure 2: A screenshot of the mini-game Build Marines. Some common unit types, such as the Command Center and Supply Depots, are marked in the figure. The way the agent plays the game resembles a human player: through issuing commands together with mouse clicks and drags on the screen. The game can be divided into three consecutive sub-tasks: building supply depots, building barracks, and building marines.
These sub-environments serve as simple curricula to teach the agent the basic knowledge of building a marine. Two facts are notable. First, the auxiliary environments merely give some rough instructions to bootstrap the agent, far from covering all useful knowledge. For instance, the agent needs to learn by itself how to manage minerals and the order of building. Secondly, knowledge is conveyed indirectly through the design of initial states and rewards. There is no explicit demonstration of how to achieve each sub-task. The agent needs to learn how to build supply depots, barracks, and marines through its own interaction with the environments.

During the training of PMES, every auxiliary environment uses its reward shaping function to reward any successful accomplishment of the corresponding sub-task. As a result, the agent perceives an adaptively changing reward function. We use the reward for building a barrack as an example to illustrate this adaptive reward shaping mechanism. Almost all barrack-building events occur in SubEnv-B, so the overall reward for building a barrack is +1. At first, the agent in the main environment cannot build barracks spontaneously from the initial states. As training proceeds, suppose the agent becomes able to build barracks in the main environment; then the overall reward for building a barrack adaptively
Table 1: The settings of the sub-environments for the task Build Marines. The middle 3 columns give the configuration of the initial states, and the last 4 columns show the rewards for the actions Build/Destroy/Attack on each unit type. Here [a, b] denotes a number randomly drawn in the range a to b. 'Center' is the Command Center in the game.

             | Initial State                 | Build/Destroy/Attack
             | Mineral   Depot    Barrack    | Center    Depot      Barrack    Marine
  OrigEnv    | 50        0        0          | —         —          —          +1/0/0
  SubEnv-A   | [50,200]  0        0          | —         +1/-1/0    —          —
  SubEnv-B   | [50,200]  [5,20]   0          | —         —          +1/-1/0    —
  SubEnv-C   | [50,200]  [5,20]   [1,10]     | 0/0/-1    0/0/-1     0/0/-1     +1/-1/-1
Figure 3: The initial states of the auxiliary environments SubEnv-A, B, and C. The initial state of SubEnv-A is similar to that of the main environment. SubEnv-B starts with some randomly positioned supply depots, while SubEnv-C is in addition initialized with some random barracks.
changes to a lower level. According to Equation (13), the overall reward for building a barrack becomes smaller than +1 because there is no reward for building a barrack in the main environment. We conclude that once the agent learns to accomplish a sub-task starting from the initial states of the main environment, the reward for this sub-task decreases. On the one hand, high rewards guide the agent to master tasks quickly in the beginning; on the other hand, the adaptive reduction of rewards prevents the agent from being trapped by the rewards of the sub-tasks. This adaptive mechanism is demonstrated in Section 4.1.
4. Experiment
In this section, we compare the PMES algorithm with four commonly used baselines: the A2C [43, 44], Reward Shaping, Curriculum Learning, and PLAID [25] algorithms. All of them are implemented on top of the A2C algorithm with the objective function defined in Equation (8). The structure of the A2C model applied to the PMES algorithm is illustrated in Figure 1. The input of the network includes two sources of information: spatial features and non-spatial features. The spatial features include information from the mini-map and the game screen, while the non-spatial features include additional information such as the remaining minerals or the available actions. The network outputs both values and actions. The actions are then passed to PySC2 [43], a Python library for the StarCraft II Learning Environment (SC2LE), which provides an interface for RL agents to interact with the game. We implement the algorithms using TensorFlow [45] and run them on GPUs. Due to the constraint on computation resources, the screen and mini-map resolutions are set to 16 × 16 pixels, and the total number of parallel environments is fixed at 16 for all algorithms.

The environments used are listed in Table 1, including the original environment OrigEnv and 3 auxiliary environments SubEnv-A, SubEnv-B, and SubEnv-C. Their designs have been discussed in Section 3.2. Below, we describe each algorithm.
A2C Algorithm. The vanilla A2C is implemented with 16 parallel original environments. A detailed grid search is performed to find the best hyper-parameters, which are then adopted for all other compared algorithms, including the PMES algorithm. Specifically, we use the Adam optimizer with learning rate 1e-5 and $(\beta_1, \beta_2) = (0.9, 0.999)$. In the loss of Equation (8), the value loss coefficient $c_1$ is 0.25 and the entropy coefficient $c_2$ is 9e-5. The discount factor $\gamma$ is 0.99, and gradients are clipped at norm 1. We also trained the other six mini-games with the same hyper-parameters and observed that they can achieve high scores. This ensures that the hyper-parameters do not overfit to one mini-game.
Reward Shaping Algorithm. We alter the original environment by adding reward shaping functions and denote the resulting environment as ShapingEnv. Specifically, when a supply depot is built (destroyed), the agent is rewarded (deducted) $m$ points; when a barrack is built (destroyed), the agent is rewarded (deducted) $n$ points; and when a marine is created (killed), the agent is rewarded (deducted) $p$ points. As in SubEnv-A, B, and C, the agent is deducted 1 point when the command center, supply depots, or barracks are attacked. The best configuration of $(m, n, p)$ is found by grid search in the experiment, namely $(m, n, p) = (1/9, 1/3, 1)$.
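The shaping scheme of ShapingEnv can be sketched as a single event-to-reward table. This is our illustration: the event names are hypothetical labels for what the environment wrapper would report, while the values (m, n, p) = (1/9, 1/3, 1) are the grid-search results above:

```python
M, N, P = 1 / 9, 1 / 3, 1.0  # best (m, n, p) found by grid search

# Signed shaping reward per event: destruction mirrors construction,
# and attacks on key facilities are penalized by 1 point.
SHAPING = {
    "depot_built": +M,    "depot_destroyed": -M,
    "barrack_built": +N,  "barrack_destroyed": -N,
    "marine_built": +P,   "marine_killed": -P,
    "facility_attacked": -1.0,
}

def shaping_reward(events):
    """Total shaped reward for the events observed in one game step."""
    return sum(SHAPING[e] for e in events)

r = shaping_reward(["depot_built", "barrack_built"])  # 1/9 + 1/3
```

The geometric progression of the rewards (1/9, 1/3, 1) mirrors the dependency chain: each later step in the build order is worth three times the previous one.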
Curriculum Learning Algorithm. We use the three auxiliary environments to form a curriculum before training in the original environment. First, the agent is trained under SubEnv-A to learn the 'build supply depots' sub-task for 50K updates. Then, the model is trained under SubEnv-B to learn 'build barracks' for 50K updates. Next, the model is trained under SubEnv-C to learn 'build marines' for 50K updates. In each stage, the model is initialized with the weights from the previous stage. Lastly, the agent is trained under the original environment OrigEnv for 300K updates.
PLAID Algorithm. The PLAID algorithm improves over curriculum learning by inserting knowledge 'distillation' steps. First, an agent is trained under SubEnv-A and learns the policy $\pi_A$. Then, an agent is trained under SubEnv-B with weights initialized from $\pi_A$; the learned policy is denoted $\pi_B$. After that, a new agent is trained to distill the knowledge of the experts $\pi_A$ and $\pi_B$ using the DAGGER algorithm [25, 46]; the distilled policy is called $\hat{\pi}_B$. Next, an agent initialized with the parameters of $\hat{\pi}_B$ learns the knowledge of SubEnv-C, giving $\pi_C$. Another distillation process is then performed on $\hat{\pi}_B$ and $\pi_C$, giving $\hat{\pi}_C$. Lastly, we train the agent under the original environment OrigEnv, starting from $\hat{\pi}_C$, for 300K updates.
PMES Algorithm. The proposed PMES algorithm is implemented with 16 parallel environments containing the main and auxiliary environments, i.e., OrigEnv, SubEnv-A, SubEnv-B, and SubEnv-C. Since the number of each environment influences the performance of the algorithm, the best configuration of the ratio of sub-environments is found by grid search.
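The allocation of the 16 parallel workers among the four environment types can be sketched as follows; the split used here is a placeholder of ours, since the paper determines the best counts by grid search rather than stating them at this point:

```python
def allocate_envs(counts, total=16):
    """Expand the per-type counts d^(n) into the flat list of parallel
    environments that the single A2C agent steps synchronously."""
    assert sum(counts.values()) == total, "all workers must be assigned"
    return [name for name, d in counts.items() for _ in range(d)]

# Hypothetical split of the 16 workers over the four environment types.
envs = allocate_envs({"OrigEnv": 7, "SubEnv-A": 3, "SubEnv-B": 3, "SubEnv-C": 3})
```

By Equation (11), these counts $d^{(n)}$ directly set the mixing ratios $\rho^{(n)}$ of the perceived reward, which is why the configuration matters enough to warrant a grid search.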
Figure 4: Comparison of the mean episodic rewards of A2C, Curriculum Learning, Reward Shaping, PLAID, and PMES. For Curriculum Learning and PLAID, only their last stages are drawn. The score is the mean episodic reward for building marines in the original environment. Each training curve is the average of 5 runs, and the shadow represents the 90% confidence interval.
4.1. Result Analysis

Figure 4 shows the training processes of the five algorithms, and Table 2 summarizes the maximum score and average score of the five algorithms and the existing work at the end of training, as well as their scores during testing. Furthermore, a game screenshot for each algorithm is shown in Figure 5. We also provide recordings of the game replays for all algorithms in the supplementary video. In
Figure 5: The replay scenes of the 5 algorithms: (a) A2C, (b) Reward Shaping, (c) Curriculum Learning, (d) PLAID, (e) PMES. All snapshots are captured in the last minute of an episode (total duration is 15 min). Important units are labeled in the images. These screenshots illustrate how many marines each algorithm can build, and also intuitively reveal the reasons if an algorithm fails to build marines.
Table 2: Comparison of the scores of the PMES algorithm against the 4 baseline algorithms and other existing work. The score represents the reward for building marines in the original environment. The max (mean) score is the maximum (average) reward over all parallel main environments and all 5 runs. The train results show the performance near the end of training at 300K updates, and the test results are the statistics of running the agent in evaluation mode for 1024 × 5 episodes (1024 episodes per run). K, M, and B mean thousand, million, and billion, respectively.

  Settings              | Main Environment | Sub-Environments | Mean Reward (Train) | Max Reward (Train) | Mean Reward (Test) | Max Reward (Test) | Updates
  A2C                   | OrigEnv          | —                | 8                   | 49                 | 12.9               | 44                | 300K
  Reward Shaping        | ShapingEnv       | —                | 20                  | 120                | 29.5               | 126               | 300K
  Curriculum Learning   | OrigEnv          | A, B, C          | 27                  | 108                | 48.6               | 111               | 300K
  PLAID                 | OrigEnv          | A, B, C          | 57                  | 122                | 51.7               | 123               | 300K
  PMES                  | OrigEnv          | A, B, C          | 85                  | 129                | 91.6               | 134               | 300K
  Grandmaster [43]      | OrigEnv          | —                | 142                 | —                  | —                  | —                 | —
  FullyConv LSTM [43]   | OrigEnv          | —                | 62                  | —                  | —                  | —                 | 600M
  Relational agent [42] | OrigEnv          | —                | 123                 | —                  | —                  | —                 | 10B
the results, the PMES algorithm gets the highest score, and it is also very efficient in that it reaches high scores after only 100K updates. Compared to PMES, both the maximum score and the sample efficiency of the other 4 algorithms are worse. From the results in Table 2, the PMES algorithm reaches 134 points during testing, the best among the five algorithms. This number is very close to the level (144 points) of a StarCraft Grandmaster [43]. The mean scores of A2C, Reward Shaping, and Curriculum Learning are very low, which means that these algorithms do not learn how to build marines well. PLAID has medium performance, with a mean reward of 57. Although the mean score of the PMES algorithm does not exceed that of the Relational agent [42], it takes far fewer updates to reach a comparable level (300 thousand versus 10 billion). Therefore, we conclude that our algorithm is superior to the others.
In the following, we analyze the training process of each algorithm qualitatively according to the change of scores. Recalling Equations (1), (2) and (5), the value of a state directly reflects how many rewards the agent can still gather in the future. Here the agent gets a reward when a task or subtask is done, so the value of the initial state is an indicator of whether the agent has mastered the task. Next, we illustrate the adaptive mechanism of PMES through the changing values of the three types of initial states.
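As a concrete illustration of why the initial-state value works as a mastery indicator, the sketch below computes Monte Carlo returns from the initial state under a sparse reward; the reward sequences and discount factor are illustrative, not the paper's settings.

```python
# Minimal sketch: the value of the initial state estimates the discounted
# return the agent can still collect, so it rises once the task is mastered.
# Reward sequences and gamma here are illustrative, not the paper's settings.

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, i.e. the Monte Carlo target for V(s0)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# An agent that never builds a marine: all-zero rewards, so V(s0) stays 0.
failing_episode = [0.0] * 100
# An agent that completes the task near the episode end: one sparse +1 reward.
succeeding_episode = [0.0] * 99 + [1.0]

v_fail = discounted_return(failing_episode)       # stays 0.0
v_success = discounted_return(succeeding_episode) # positive: task mastered
```

Monitoring such estimates for the initial states of SubEnv-A, B, C is what the value plots in the figures below show.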
The Training Process of the A2C Algorithm. Figure 6 shows the training process of the A2C algorithm in detail. From Figure 6(a), the score is unstable during training and fluctuates around 12. Figure 6(b) shows the changing values of the initial states of SubEnv-A, B, C. The value of the C curve gradually increases as training progresses, which indicates that the agent occasionally receives the reward for building marines. However, the value of the A curve is very low, which means the agent cannot stably build marines from the initial state of the original environment. This is verified intuitively by the game screenshot in Figure 5(a): compared with the screenshot of PMES in Figure 5(e), few supply depots are built by the A2C agent. Thus, the reason for its failure is that the agent does not even learn the first step of building marines. Another interesting phenomenon in Figure 5(a) is that barracks might be destroyed by the randomly built marines, since there is no attack punishment.

Figure 6: The training process of the A2C algorithm. Note that only the original environments are involved. (a) The changing of scores; the score is the reward for building marines, plotted against training updates (K). (b) The changing of the values of the initial states (see Figure 3) of SubEnv-A, B, C. Each data point is the average value of 704 sampled initial states in the same environment. To clearly illustrate the trend, we fit the curves using polynomials within a sliding window, denoted as A (B, C) Value Trend.
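The polynomial sliding-window smoothing used for the trend curves in the value plots can be sketched as follows; the window size and polynomial degree here are illustrative choices, not necessarily those used for the figures.

```python
import numpy as np

def sliding_poly_trend(values, window=9, degree=2):
    """Fit a low-degree polynomial inside each sliding window and evaluate
    it at the window centre, yielding a smoothed trend curve."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    trend = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        x = np.arange(lo, hi)
        # Clamp the degree near the edges where fewer points are available.
        coeffs = np.polyfit(x, values[lo:hi], min(degree, len(x) - 1))
        trend[i] = np.polyval(coeffs, i)
    return trend

# A noisy but rising "value" series, as in the value plots.
noisy = np.linspace(0.0, 1.0, 50) + 0.05 * np.sin(np.arange(50))
smooth = sliding_poly_trend(noisy)
```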
Figure 7: The training process of the Reward Shaping algorithm. (a) The changing of scores and rewards. The rewards of the Reward Shaping algorithm have two components: Main Rewards is the reward for building marines; Shaping Rewards is the auxiliary reward, e.g. for building supply depots and barracks; All Rewards is the actual reward used in training, which equals the sum of Main Rewards and Shaping Rewards. (b) The changing of values during the training process of the Reward Shaping algorithm. The meanings of the A, B, C curves are the same as in Figure 6.
The Training Process of the Reward Shaping Algorithm. From the screenshot in Figure 5(b), we observe that the agent devotes most resources to building supply depots, a situation that is also reflected in the changing rewards in Figure 7(a). There are 3 curves: the main rewards curve is the reward for building marines, inherited from the original environment; the shaping rewards curve is the sum of the auxiliary rewards for actions such as building/destroying/attacking supply depots and barracks (excluding the action of building marines); and their sum, the all rewards curve, is the actual reward used in training. The shaping reward of the Reward Shaping algorithm stabilizes around 5 during training, which is very large considering that building one supply depot only gains a +1/9 reward. As a result, the agent does not have enough resources to build barracks and marines. In Figure 7(b), the values of the A and B curves stay almost unchanged during training. Without the adaptive shaping mechanism, the agent is allured by the easy, small immediate reward of building supply depots, and fails to explore for the long-term, large reward of building marines. Thus, the reasons for the poor performance of the Reward Shaping algorithm are its short-sighted strategy and bad allocation of resources.
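A minimal sketch of this static reward decomposition follows; the +1/9 supply-depot reward follows the text, while the event names and the barracks magnitude are our own illustration.

```python
# Sketch of the static decomposition used by the Reward Shaping baseline:
# the training reward is the sum of the sparse main reward (building
# marines) and fixed auxiliary shaping rewards. The +1/9 per supply depot
# follows the text; the event encoding and barracks value are assumptions.

SHAPING_REWARDS = {
    "build_supply_depot": 1.0 / 9.0,
    "build_barracks": 1.0 / 9.0,   # assumed magnitude
}
MAIN_REWARDS = {
    "build_marine": 1.0,
}

def shaped_reward(events):
    """Return (all, main, shaping) rewards for a list of game events."""
    main = sum(MAIN_REWARDS.get(e, 0.0) for e in events)
    shaping = sum(SHAPING_REWARDS.get(e, 0.0) for e in events)
    return main + shaping, main, shaping

# Nine depots earn as much shaping reward as one marine earns main reward,
# which is how the agent gets allured into depot-building.
total, main, shaping = shaped_reward(
    ["build_supply_depot"] * 9 + ["build_marine"])
```

Because these weights never change during training, the small immediate shaping reward keeps competing with the sparse main reward, which is the failure mode analysed above.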
The Training Process of the Curriculum Learning Algorithm. The training process is highly unstable. In 5 runs of the algorithm, the agent built about 70 marines only once; in the other four runs, the agent built only 5 to 10 marines. This implies that the Curriculum Learning algorithm cannot stably instruct the agent to learn the steps of building marines in most circumstances. Figure 8(a) shows the training process of the Curriculum Learning algorithm at each stage. The process is divided into 4 stages according to the training environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The scores represent the rewards under the corresponding sub-environments. For example, the score of the model under SubEnv-A is the sum of the rewards for building a supply depot and the penalties for destroying it. From the figure, we can see that the score of each stage reaches a relatively high level, which indicates that the sub-tasks are easy to learn. But the score under OrigEnv is only around 5 points, which means that learning in the early stages has little effect on building marines. Further, the values of the initial states of the three sub-environments in Figure 8(b) differ greatly, which implies that under a new environment the agent quickly forgets what it learned in the previous curriculum. In the screenshot shown in Figure 5(c), there is a very limited number of supply depots and barracks, which confirms the catastrophic forgetting problem: the agent forgets how to build supply depots and barracks.
The Training Process of the PLAID Algorithm. Different from Curriculum Learning, the PLAID algorithm contains 6 stages, denoted SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv, respectively. The new stages are the two 'distillation' processes. In Distill-AB, an agent is trained to learn knowledge from two expert policies which were previously trained in stages SubEnv-A and SubEnv-B. After the process of Distill-AB, the agent masters the knowledge of both SubEnv-A and SubEnv-B, as indicated by the reward curves shown in Figure 9(a); there are two reward curves in each distillation stage, corresponding to the performance of the agent under the two related sub-environments. At the end of Distill-BC, the agent remembers how to build barracks (indicated by the reward curve of SubEnv-B) and how to build marines (indicated by the reward curve of SubEnv-C). The distillation process mitigates the catastrophic forgetting problem of transfer learning, and the agent is able to build marines at the very beginning of stage OrigEnv. Also, in Figure 9(b), the C value is lower than the B value during Distill-AB, because the agent has not yet learned the knowledge of SubEnv-C. In the Distill-BC stage, the C value exceeds the B value, which shows that the agent has learned how to build marines after being trained under SubEnv-C. In the end, the agent achieves a better score than the agent of the Curriculum Learning algorithm, so distillation is effective at retaining knowledge. The screenshot in Figure 5(d) gives an explanation: the building of marines is restricted by the number of barracks, so the agent has not learned how to better allocate resources. Further performance improvements for the PLAID algorithm might require more training updates. However, unlike the PMES algorithm, PLAID has no adaptive reward mechanism and does not allocate resources efficiently.

Figure 8: The training process of the Curriculum Learning algorithm. (a) The rewards of Curriculum Learning. It contains 4 stages under different environments: SubEnv-A, SubEnv-B, SubEnv-C, and OrigEnv. The score of each stage is the reward under its corresponding environment. (b) The changing of values during the training process of Curriculum Learning. The meanings of the A, B, C curves are the same as in Figure 6.
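A toy sketch of such a distillation step follows, with fixed categorical action distributions standing in for the expert and student networks; a real implementation would regress network outputs, e.g. by minimising a cross-entropy or KL loss with gradient descent.

```python
import numpy as np

# Toy sketch of a PLAID-style distillation stage: a student policy is
# regressed onto the action distributions of two expert policies trained on
# different sub-environments. The "policies" here are fixed categorical
# distributions and the update is plain gradient descent on a squared
# error, standing in for the real network-based distillation.

def distill(student, experts, lr=0.5, steps=200):
    """Move the student distribution towards the averaged expert target."""
    target = np.mean(experts, axis=0)
    for _ in range(steps):
        student = student + lr * (target - student)  # grad of 0.5*||t-s||^2
    return student / student.sum()                   # keep it a distribution

expert_a = np.array([0.8, 0.1, 0.1])  # e.g. prefers building supply depots
expert_b = np.array([0.1, 0.8, 0.1])  # e.g. prefers building barracks
student = distill(np.array([1 / 3] * 3), [expert_a, expert_b])
```

After distillation the student keeps probability mass on both experts' preferred actions, which is the "remembering" effect described above.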
Figure 9: The training process of the PLAID algorithm. (a) The rewards of the PLAID algorithm. It contains 6 stages under different environments: SubEnv-A, SubEnv-B, Distill-AB, SubEnv-C, Distill-BC, and OrigEnv. The score of each stage is the reward under its corresponding environment; note that the distillation stages involve two environments at the same time. (b) The changing of values during the training of PLAID. The meanings of the A, B, C curves are the same as in Figure 6.
The Training Process of the PMES Algorithm. At the beginning of the training process shown in Figure 10, the values of A, B, and C all increase concurrently, which shows the efficiency of parallelism in learning the sub-tasks. Referring to Figure 10(a), after 60K updates the agent has learned how to build marines in the main environment from scratch, so the overall perceived rewards for building supply depots and barracks are reduced by the adaptive shaping mechanism. As a result, the values of curves A and B decrease after 60K. In contrast, the value of curve C is mainly affected by the rewards for building marines, and is relatively stable during training. In particular, comparing the changing values of the PMES algorithm in Figure 10 with those of the other algorithms in Figures 6(b), 7(b) and 8(b), we find that adaptive reward shaping is unique to the PMES algorithm. In the screenshot shown in Figure 5(e), the proportion of supply depots and barracks is well balanced, and many marines have been built. This implies that the agent has learned to build marines step by step, and can allocate resources reasonably based on the mechanism of adaptive reward shaping. Therefore, the PMES algorithm has a clear performance advantage over the other algorithms.
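The adaptive shaping behaviour described above can be sketched as follows; the specific decay rule is an illustrative assumption, not the paper's exact formula.

```python
# Sketch of PMES-style adaptive shaping: the weight on auxiliary sub-task
# rewards decays as the agent starts earning the main (marine) reward, so
# early on the sub-environments guide exploration and later the agent is
# not allured by small intermediate rewards. The decay rule below is an
# illustrative choice, not the paper's exact formula.

def adaptive_weight(mean_main_reward, scale=10.0):
    """Shaping weight in (0, 1] that shrinks as the main reward grows."""
    return 1.0 / (1.0 + mean_main_reward / scale)

def adaptive_shaped_reward(main_r, aux_r, mean_main_reward):
    return main_r + adaptive_weight(mean_main_reward) * aux_r

# Early in training the auxiliary reward passes through at full weight;
# once the agent earns a high mean main reward, the same auxiliary event
# contributes only a small fraction.
early = adaptive_shaped_reward(main_r=0.0, aux_r=1.0, mean_main_reward=0.0)
late = adaptive_shaped_reward(main_r=1.0, aux_r=1.0, mean_main_reward=80.0)
```

This is the qualitative behaviour seen in Figure 10(b): the A and B values rise while the sub-tasks are being learned, then fall once the main reward takes over.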
Figure 10: The training process of the PMES algorithm. (a) The rewards of the PMES algorithm. (b) The changing of values during the training of the PMES algorithm. The meanings of the A, B, C curves are the same as in Figure 6.
4.2. Ablation Study

In this section, we demonstrate the importance of domain knowledge in the design of auxiliary environments, and study the design choices in the PMES algorithm.
Importance of Domain Knowledge. Recall that in the design of the sub-environments, we consider not only the build action for buildings and marines, but also the destroy and attack actions. To demonstrate the importance of this domain knowledge, we design two comparative experiments that gradually remove domain knowledge from the environments used in the PMES algorithm. The first one is called the PMES_RP model. It consists of 3 sub-environments SubEnv-A, SubEnv-B, SubEnv-C** and the main environment OrigEnv, where SubEnv-C** removes the attack action from the SubEnv-C sub-environment. The other model, the PMES_R model, removes both the attack and destroy actions from the 3 sub-environments and the main environment; the resulting environments are SubEnv-A*, SubEnv-B*, SubEnv-C* and OrigEnv. The settings of the new sub-environments are listed in Table 3, and their results are shown in Figure 11. Note that the mean score of the PMES_RP model is far less than the score of the original PMES model after removing the attack action. Further, the PMES_R model, which removes both attack and destroy, gets a lower score than PMES_RP and also converges more slowly. Therefore, adding more domain knowledge into the model can greatly improve its performance.
Table 3: Configuration of the extra auxiliary environments. Their initial states are omitted since they are the same as in Table 1. The 3 numbers in each cell correspond to the rewards for the actions Build/Destroy/Attack. 'Center' is the Command center in the game.

| Setting | Center | Depot | Barrack | Marine |
| SubEnv-A* | — | +1/0/0 | — | — |
| SubEnv-B* | — | — | +1/0/0 | — |
| SubEnv-C* | — | — | — | +1/0/0 |
| SubEnv-C** | — | — | — | +1/-1/0 |
Table 4: Comparison of using one sub-environment (A, B, or C) in the PMES algorithm. Each setting is trained for 300K updates with 2 runs. The four environment columns give the number of paralleled instances of each environment in the configuration.

| Setting | OrigEnv | SubEnv-A | SubEnv-B | SubEnv-C | Mean Reward | Max Reward |
| Setting A | 10 | 6 | 0 | 0 | 5 | 10 |
| Setting B | 10 | 0 | 6 | 0 | 45 | 69 |
| Setting C | 10 | 0 | 0 | 6 | 32 | 65 |
The Sensitivity to the Number of Environments. In this work, the model is trained with 16 paralleled environments. We analyze the impact of the allocation of these environments on the performance of the model, as shown in Figure 12. There are some differences between the results of the different settings, so the performance of the PMES algorithm is related to the allocation of the different environments. Nonetheless, even under an imperfect configuration of environments, the performance of the PMES algorithm is better than that of all the compared baseline algorithms (note that the experiments in Figure 12 are trained for only 75K updates).
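A configuration string such as '10-2-2-2' can be expanded into the 16 paralleled environment instances as sketched below; the helper and encoding are our own illustration, not the paper's code.

```python
# Sketch of how a PMES configuration string such as "10-2-2-2" could be
# expanded into the 16 parallel environment instances used for synchronous
# A2C rollouts. The helper name and encoding are our own illustration.

ENV_TYPES = ["OrigEnv", "SubEnv-A", "SubEnv-B", "SubEnv-C"]

def expand_configuration(config, total=16):
    """Turn 'nO-nA-nB-nC' into a flat list of environment type names."""
    counts = [int(c) for c in config.split("-")]
    assert len(counts) == len(ENV_TYPES) and sum(counts) == total
    envs = []
    for env_type, n in zip(ENV_TYPES, counts):
        envs.extend([env_type] * n)
    return envs

envs = expand_configuration("10-2-2-2")
```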
Figure 11: Three settings encode gradually more knowledge into the sub-environment design. The PMES_R model rewards the building actions; the PMES_RP model also punishes the destroying consequences; the PMES model in addition discourages the attack actions. Note that the more knowledge is encoded, the better the performance obtained. 5 runs are performed for each setting.
Figure 12: The effect of environment configurations (8-2-4-2, 9-1-4-2, 9-2-3-2, 10-1-3-2, 10-2-2-2). The four numbers of a configuration represent the numbers of the four environments (OrigEnv, SubEnv-A, SubEnv-B and SubEnv-C). Each experiment is trained for 75K updates.
The Role of Auxiliary Environments. In order to show the importance of each auxiliary environment, we single out SubEnv-A, SubEnv-B, and SubEnv-C respectively, and use only one of them as the auxiliary environment. The results are shown in Table 4, where the setting of the hyper-parameters is the same as before. By comparing the results of the different settings, we conclude that introducing part of the auxiliary environments into the main environment can effectively guide the agent to learn. Among them, SubEnv-B and SubEnv-C are more effective than SubEnv-A. This is reasonable since SubEnv-A would misguide the agent into spending resources on building supply depots rather than on building marines.
5. Conclusion
In this paper, we propose a novel deep reinforcement learning algorithm, called the PMES algorithm, to solve the problems of complex multi-step tasks. It introduces auxiliary environments alongside the main environment to effectively convey domain knowledge to the learning agent. In the PMES algorithm, the agent interacts with different environments at the same time, which speeds up the training process. More importantly, the PMES algorithm has an adaptive shaping mechanism, where the overall reward function adjusts adaptively as training progresses. Not only can the mechanism of adaptive reward shaping guide the agent to find the way to complete the task, but it can also prevent the agent from being trapped in short-sighted, sub-optimal policies. Experimental results on the StarCraft II mini-game demonstrate the effectiveness of the proposed PMES algorithm, which is much more effective and efficient than the traditional A2C, Reward Shaping, Curriculum Learning, and PLAID algorithms.
Acknowledgments

We would like to thank the members of The Killers team and their instructors Lisen Mu and Jingchu Liu in DeeCamp 2018 for the insightful discussions. This work was supported by three grants from the National Natural Science Foundation of China (No. 61976174, No. 61877049, and No. 11671317), and in part by the program of the China Scholarship Council (No. 201906280201).
Conict of Interest
The authors declare that there is no conict of interest regarding the publication of this article.
References

[1] S. Li, L. Ding, H. Gao, C. Chen, Z. Liu, Z. Deng, Adaptive neural network tracking control-based reinforcement learning for wheeled mobile robots with skidding and slipping, Neurocomputing 283 (2018) 20–30.
[2] F. Li, Q. Jiang, S. Zhang, M. Wei, R. Song, Robot skill acquisition in assembly process using deep reinforcement learning, Neurocomputing 345 (2019) 92–102.
[3] M. Wulfmeier, D. Z. Wang, I. Posner, Watch this: Scalable cost-function learning for path planning in urban environments, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 2089–2095.
[4] A. E. Sallab, M. Abdou, E. Perot, S. Yogamani, Deep reinforcement learning framework for autonomous driving, Electronic Imaging 2017 (2017) 70–76.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484.
[6] H. Jiang, H. Zhang, K. Zhang, X. Cui, Data-driven adaptive dynamic programming schemes for non-zero-sum games of unknown discrete-time nonlinear systems, Neurocomputing 275 (2018) 649–.
[7] H. Jiang, H. Zhang, X. Xie, J. Han, Neural-network-based learning algorithms for cooperative games of discrete-time multi-player systems with control constraints via adaptive dynamic programming, Neurocomputing 344 (2019) 13–19.
[8] Z. Wang, L. Liu, H. Zhang, G. Xiao, Fault-tolerant controller design for a class of nonlinear MIMO discrete-time systems via online reinforcement learning algorithm, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (2015) 611–622.
[9] A. Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in: ICML, volume 99, 1999, pp. 278–287.
[10] G. Konidaris, A. Barto, Autonomous shaping: Knowledge transfer in reinforcement learning, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 489–496.
[11] E. Wiewiora, G. W. Cottrell, C. Elkan, Principled methods for advising reinforcement learning agents, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 792–799.
[12] A. Laud, G. DeJong, The influence of reward on the speed of reinforcement learning: An analysis of shaping, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 440–447.
[13] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, J. T. Springenberg, Learning by playing - solving sparse reward tasks from scratch, in: Proceedings of the 35th International Conference on Machine Learning, volume 80, 2018, pp. 4344–4353.
[14] B. Marthi, Automatic shaping and decomposition of reward functions, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 601–608.
[15] S. M. Devlin, D. Kudenko, Dynamic potential-based reward shaping, in: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, 2012.
[16] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, S. Hochreiter, RUDDER: Return decomposition for delayed rewards, in: Advances in Neural Information Processing Systems, 2019, pp. 13544–13555.
[17] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, P. Abbeel, Overcoming exploration in reinforcement learning with demonstrations, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 6292–6299.
[18] S. Narvekar, J. Sinapov, M. Leonetti, P. Stone, Source task creation for curriculum learning, in: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, IFAAMAS, 2016, pp. 566–574.
[19] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, P. Abbeel, Reverse curriculum generation for reinforcement learning, in: Conference on Robot Learning, 2017, pp. 482–495.
[20] M. Svetlik, M. Leonetti, J. Sinapov, R. Shah, N. Walker, P. Stone, Automatic curriculum graph generation for reinforcement learning agents, in: The Conference on Artificial Intelligence, AAAI, 2017, pp. 2590–2596.
[21] W. M. Czarnecki, S. M. Jayakumar, M. Jaderberg, L. Hasenclever, Y. W. Teh, N. Heess, S. Osindero, R. Pascanu, Mix & match agent curricula for reinforcement learning, in: International Conference on Machine Learning, 2018, pp. 1095–1103.
[22] Z. Ren, D. Dong, H. Li, C. Chen, Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 29 (2018).
[23] N. Jiang, S. Jin, C. Zhang, Hierarchical automatic curriculum learning: Converting a sparse reward navigation task into dense reward, Neurocomputing (2019).
[24] E. Parisotto, J. L. Ba, R. Salakhutdinov, Actor-mimic: Deep multitask and transfer reinforcement learning, in: 4th International Conference on Learning Representations, 2016.
[25] G. Berseth, C. Xie, P. Cernek, M. Van de Panne, Progressive reinforcement learning with distillation for multi-skilled motion control, in: 6th International Conference on Learning Representations, 2018.
[26] F. Tanaka, M. Yamamura, Multitask reinforcement learning on the distribution of MDPs, in: 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, volume 3, IEEE, 2003, pp. 1108–1113.
[27] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: A hierarchical Bayesian approach, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.
[28] M. Snel, S. Whiteson, Multi-task evolutionary shaping without pre-specified representations, in: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, ACM, 2010, pp. 1031–1038.
[29] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: Advances in Neural Information Processing Systems, 2014, pp. 819–827.
[30] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, R. Pascanu, Distral: Robust multitask reinforcement learning, in: Advances in Neural Information Processing Systems, 2017, pp. 4496–4506.
[31] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, K. Kavukcuoglu, Reinforcement learning with unsupervised auxiliary tasks, in: 5th International Conference on Learning Representations, 2017.
[32] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., Learning to navigate in complex environments, in: 5th International Conference on Learning Representations, 2017.
[33] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, N. Freitas, The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously, in: Conference on Robot Learning, 2017, pp. 207–216.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, 2016, pp. 1928–1937.
[35] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. G. Bellemare, R. Munos, The Reactor: A fast and sample-efficient actor-critic agent for reinforcement learning, in: International Conference on Learning Representations, ICLR 2018, 2018.
[36] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, in: International Conference on Machine Learning, 2018, pp. 1406–1415.
[37] D. Lee, H. Tang, J. O. Zhang, H. Xu, T. Darrell, P. Abbeel, Modular architecture for StarCraft II with deep reinforcement learning, in: Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.
[38] Z. Pang, R. Liu, Z. Meng, Y. Zhang, Y. Yu, T. Lu, On reinforcement learning for full-length game of StarCraft, in: The Conference on Artificial Intelligence, AAAI, 2019, pp. 4691–4698.
[39] T. Rashid, M. Samvelyan, C. S. Witt, G. Farquhar, J. Foerster, S. Whiteson, QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, in: International Conference on Machine Learning, 2018, pp. 4292–4301.
[40] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al., Deep reinforcement learning with relational inductive biases, in: 7th International Conference on Learning Representations, 2019.
[41] J. Gehring, D. Ju, V. Mella, D. Gant, N. Usunier, G. Synnaeve, High-level strategy selection under partial observability in StarCraft: Brood War, in: NeurIPS Workshop on Reinforcement Learning under Partial Observability, 2018.
[42] V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert, T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, P. W. Battaglia, Deep reinforcement learning with relational inductive biases, in: International Conference on Learning Representations, ICLR 2019, 2019.
[43] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al., StarCraft II: A new challenge for reinforcement learning, arXiv preprint arXiv:1708.04782 (2017).
[44] R. Ring, T. Matiisen, Replicating DeepMind StarCraft II reinforcement learning benchmark with actor-critic methods (2018).
[45] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[46] S. Ross, G. J. Gordon, J. A. Bagnell, No-regret reductions for imitation learning and structured prediction, in: AISTATS, 2011.
... As the Alpha Go beats the human professional player in 2015 [10], deep reinforcement learning (DRL) algorithm has received widespread attention from academia and industry [11][12][13][14][15], which combines the advantages of deep learning and reinforcement learning (RL). The stock trading problem can be regarded as a Markov Decision Process (MDP), and the optimal strategy can be solved by dynamic programming under the DRL framework. ...
... The stock trading problem can be regarded as a Markov Decision Process (MDP), and the optimal strategy can be solved by dynamic programming under the DRL framework. 15 Thus, some researchers have built some automated trading strategies using the DRL algorithm. For example, Neuneier et al. [16] firstly use RL algorithm in Foreign Exchange, getting better performance than the previous supervised learning algorithms. ...
Full-text available
In recent years, deep reinforcement learning (DRL) algorithm has been widely used in algorithmic trading. Many fully automated trading systems or strategies have been built using DRL agents, which integrate price prediction and trading signal generation in one system. However, the previous agents extract the current state from the market data without considering the long-term market historical trend when making decisions. Besides, plenty of related and useful information has not been considered. To address these two problems, we propose a novel model named Parallel multi-module deep reinforcement learning (PMMRL) algorithm. Here, two parallel modules are used to extract and encode the feature: one module employing Fully Connected (FC) layers is used to learn the current state from the market data of the traded stock and the fundamental data of the issuing company; another module using Long Short-Term Memory (LSTM) layers aims to detect the long-term historical trend of the market. The proposed model can extract features from the whole environment by the above two modules simultaneously, taking the advantages of both LSTM and FC layers. Extensive experiments on China stock market illustrate that the proposed PMMRL algorithm achieves a higher profit and a lower drawdown than several state-of-the-art algorithms.
... Another approach was to devise a way that the agent could be informed about the importance of the steps even though there is no corresponding reward. One way to achieve this is by incorporating sub-environments, which can be considered as intermediate steps [29]. The agent is trained to learn under multiple parallel environments by automatically assigning rewards from these sub-environments. ...
Full-text available
Sharing prior knowledge across multiple robotic manipulation tasks is a challenging research topic. Although the state-of-the-art deep reinforcement learning (DRL) algorithms have shown immense success in single robotic tasks, it is still challenging to extend these algorithms to be applied directly to resolve multi-task manipulation problems. This is mostly due to the problems associated with efficient exploration in high-dimensional state and continuous action spaces. Furthermore, in multi-task scenarios, the problem of sparse reward and sample inefficiency of DRL algorithms is exacerbated. Therefore, we propose a method to increase the sample efficiency of the soft actor-critic (SAC) algorithm and extend it to a multi-task setting. The agent learns a prior policy from two structurally similar tasks and adapts the policy to a target task. We propose a prioritized hindsight with dual experience replay to improve the data storage and sampling technique, which, in turn, assists the agent in performing structured exploration that leads to sample efficiency. The proposed method separates the experience replay buffer into two buffers to contain real trajectories and hindsight trajectories to reduce the bias introduced by the hindsight trajectories in the buffer. Moreover, we utilize high-reward transitions from previous tasks to assist the network in easily adapting to the new task. We demonstrate the proposed method based on several manipulation tasks using a 7-DoF robotic arm in RLBench. The experimental results show that the proposed method outperforms vanilla SAC in both a single-task setting and multi-task setting.
In this paper, we focus on tasks that require multi-step motions to achieve the goal (defined as a ‘multi-step task’), and we describe a method for a robot to automatically achieve the final goal of a multi-step task. We proposed a method based on reinforcement learning and ‘Teaching by Showing’ for multi-step tasks. A robot can learn how to complete a task automatically by referring to the motions of a human operator, even if the task consists of multi-step motions. Because a human operator is not required to operate the robot during the learning process, we believe that our proposed method can reduce the burden on the human operator. Finally, we conducted experiments to validate the effectiveness of the proposed method and compared it to a conventional reinforcement learning method.
The volatile and dynamic nature of stock prices poses a significant research challenge for stock markets, and any other financial sector, in designing accurate and profitable trading strategies under all market conditions. To meet this challenge, computer-aided stock trading techniques have grown in prominence in recent decades owing to their ability to analyze stock market situations rapidly and accurately. In the existing literature, however, trading agents employ the historical and present trends of stock prices as the observed state when making trading decisions, without taking the long-term future pattern of stock prices into account. Therefore, in this study, we propose a novel decision support system for automated stock trading based on deep reinforcement learning that observes both past and future trends of stock prices, whether single- or multi-step ahead, as the observed state to make optimal trading decisions of buying, selling, and holding stocks. More specifically, at every time step, future trends are monitored concurrently using a forecasting network whose output is concatenated with the past trends of stock prices. The concatenated vectors are subsequently supplied to the DRL agent as the observation state. In addition, the suggested forecasting network is built on a Gated Recurrent Unit (GRU); the GRU-based agent captures more informative and inherent aspects of time-series financial data. The suggested decision support system has been tested on several stocks, such as Tesla, IBM, Amazon, CSCO, and Chinese stocks, as well as equity indices, i.e., the SSE Composite Index, NIFTY 50 Index, and US Commodity Index Fund, and has achieved encouraging profit values while trading.
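A rough sketch of the observation construction described above: the agent's state concatenates the recent price window with a multi-step-ahead forecast. The function names are assumptions for illustration, and a naive drift extrapolator stands in for the GRU forecasting network.

```python
import numpy as np

def build_observation(past_prices, forecaster, horizon=3):
    """Concatenate the past price window with a multi-step-ahead forecast
    (sketch of the observation state described in the abstract; the
    `forecaster` interface -- window -> `horizon` predicted prices -- is
    a hypothetical stand-in for the GRU network)."""
    past = np.asarray(past_prices, dtype=float)
    future = np.asarray(forecaster(past, horizon), dtype=float)
    return np.concatenate([past, future])

def naive_forecaster(window, horizon):
    # Illustrative stand-in for the GRU: extrapolate the last price drift.
    drift = window[-1] - window[-2]
    return [window[-1] + drift * (i + 1) for i in range(horizon)]
```

Any trained sequence model with the same window-in, horizon-out interface could be dropped in for `naive_forecaster` without changing the agent-facing observation shape.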
Financial portfolio management is the reallocation of assets into financial products with the goal of maximizing profit under a certain level of risk. Since AlphaGo defeated human professional players, deep reinforcement learning (DRL) algorithms have been widely used in various fields, including quantitative trading. The multi-agent system is a relatively new research branch in DRL, and its performance is better than that of a single agent in most cases. In this paper, we propose a novel multi-agent deep reinforcement learning algorithm with trend consistency regularization (TC-MARL) to find the optimal portfolio. We divide the trend of the stocks in a portfolio into two categories and train two different agents to learn the optimal trading strategy under these two stock trends. First, we build a trend consistency (TC) factor to recognize whether the trends of the stocks in a portfolio are consistent: when they are, the factor is defined as 1; when they are not, it is defined as -1. Based on this factor, a novel regularization related to the portfolio weights, named TC regularization, is proposed and added to the reward function, with the TC factor value used as the sign of the regularization term. In this way, two agents with different reward functions are constructed, sharing the same policy model and value model. The proposed TC-MARL algorithm then dynamically switches between the two trained agents to find the optimal portfolio strategy according to the market state. Extensive experimental results on the Chinese stock market show the effectiveness of the proposed algorithm.
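The TC factor described above admits a one-line sketch: +1 when every stock in the portfolio moved in the same direction over the step, -1 otherwise. The per-step return input, the treatment of a zero return as a downward move, and the `shaped_reward` form of the regularization are illustrative assumptions, not the paper's exact definitions.

```python
def tc_factor(step_returns):
    """Trend-consistency factor sketch: 1 if all stocks in the portfolio
    moved in the same direction this step, -1 otherwise. Zero returns are
    (arbitrarily) counted as downward moves."""
    signs = {1 if r > 0 else -1 for r in step_returns}
    return 1 if len(signs) == 1 else -1

def shaped_reward(base_reward, step_returns, weight_penalty, lam=0.1):
    """TC-regularization sketch: the TC factor supplies the sign of the
    weight-related regularization term added to the reward."""
    return base_reward + tc_factor(step_returns) * lam * weight_penalty
```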
Reinforcement Learning (RL) is a subfield of Artificial Intelligence (AI) that deals with agents navigating an environment with the goal of maximizing total reward. Games are good environments for testing RL algorithms, as they have simple rules and clear reward signals. The theoretical part of this thesis explores some of the popular classical and modern RL approaches, including the use of an Artificial Neural Network (ANN) as a function approximator inside the AI agent. In the practical part of the thesis, we implement the Advantage Actor-Critic RL algorithm and replicate the ANN-based agent described in [Vinyals et al., 2017]. We reproduce the state-of-the-art results in the modern video game StarCraft II, a game considered the next milestone in AI after the fall of chess and Go.
In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
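QMIX's monotonicity constraint can be illustrated in a few lines of NumPy: forcing the mixing weights to be non-negative guarantees that the joint value never decreases when any per-agent value increases. In the actual method the weights are produced by hypernetworks conditioned on the global state and the hidden nonlinearity is an ELU; passing the weights in directly and using ReLU are simplifications for this sketch.

```python
import numpy as np

def qmix_mix(agent_qs, w1, b1, w2, b2):
    """Minimal sketch of QMIX's monotonic mixing network: per-agent
    Q-values are combined through a two-layer net whose weights are
    forced non-negative (via abs), so the joint value Q_tot is monotonic
    in each agent's Q-value. Illustrative, not the paper's architecture."""
    hidden = np.maximum(agent_qs @ np.abs(w1) + b1, 0)  # non-negative weights + ReLU
    return float(hidden @ np.abs(w2) + b2)              # non-negative output weights
```

Because the mapping is monotonic, each agent can greedily maximize its own Q-value at execution time while remaining consistent with maximizing the centralized Q_tot.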
We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards. Source code is available at and demonstration videos at
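The core invariance in RUDDER, return-equivalent reward redistribution, can be sketched directly: reallocate the episode return across timesteps while preserving its total, so optimal policies are unchanged. Allocating in proportion to given per-step contribution scores is an illustrative stand-in for RUDDER's contribution analysis of a learned return predictor.

```python
def redistribute(rewards, contributions):
    """Return-equivalent redistribution sketch: spread the episode return
    over timesteps in proportion to per-step contribution scores. The sum
    of the redistributed rewards equals the original return by
    construction, which is the return-equivalence property."""
    episode_return = sum(rewards)
    total_contribution = sum(contributions)
    return [episode_return * c / total_contribution for c in contributions]
```

With a delayed terminal reward, the redistributed signal is dense, so the Q-value at each step can be estimated from the immediate reward alone rather than a long chain of future rewards.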
StarCraft II poses a grand challenge for reinforcement learning. The main difficulties include a huge state space, a varying action space, long horizons, etc. In this paper, we investigate a set of reinforcement learning techniques for the full-length game of StarCraft II. We investigate a hierarchical approach, where the hierarchy involves two levels of abstraction. One is the macro-actions extracted from an expert's demonstration trajectories, which reduce the action space by an order of magnitude yet remain effective. The other is a two-layer hierarchical architecture, which is modular and easy to scale. We also investigate a curriculum transfer learning approach that trains the agent from the simplest opponent to harder ones. On a 64×64 map and using restrictive units, we train the agent on a single machine with 4 GPUs and 48 CPU threads. We achieve a winning rate of more than 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve over a 93% winning rate against the most difficult non-cheating built-in AI (level-7) within days. We hope this study can shed some light on future research in large-scale reinforcement learning.
Mastering sparse-reward or long-horizon tasks is critical but challenging in reinforcement learning. To tackle this problem, we propose a hierarchical automatic curriculum learning framework (HACL), which intrinsically motivates the agent to hierarchically and progressively explore environments. The agent is equipped with a target area during training. As the target area progressively grows, the agent learns to explore from near to far, in a curriculum fashion. The pseudo target-achieving reward converts the sparse reward into a dense reward, alleviating the long-horizon difficulty. The whole system makes hierarchical decisions, in which a high-level conductor travels through different targets and a low-level executor operates in the original action space to complete the instructions given by the high-level conductor. Unlike many existing works that manually set curriculum training phases, in HACL the entire curriculum training process is automated and adapted to the agent's current exploration capability. Extensive experiments on three sparse-reward tasks, a long-horizon stochastic chain, a grid maze, and the challenging Atari game Montezuma's Revenge, show that HACL achieves comparable or even better performance with significantly fewer training frames.
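The growing-target-area curriculum can be sketched as a simple radius update: expand only once the agent reliably reaches targets at the current radius, so the curriculum tracks the agent's exploration capability. The success threshold, growth factor, and cap below are illustrative assumptions, not values from the paper.

```python
def update_target_radius(radius, success_rate, grow=1.5,
                         threshold=0.8, r_max=100.0):
    """HACL-style automatic curriculum sketch: the target area grows from
    near to far, but only after the agent's success rate at the current
    radius clears a threshold; otherwise the radius is held fixed."""
    if success_rate >= threshold:
        return min(radius * grow, r_max)
    return radius
```

Gating growth on measured success is what makes the curriculum automatic: no training phases need to be scheduled by hand.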
Reinforcement learning has been successfully employed as a powerful tool in designing adaptive optimal controllers. Recently, off-policy learning has emerged to design optimal controllers for systems with completely unknown dynamics. However, current approaches to optimal tracking control design either result in bounded tracking error, rather than zero tracking error, or require partial knowledge of the system dynamics. Moreover, they usually require collecting a large set of data to learn the optimal solution. To obviate these limitations, this paper applies a combination of off-policy learning and experience replay to output regulation tracking control of continuous-time linear systems with completely unknown dynamics. To this end, the off-policy integral reinforcement learning technique is first used to obtain the optimal control feedback gain and to explicitly identify the involved system dynamics from the same data. Second, a data-efficient experience replay method is developed to compute the exosystem dynamics. Finally, the output regulator equations are solved using data measured online. It is shown that the proposed control method stabilizes the closed-loop tracking error dynamics and gives an explicit exponential convergence rate for the output tracking error. Simulation results show the effectiveness of the proposed approach.
Adaptive dynamic programming (ADP), an important branch of reinforcement learning, is a powerful tool in solving various optimal control problems. However, the cooperative game issues of discrete-time multi-player systems with control constraints have rarely been investigated in this field. In order to address this issue, a novel policy iteration (PI) algorithm is proposed based on ADP technique, and its associated convergence analysis is also studied in this brief paper. For the proposed PI algorithm, an online neural network (NN) implementation scheme with multiple-network structure is presented. In the online NN-based learning algorithm, critic network, constrained actor networks and unconstrained actor networks are employed to approximate the value function, constrained and unconstrained control policies, respectively, and the NN weight updating laws are designed based on the gradient descent method. Finally, a numerical simulation example is illustrated to show the effectiveness.
Uncertain factors in environments restrict the intelligence level of industrial robots. Based on deep reinforcement learning, a skill-acquisition method is proposed to address the problems of uncertainty in a complex assembly process. Under the framework of the Markov decision process, the assembly process is represented as a quaternion sequence. The reward function uses a trained classification model that mainly recognizes whether the assembly is successful. The proposed skill-acquisition method is designed to enable robots to acquire assembly skills: the input of the model is the contact state of the assembly process, and the output is the robot action. The robot can complete the assembly by self-learning with little prior knowledge. To evaluate the performance of the proposed skill-acquisition method, simulations and real-world experiments were performed on a low-voltage apparatus assembly. The assembly success rate increases with learning time; in the case of a random initial position and orientation, the success rate was greater than 80% with little prior knowledge. The results show that the robot can master complex assembly through skill acquisition.
We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to explore its environment efficiently, enabling it to excel at sparse-reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.