Long-Term Exploration in Persistent MDPs
Leonid Ugadiarov1, Alexey Skrynnik1,2, and Aleksandr I. Panov1,2 [0000-0002-9747-3837]
1Moscow Institute of Physics and Technology, Moscow, Russia
2Artificial Intelligence Research Institute FRC CSC RAS, Moscow, Russia
skrynnik@isa.ru
Abstract. Exploration is an essential part of reinforcement learning and largely determines the quality of the learned policy. Hard-exploration environments are characterized by huge state spaces and sparse rewards. Under such conditions, an exhaustive exploration of the environment is often impossible, and successful training of an agent requires many interaction steps. In this paper, we propose an exploration method called Rollback-Explore (RbExplore), which utilizes the concept of the persistent Markov decision process, in which agents during training can roll back to visited states. We test our algorithm in the hard-exploration Prince of Persia game, without rewards and domain knowledge. On all levels of the game used in our experiments, our agent outperforms or shows results comparable with state-of-the-art curiosity methods with knowledge-based intrinsic motivation: ICM and RND. An implementation of RbExplore can be found at https://github.com/cds-mipt/RbExplore.
Keywords: Reinforcement learning · Curiosity-based exploration · State space clustering
1 Introduction
Exploration is an essential component of reinforcement learning (RL). During training, agents have to choose between exploiting the current policy and exploring the environment. On the one hand, exploration can make the training process more efficient and improve the current policy. On the other hand, excessive exploration may waste computing resources on visiting task-irrelevant regions of the environment [4,6].
Exploration is essential to solving sparse-reward tasks in environments with high-dimensional state spaces. In this case, an exhaustive exploration of the environment is impossible in practice. A considerable amount of interaction data is required to train an effective policy due to the sparseness of the reward. A common approach is to use knowledge-based or competence-based intrinsic motivation [10]. In the first, more commonly used approach, it is proposed to augment an extrinsic reward with an additional dense intrinsic reward that encourages exploration [2,3,15]. Another approach is to separate an exploration phase from a learning phase [6]. As noted by the authors of [6], the disadvantage of the first approach is that an intrinsic reward is a non-renewable resource. After exploring an area and consuming the intrinsic reward, the agent will likely never return to the area to continue exploration, due to catastrophic forgetting and the inability to rediscover the path to the area, because the intrinsic reward that could lead there has already been consumed.
Implementing a mechanism that reliably returns the agent to the neighborhood of known states, from which further exploration might be most effective, is a challenging task for both approaches. In the case of resettable environments (e.g., Atari games or some robotic simulators), it is possible to save the current state of the simulator and restore it in the future. Many real-world RL applications are inherently reset-free and require a non-episodic learning process. Examples of this class of problems include robotics problems in real-world settings and problems in domains where effective simulators are not available and agents have to learn directly in the real world. Recent work has focused on the reset-free setting [17,18]. On the other hand, for many domains, simulators are available and widely used, at least in the pretraining phase (e.g., robotics simulators [9]). Specific properties of resettable environments make it possible to reliably return to previously visited states and increase exploration efficiency by reducing the required number of interactions with the environment. Ideally, exploration algorithms should visit all states of an environment. However, given the high dimensionality of the state space, it is intractable in practice to store all the visited states. Therefore, effective exploration of the environment remains a difficult problem, even for resettable environments.
In this paper, we propose to formalize the interaction with resettable environments as a persistent Markov decision process (pMDP). We introduce the RbExplore algorithm, which combines the properties of pMDPs with clustering of the state space based on the similarity of states to approach long-term exploration problems. The distance between states in trajectories is used as a feature for clustering: states located close to each other are considered similar, and states distant from each other are considered dissimilar. Clusters are organized into a directed graph where vertices correspond to clusters and arcs correspond to possible transitions between states belonging to different clusters. RbExplore uses a novelty detection module as a filter of promising states. We introduce the Prince of Persia game environment as a hard-exploration benchmark suitable for comparing various exploration methods. A percentage coverage metric over the game's levels is proposed to evaluate exploration. RbExplore outperforms or shows comparable performance with the state-of-the-art curiosity methods ICM and RND on different levels of the Prince of Persia environment.
2 Related Work
Three types of exploration policies can be indicated. Exploration policies of the
first type use an intrinsic reward as an exploration bonus. Exploration strategies
of the second type are specific to multi-goal RL settings where exploration is
driven by selecting sub-goals. Exploration policy of the third type use clustered
representation of the set of visited states.
In recent works [3,4,11,15], the curiosity-driven exploration of deep RL agents is investigated. The exploration methods proposed in these works can be attributed to the first type. The extrinsic sparse reward is replaced or augmented by a dense intrinsic reward measuring the curiosity or uncertainty of the agent at a given state. In this way, the agent is encouraged to explore unseen scenarios and unexplored regions of the environment. It has been shown that such a curiosity-driven policy can improve learning efficiency, overcome the sparse reward problem to some extent, and successfully learn challenging tasks in no-reward settings.
Another line of recent work focuses on multi-goal RL and can be attributed to the second type. The HER algorithm [1] augments trajectories in the memory buffer by replacing the original goals with the goals actually achieved. It helps to obtain a positive reward for initially unsuccessful experience, makes the reward signal denser, and makes learning more efficient, especially in sparse-reward environments. A number of RL methods [7,13] focus on developing a better policy for selecting sub-goals for the augmentation of failure trajectories in order to improve HER. These policies ensure that the distribution of the selected goals adaptively changes throughout training. The distribution should have greater variance in the early stages of training and direct the agent to the original goal in the later stages. Other works [8,12,16] propose methods to generate goals that are feasible and whose complexity corresponds to the quality of the agent's policy. The distribution of generated goals changes adaptively to maintain sufficient variance, ensuring exploration in the goal space.
The Go-Explore [6] algorithm can be attributed to the third type of exploration policy. It builds a clustered lower-dimensional representation of the set of visited states in the form of an archive of cells. Two types of representation are proposed for the Montezuma's Revenge environment: one with domain knowledge, based on discretized agent coordinates, the room number, and collected items, and one without domain knowledge, based on downscaled grayscale images with pixel intensities discretized into eight levels.
Exploration of the state space is implemented as an iterative process. At each iteration, a cell is sampled from the archive, its state is restored in the environment, and the agent starts exploration with a stochastic exploration policy. If the agent visits new cells during the run, they are added to the archive. The visit statistics of existing cells are updated at each iteration. For both types of representation, the cell stores the highest score that the agent had when it visited the cell. A cell is sampled from the archive by a heuristic that prefers more promising cells.
Exploiting domain-specific knowledge makes it difficult to use Go-Explore in a new environment. In our work, we use the idea of clustering the set of visited states and propose to use a supervised learning model to perform clustering based on the similarity of states. We use a reachability network from the Episodic Curiosity Module [15] as a similarity model predicting a similarity score for a pair of states. The clusters are organized into a graph using connectivity information between their states, in a way similar to how the memory graph [14] is built. An RND module [4] is used to detect novel states. Our approach does not exploit domain knowledge, which allows us to apply RbExplore to the Prince of Persia environment without feature handcrafting.
3 Background
3.1 Markov Decision Processes
A Markov Decision Process (MDP) for a fully observable environment is consid-
ered as a model for interaction of an agent with an environment:
$$U = (\mathcal{S}, \mathcal{A}, p, r, \gamma, s_{init}), \qquad (1)$$
where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a state transition distribution, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function, $\gamma \in [0, 1]$ is a discount factor, and $s_{init} \in \mathcal{S}$ is an initial state of the environment.
An episode starts in the state $s_0$. At each step $t$ the agent samples an action $a_t$ based on the current state $s_t$: $a_t \sim \pi(\cdot \mid s_t)$, where $\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a stochastic policy that defines the conditional distribution over the action space. The environment responds with a reward $r_t = r(s_t, a_t, s_{t+1})$ and moves into a new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. The result of the episode is the return $R_0$, the discounted sum of the rewards obtained by the agent during the episode, where $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$. The action-value function $Q^{\pi}$ is defined as the expected return for using action $a_t$ in a certain state $s_t$: $Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t),\, a_{t+1} \sim \pi(\cdot \mid s_{t+1})}[R_t \mid s_t, a_t]$. The state-value function $V^{\pi}$ can be defined via the action-value function $Q^{\pi}$: $V^{\pi}(s) = \max_a Q^{\pi}(s, a)$. The goal of reinforcement learning is to find the optimal policy $\pi^*$:
$$\pi^* = \operatorname{argmax}_{\pi} Q^{\pi}(s, a) \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A} \qquad (2)$$
3.2 Persistent MDPs
The persistent data structure allows access to any version of it at any time [5]. Inspired by such structures, we propose persistent MDPs for RL. We consider an MDP to have the persistence property if for any state $s_v \in \mathcal{S}$ there exists a policy $\pi^p_{s_v}$ which transits the agent from the initial state $s_{init}$ to the state $s_v$ in a finite number of timesteps $T$. Thus, a persistent MDP is expressed as:
$$U^p = (\mathcal{S}, \mathcal{A}, p, r, \gamma, s_{init}, \pi^p), \qquad (3)$$
However, the way of returning to visited states can differ. For example, instead of the policy $\pi^p_{s_v}$, it could be an environment property that allows one to save and load states.
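For concreteness, the snapshot-based variant of this persistence property can be sketched as a thin wrapper around a resettable simulator. The sketch below is illustrative only: the `Emulator` object and its `clone_state`/`restore_state` methods are hypothetical stand-ins for whatever snapshot API a concrete environment exposes.

```python
# Minimal sketch of a persistent-MDP interface on top of a resettable emulator.
# `Emulator`, `clone_state`, and `restore_state` are assumed, not a real API.

class PersistentEnv:
    """Wraps a resettable emulator so that any visited state can be revisited."""

    def __init__(self, emulator):
        self.emulator = emulator

    def step(self, action):
        # Usual MDP transition: returns (observation, reward, done, info).
        return self.emulator.step(action)

    def save(self):
        # Returns an opaque snapshot capturing the full simulator state.
        return self.emulator.clone_state()

    def restore(self, snapshot):
        # Rolls the environment back to a previously saved state.
        self.emulator.restore_state(snapshot)
        return self.emulator.observation()
```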
4 Exploration via State Space Clustering
In this paper, we propose the RbExplore algorithm, which uses the similarity of states to build a clustered representation of the set of visited states. There are two essential components of the algorithm: a similarity model, which predicts a similarity measure for a pair of states, and a graph of clusters, which is a clustered representation of the set of visited states organized as a graph. The scheme of the algorithm is shown in Fig. 1.
A high-level overview of one iteration of the RbExplore algorithm:
1. Generate exploration trajectories: sample $M$ clusters from the graph of clusters $G$ based on cluster visit statistics (e.g., preferring the least visited clusters), roll back to the corresponding states, and run exploration.
2. Generate training data for the similarity model $R$ from the exploration trajectories and from additional trajectories starting at novel states selected by the novelty detection module. Full trajectory prefixes are used to generate negative examples.
3. Train the similarity model $R$.
4. Update the graph $G$ with states from the exploration trajectories and merge its clusters. A state is added to the graph $G$ and forms a new cluster if it is dissimilar to the states already in the graph $G$. The similarity model is used to select such states.
5. Train the novelty detection module on the states from the exploration trajectories.
As a result of one iteration, novel states are added to the graph $G$, the visit statistics of existing clusters are updated, and the similarity model and the novelty detection module are trained on the data collected during the current iteration.
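The iteration above can be summarized in code. The following Python sketch is a schematic outline under assumed helper interfaces (`sample_clusters`, `run_exploration`, `make_similarity_pairs`, and the `graph`, `similarity_model`, and `novelty_module` objects are hypothetical), not the authors' implementation.

```python
# Schematic sketch of one RbExplore iteration; mirrors steps 1-5 above.
# Defaults follow the paper's settings (M = 30 clusters, L = 1000 novel states).

def rbexplore_iteration(graph, similarity_model, novelty_module, env,
                        n_clusters=30, n_novel=1000):
    # 1. Roll back to sampled clusters and collect exploration trajectories.
    clusters = graph.sample_clusters(n_clusters)             # inverse-visit-count sampling
    trajectories = [run_exploration(env, c.snapshot) for c in clusters]

    # 2. Extra rollouts from novel states + pair labeling for the similarity model.
    novel_states = novelty_module.filter_novel(trajectories)[:n_novel]
    extra = [run_exploration(env, s.snapshot) for s in novel_states]
    pairs = make_similarity_pairs(trajectories + extra, graph)  # prefixes give negatives

    # 3. Train the similarity model as a binary classifier on the labeled pairs.
    similarity_model.train_on(pairs)

    # 4. Add dissimilar states as new clusters, then merge redundant clusters.
    graph.update(trajectories, similarity_model)
    graph.merge_clusters(similarity_model)

    # 5. Update the novelty detection module (RND) on the visited states.
    novelty_module.train_on(trajectories)
```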
4.1 Similarity Model
As a feature for clustering, it is proposed to use the distance between states in trajectories: states located close to each other are considered similar, and states distant from each other are considered dissimilar. A supervised model $R: \mathcal{S} \times \mathcal{S} \to [0, 1]$ is used to estimate the similarity measure between states. It takes a pair of states as input and outputs a similarity measure between them. The training dataset is produced by labeling pairs of states from the same trajectory $\tau_k = s^k_1, \ldots, s^k_T$: triples $(s^k_i, s^k_j, y^k_{ij})$ are constructed, where $y^k_{ij}$ is a class label. States $s^k_i, s^k_j$ are considered similar ($y^k_{ij} = 1$) if the distance between them in the trajectory $\tau_k$ is less than $n$ steps: $|i - j| < n$. Negative examples ($y^k_{ij} = 0$) are obtained from pairs of states that are more than $N$ steps apart from each other: $|i - j| > N$. The model $R$ is trained as a binary classifier predicting whether two states are close in the trajectory (class 1) or not (class 0).
Fig. 2 illustrates the training data generation procedure. A neural network model is used as the similarity model $R$, as the experiments are performed in environments with high-dimensional state spaces.
Fig. 1. Scheme of the RbExplore algorithm: $M$ exploration trajectories are generated by running exploration from clusters sampled from $G$. $L$ additional trajectories are generated by running exploration from novel states selected from the exploration trajectories by the novelty detection module. Training data for the similarity model is generated from the exploration trajectories and the additional trajectories. The similarity model is trained on the generated data. $G$ is updated based on the states from the exploration trajectories and their similarity scores with the clusters of $G$. The novelty detection module is trained on the states from the exploration trajectories.
The network $R$ with parameters $w$ is trained on the training dataset $D$ using binary cross-entropy as the loss function:
$$\mathbb{E}_{(s_1, s_2, y) \sim D}\left[-y \log R_w(s_1, s_2) - (1 - y) \log \left(1 - R_w(s_1, s_2)\right)\right] \to \min_w$$
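As an illustration of the labeling scheme above, the sketch below draws positive pairs closer than $n$ steps and negative pairs farther than $N$ steps within a single trajectory; the number of pairs per trajectory and the 50/50 class balance are assumptions.

```python
import random

# Pair-labeling sketch; n and N follow the paper (n = 5, N = 25 in the experiments).

def make_pairs_from_trajectory(traj, n=5, N=25, pairs_per_traj=64):
    """Return (state_i, state_j, label) triples from a single trajectory."""
    triples = []
    T = len(traj)
    for _ in range(pairs_per_traj):
        i = random.randrange(T)
        if random.random() < 0.5:                  # positive: closer than n steps
            j = min(T - 1, i + random.randrange(n))
            label = 1
        else:                                      # negative: farther than N steps
            far = [j for j in range(T) if abs(i - j) > N]
            if not far:
                continue
            j = random.choice(far)
            label = 0
        triples.append((traj[i], traj[j], label))
    return triples
```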
4.2 Graph of Clusters
The clustering of the state space $\mathcal{S}$ is an iterative process using the similarity model $R$ and a chosen stochastic exploration policy $\pi_{explore}(a|s)$ (e.g., a uniform distribution over actions). A cluster $v = (s, snap)$ is a pair of a state $s$, the center of the cluster, and the corresponding snapshot of the simulator $snap$. At each iteration, a cluster is selected from which exploration will be continued.
Fig. 2. Generation of training data for the similarity model. a, b) Generation of positive and negative examples from states of the same trajectory. c) Generation of negative examples using the full prefix of the starting cluster of the trajectory.
The state of the selected cluster is restored in the environment using the corresponding snapshot, and the agent starts exploration with the stochastic exploration policy $\pi_{explore}$. For each state $s_i$ of the obtained trajectory $\tau = (s_1, snap_1), \ldots, (s_T, snap_T)$, a measure of similarity with the current set of clusters is calculated. A state $s_i$ is considered as belonging to the cluster $v = (s, snap)$ if the measure of similarity between the state and the cluster's state is greater than the selected threshold $\theta_{sim}$: $R(s, s_i) > \theta_{sim}$. Otherwise, a new cluster is created.
Clusters are organized into a directed graph $(V, E)$. Each vertex of the graph corresponds to a cluster. If two successive states $s_k$ and $s_{k+1}$ in the same trajectory $\tau = s_1, \ldots, s_T$ belong to different clusters $v_i$ and $v_j$, an arc $(v_i, v_j)$ between those clusters is added to the graph. Cluster visit statistics and arc visit statistics are updated at each iteration. The graph is initialized with the initial state of the environment $(s_{init}, snap_{init})$. The cluster for exploration is selected from the graph using a sampling strategy $\sigma$ that can take into account the structure of the graph and the collected statistics (e.g., the probability of sampling a cluster is inversely proportional to the number of visits). Each iteration of the graph building procedure can be alternated with training the similarity model $R$ on the obtained trajectories. The search for the cluster to which the current state $s_i$ of the trajectory belongs can be accelerated by first considering the vertices adjacent to the cluster to which the previous state $s_{i-1}$ was assigned.
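A compact data-structure sketch of the graph of clusters and the assignment rule described above is given below; `similarity(s1, s2)` is a placeholder for the trained model $R$, and the linear scan over clusters omits the neighbor-first speed-up mentioned in the text.

```python
from collections import defaultdict

class ClusterGraph:
    """Sketch of the directed graph of clusters used by RbExplore."""

    def __init__(self, init_state, init_snapshot):
        self.clusters = {0: (init_state, init_snapshot)}  # id -> (center state, snapshot)
        self.visits = defaultdict(int)                    # id -> visit count
        self.arcs = set()                                 # (i, j) transitions between clusters
        self.next_id = 1

    def assign(self, state, snapshot, similarity, theta_sim=0.5):
        """Return the id of the cluster `state` belongs to, creating one if needed."""
        for cid, (center, _) in self.clusters.items():
            if similarity(center, state) > theta_sim:
                self.visits[cid] += 1
                return cid
        cid = self.next_id                                # dissimilar to all: new cluster
        self.clusters[cid] = (state, snapshot)
        self.next_id += 1
        return cid

    def add_transition(self, i, j):
        if i != j:
            self.arcs.add((i, j))
```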
In order to improve the quality of the similarity model $R$ on states from novel regions of the state space $\mathcal{S}$, an RND module is used. For each state, the RND module outputs an intrinsic reward, which is used as a measure of the state's novelty. A state is considered novel if the intrinsic reward is greater than $\beta_{intrinsic}$. At each iteration, all states from the trajectory $\tau$ that the RND module detects as novel are placed into the buffer of novel states $B$. The buffer $B$ is used to generate additional training data for the similarity model $R$ which includes states from novel regions of the state space $\mathcal{S}$. A set of states $\{(s^*_i, snap^*_i)\} \sim B$ is randomly sampled from the buffer $B$, and the agent starts exploration with the exploration policy by restoring the sampled simulator states. The resulting trajectories are used solely to generate additional training data for the similarity model $R$.
When an exploration trajectory is processed and a new cluster $v$ is added to the graph $G$, a new arc from $v$ to the parent cluster from which $v$ was created is added to the set of arcs to parent clusters $E_{parent}$. A prefix, a sequence of states from the parent cluster to $v$, is stored along with the arc. Thus, for any cluster in the graph $G$, it is possible to construct a sequence of states that leads to the initial cluster $(s_{init}, snap_{init})$. This property is used to add negative pairs $(s_1, s_2)$ to the training dataset of the similarity model $R$ such that $s_1 \in \tau_1$ and $s_2 \in \tau_2$ are distant from each other and $\tau_1 \neq \tau_2$. If the exploration trajectory $\tau$ started from a cluster $(s, snap)$, a full prefix $\tau_{prefix} = s_{init}, \ldots, s$ for the trajectory $\tau = (s, snap), \ldots$ is constructed to obtain additional negative examples. For a state $s_1 \in \tau$, a sufficiently distant state $s_2 \in \tau_{prefix}$ is randomly selected to form a negative example $(s_1, s_2)$. Fig. 2 (c) illustrates this procedure.
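A sketch of this prefix-based negative sampling follows: states of a trajectory are paired with sufficiently distant states of the full prefix of its starting cluster. The `margin` separation and the number of pairs are assumed values.

```python
import random

def prefix_negatives(trajectory, full_prefix, n_pairs=32, margin=25):
    """Pair states of `trajectory` with distant states of its prefix (label 0)."""
    triples = []
    for _ in range(n_pairs):
        i = random.randrange(len(trajectory))
        # A prefix state at index j lies (len(full_prefix) - 1 - j) steps before the
        # trajectory start, so its distance to trajectory[i] is that offset plus i.
        eligible = [j for j in range(len(full_prefix))
                    if (len(full_prefix) - 1 - j) + i > margin]
        if eligible:
            j = random.choice(eligible)
            triples.append((trajectory[i], full_prefix[j], 0))  # 0 = dissimilar
    return triples
```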
Redundant clusters are created at each iteration due to the inaccuracy of the similarity model $R$. A cluster merge procedure is proposed to mitigate this issue. It tests all pairs of clusters in the graph $G$ and merges a pair $v_1 = (s_1, snap_1)$, $v_2 = (s_2, snap_2)$ into a new cluster if the similarity measure of their states is greater than the selected threshold $\theta_{merge}$: $R(s_1, s_2) > \theta_{merge}$. The new cluster is incident to every arc that was incident to either of the two original clusters. Cluster visit statistics are summed during merging. The state and the snapshot of the cluster that was added to the graph $G$ earlier are selected as the state and the snapshot of the new cluster.
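The merge pass can be sketched against the `ClusterGraph` structure shown earlier: all pairs are tested, the cluster added earlier (smaller id) survives, visit counts are summed, and arcs are redirected to the survivor. This is a simplified single pass, not the authors' exact procedure.

```python
def merge_clusters(graph, similarity, theta_merge=0.5):
    """One pairwise merge pass over a ClusterGraph-like object (sketch)."""
    ids = sorted(graph.clusters)
    for pos, i in enumerate(ids):
        if i not in graph.clusters:
            continue                                   # already merged away
        for j in ids[pos + 1:]:
            if j not in graph.clusters:
                continue
            if similarity(graph.clusters[i][0], graph.clusters[j][0]) > theta_merge:
                graph.visits[i] += graph.visits.pop(j, 0)        # sum visit statistics
                graph.arcs = {(i if a == j else a, i if b == j else b)
                              for (a, b) in graph.arcs}          # rewire arcs to i
                graph.arcs.discard((i, i))                       # drop self-loops
                del graph.clusters[j]                            # keep the earlier cluster
```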
5 The Prince of Persia Domain
We evaluate our algorithm on the challenging Prince of Persia game environment, which has over ten complex levels. Fig. 3 shows the first level of the game. To pass it, the prince needs to find a sword while avoiding traps, return to the starting point, defeat the guard, and finish the level. In most cases, the agent advances to the next level by passing through the final door, which it also needs to open somehow.
The input of the agent is a 96x96 grayscale image. The agent chooses from seven possible joystick actions: no-op, left, right, up, down, A, B. The same action may work differently depending on the situation in the game. For example, the agent can jump forward over different distances, depending on the run-up. The agent can jump after using action A, or strike with a sword if in combat. Also, the agent can interact with various objects: ledges, pressure plates, jugs, and breakable plates.
The environment is difficult for RL algorithms because its action space has context-dependent effects and it requires mastery of many game aspects, such as fighting. Also, the first reward is thousands of steps away from the agent's initial position.
Fig. 3. The Prince of Persia environment. a,b) Examples of environment observations
from the agent’s view. c) The complete map of the first level of the game. The agent’s
task is to get from the initial location (I) to the final door (D). To solve this problem,
the agent needs to: pick up a sword (S), go back and defeat a guard (G), stand on the
pressure plate (P) to open the door, proceed to the exit. The environment has many
obstacles, such as cell doors (C) and various traps.
We use the percentage coverage metric $\%\,Cov = \frac{|U_{visited}|}{|U_{full}|}$, i.e., the ratio of the coverage to the maximum coverage of the corresponding level, where $U_{visited}$ is the set of visited units and $U_{full}$ is the full coverage set. We consider the minimum unit to be the area that roughly corresponds to the space above one floor plate. For example, for the first level, in the room with the sword the full area has 36 units, but the agent can visit only 34 of them.
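For clarity, the metric amounts to the following one-line computation over sets of visited and reachable units (unit identifiers are arbitrary placeholders):

```python
def coverage(visited_units, reachable_units):
    """% Cov: fraction of reachable level units the agent has visited."""
    return len(visited_units & reachable_units) / len(reachable_units)

# e.g., visiting 17 of the 34 reachable units of the sword room gives 0.5
```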
6 Experiments
We evaluate the exploration performance of our RbExplore algorithm on the
first three levels of the Prince of Persia environment alongside state-of-the-art
curiosity methods ICM and RND.
6.1 Experimental Setup
Raw environment frames are preprocessed by applying a frameskip of 4 frames, converting to grayscale, and downsampling to 96x96. Frame pixels are rescaled to the 0-1 range. The neural network $R$ consists of two subnetworks: a ResNet-18 network and a four-layer fully connected network. ResNet-18 accepts a frame as input and produces its embedding of dimension 512. The embeddings of the pair's frames are concatenated and fed into the fully connected network, which performs binary classification of the pair of embeddings.
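A sketch of this architecture in PyTorch is given below; the widths of the fully connected head and the single-channel input adaptation are assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision

class SimilarityNet(nn.Module):
    """ResNet-18 encoder + 4-layer MLP head over concatenated embeddings (sketch)."""

    def __init__(self, hidden=512):
        super().__init__()
        encoder = torchvision.models.resnet18(num_classes=hidden)
        # Single-channel (grayscale) input instead of RGB.
        encoder.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),                 # logit for "similar" vs "dissimilar"
        )

    def forward(self, s1, s2):                # s1, s2: (B, 1, 96, 96) tensors in [0, 1]
        e1, e2 = self.encoder(s1), self.encoder(s2)
        return torch.sigmoid(self.head(torch.cat([e1, e2], dim=1)))
```

The output probability can then be trained with binary cross-entropy against the pair labels, as described in Sect. 4.1.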
The same algorithm parameters are used for all levels. The maximum number of frames per exploration trajectory is 1,500. Episodes are terminated on the loss of life. Training data for the similarity model $R$ is generated with parameters $n = 5$ and $N = 25$. The similarity threshold is $\theta_{sim} = 0.5$. The similarity threshold for cluster merging is $\theta_{merge} = 0.5$. Merging is run every 15 iterations. For exploration, clusters are sampled from the graph $G$ with probabilities inversely proportional to the number of visits. Every iteration, $M = 30$ clusters are sampled from the graph. The uniform distribution over all actions is used as the exploration policy $\pi_{explore}$. The RND module intrinsic reward threshold for detecting novel states is $\beta_{intrinsic} = 2.5$. $L = 1000$ states are sampled from the buffer of novel states.
To prevent the formation of a large number of clusters at the early stages due to the low quality of the similarity model $R$, the similarity model is pretrained for 500,000 steps with a gradual increase of the similarity threshold from 0 to $\theta_{sim}$. At the same time, the necessary normalization parameters of the RND module are initialized. After pretraining, the graph $G$ is reset to the environment's initial state $(s_{init}, snap_{init})$ and RbExplore is restarted with the fixed similarity threshold $\theta_{sim}$.
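A possible form of this warm-up is a simple linear ramp of the threshold over the pretraining steps; the linear shape is an assumption, as the paper only states a gradual increase from 0 to the final threshold.

```python
def threshold_schedule(step, pretrain_steps=500_000, theta_sim=0.5):
    """Linearly anneal the similarity threshold from 0 to theta_sim during pretraining."""
    return theta_sim * min(1.0, step / pretrain_steps)
```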
6.2 Exploring the Prince of Persia Environment
By design of the Prince of Persia environment, the agent's observation does not always contain information about whether the agent carries a sword. To get around this issue, the agent starts the first level with the sword. Also, the agent is placed at the point where the sword is located. This location is far enough from the final door, so reaching the final door is still a challenging task. For the other levels, we did not make any changes to the initial state.
The evaluation of RbExplore, ICM, and RND on the first three levels of the Prince of Persia environment is shown in Fig. 5. On the first level, RbExplore significantly outperforms ICM and RND and visits all possible rooms of the level. The visualization of the coverage is shown in Fig. 4.
On levels two and three, none of the algorithms were able to visit all the rooms in 15 million steps. On level two, RbExplore shows slightly worse results than RND and ICM. We attribute this to the fact that the learning process in RND and ICM is driven by an exploration bonus, which helps them explore local areas inside rooms more thoroughly. Each of the algorithms was able to visit only seven rooms at the very beginning of the level. On level three, RbExplore shows slightly better coverage than RND, and both of them outperform ICM.
Fig. 4. The visualization of the first-level coverage for the best RND, ICM, and RbExplore runs. RbExplore significantly outperforms RND and ICM.
Fig. 5. Performance of RbExplore, RND, and ICM on the first three levels of the Prince of Persia environment. Left (Level 1): RbExplore significantly outperforms RND and ICM. Center (Level 2): all methods visited only seven rooms; RND and ICM outperform RbExplore as they better handle local exploration and explore rooms more thoroughly. Right (Level 3): RbExplore significantly outperforms ICM and shows performance comparable with RND. Curves are averaged over three runs; the shading indicates the min-max range.
6.3 Ablation Study
In order to evaluate the contribution of each component of the RbExplore algorithm, we perform an ablation study. Experiments were run on level one with two versions of RbExplore. The first version does not merge clusters in the graph $G$. The second one does not build the full trajectory prefix when generating negative examples for the training data of the similarity model; thus, states of negative pairs are sampled only from the same trajectory. Fig. 6 shows that disabling these components hurts the performance of the algorithm.
Additional experiments were run on level one to study the impact of the value of the $\theta_{merge}$ parameter on performance. Fig. 6 shows that the performance of RbExplore with $\theta_{merge} \in \{0.25, 0.75\}$ is worse than that of RbExplore with $\theta_{merge} = 0.5$.
7 Conclusion
In this paper, we introduce the RbExplore pure exploration algorithm, which uses a formalized version of a resettable environment called a persistent MDP. The experiments showed that RbExplore, coupled with a simple exploration policy that is a uniform distribution over actions, demonstrates performance comparable with the RND and ICM methods in the hard-exploration environment of the Prince of Persia game in a no-reward setting. RbExplore, ICM, and RND got stuck on the second and third levels at roughly the same locations, where the agent must perform a very specific sequence of actions over a long time horizon to advance further. Combining RbExplore exploration with the exploitation of RL approaches that also utilize pMDPs is an important direction for future work to resolve this problem.
Fig. 6. Left (Level 1): comparison of RbExplore with its versions that do not merge clusters (No merge) and do not build the full trajectory prefix to generate negative examples for the training data of the similarity model (No prefix). Disabling the merge procedure and the generation of negative examples with the use of trajectory prefixes hurts the performance of the RbExplore algorithm. Right (Level 1): comparison of RbExplore with similarity thresholds for merging $\theta_{merge} \in \{0.25, 0.5, 0.75\}$. The performance of RbExplore with $\theta_{merge} \in \{0.25, 0.75\}$ is worse than that of RbExplore with $\theta_{merge} = 0.5$. Curves are averaged over three runs; the shading indicates the min-max range.
Acknowledgements. This work was supported by the Russian Science Foun-
dation (Project No. 20-71-10116).
References
1. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mc-
Grew, B., Tobin, J., Pieter Abbeel, O., Zaremba, W.: Hindsight experience replay.
In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30,
pp. 5048–5058. Curran Associates, Inc. (2017), https://proceedings.neurips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf
2. Bellemare, M.G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos,
R.: Unifying count-based exploration and intrinsic motivation. In: Lee, D.D.,
Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances
in Neural Information Processing Systems 29: Annual Conference on Neu-
ral Information Processing Systems 2016, December 5-10, 2016, Barcelona,
Spain. pp. 1471–1479 (2016), https://proceedings.neurips.cc/paper/2016/hash/afda332245e2af431fb7b672a68b659d-Abstract.html
3. Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., Efros, A.A.: Large-
scale study of curiosity-driven learning. In: ICLR (2019)
4. Burda, Y., Edwards, H., Storkey, A., Klimov, O.: Exploration by random net-
work distillation. In: International Conference on Learning Representations (2019),
https://openreview.net/forum?id=H1lJJnR5Ym
5. Driscoll, J.R., Sarnak, N., Sleator, D.D., Tarjan, R.E.: Making data structures
persistent. Journal of computer and system sciences 38(1), 86–124 (1989)
6. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: Go-explore: a new
approach for hard-exploration problems (2021)
7. Fang, M., Zhou, T., Du, Y., Han, L., Zhang, Z.: Curriculum-guided hindsight
experience replay. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 32, pp. 12623–12634. Curran Associates, Inc. (2019), https://proceedings.neurips.cc/paper/2019/file/83715fd4755b33f9c3958e1a9ee221e1-Paper.pdf
8. Florensa, C., Held, D., Geng, X., Abbeel, P.: Automatic goal generation for rein-
forcement learning agents. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th
International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 80, pp. 1515–1528. PMLR, Stockholmsmässan, Stockholm, Sweden
(7 2018), http://proceedings.mlr.press/v80/florensa18a.html
9. OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B.,
Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor,
S., Tobin, J., Welinder, P., Weng, L., Zaremba, W.: Learning dexterous in-hand
manipulation (2019)
10. Oudeyer, P.Y., Kaplan, F.: How can we define intrinsic motivation? In: Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems (2008)
11. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by
self-supervised prediction. In: ICML (2017)
12. Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., Lillicrap, T.:
Automated curriculum generation through setter-solver interactions. In: Interna-
tional Conference on Learning Representations (2020), https://openreview.net/forum?id=H1e0Wp4KvH
13. Ren, Z., Dong, K., Zhou, Y., Liu, Q., Peng, J.: Exploration via hindsight goal
generation. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 32, pp. 13485–13496. Curran Associates, Inc. (2019), https://proceedings.neurips.cc/paper/2019/file/57db7d68d5335b52d5153a4e01adaa6b-Paper.pdf
14. Savinov, N., Dosovitskiy, A., Koltun, V.: Semi-parametric topological memory
for navigation. In: International Conference on Learning Representations (2018),
https://openreview.net/forum?id=SygwwGbRW
15. Savinov, N., Raichuk, A., Vincent, D., Marinier, R., Pollefeys, M., Lillicrap,
T., Gelly, S.: Episodic curiosity through reachability. In: International Confer-
ence on Learning Representations (2019), https://openreview.net/forum?id=SkeK3s0qKQ
16. Skrynnik, A., Panov, A.I.: Hierarchical Reinforcement Learning with Clustering
Abstract Machines. In: Kuznetsov, S.O., Panov, A.I. (eds.) Artificial Intelligence.
RCAI 2019. Communications in Computer and Information Science, vol. 1093, pp. 30–43. Springer (2019). https://doi.org/10.1007/978-3-030-30763-9_3
17. Xu, K., Verma, S., Finn, C., Levine, S.: Continual learning of control primitives:
Skill discovery via reset-games (2020)
18. Zhu, H., Yu, J., Gupta, A., Shah, D., Hartikainen, K., Singh, A., Kumar, V., Levine,
S.: The ingredients of real world robotic reinforcement learning. In: International
Conference on Learning Representations (2020), https://openreview.net/forum?id=rJe2syrtvS