Rethinking Exploration and Experience
Exploitation in Value-Based Multi-Agent
Reinforcement Learning
ANATOLII BORZILOV1,2, ALEXEY SKRYNNIK1,3 , and ALEKSANDR PANOV1,2,3
1Federal Research Center "Computer Science and Control", 9 60-Letiya Oktyabrya Ave., Moscow, 117312, Russia
2Moscow Institute of Physics and Technology, 9 Institutsky per., Dolgoprudny, 141701, Russia
3AIRI, 32 Kutuzovsky Ave., Moscow, 121170, Russia
Corresponding author: Anatolii Borzilov (e-mail: borzilov.av@gmail.com).
This work was partially supported by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy
agreement (agreement identifier 000000D730321P5Q0002; grant No. 70-2021-00138).
ABSTRACT Cooperative Multi-Agent Reinforcement Learning (MARL) focuses on developing strategies
to effectively train multiple agents to learn and adapt policies collaboratively. Although MARL is a relatively new
area of research, most MARL methods build on well-established approaches from single-agent deep reinforcement
learning due to their proven effectiveness. In this paper, we focus on the exploration problem inherent in
many MARL algorithms. These algorithms often introduce new hyperparameters and incorporate auxiliary
components, such as additional models, which complicate the adaptation process of the underlying RL
algorithm to better fit multi-agent environments. We aim to optimize a deep MARL algorithm with minimal
modifications to the well-known QMIX approach. Our investigation of the exploitation-exploration dilemma
shows that the performance of state-of-the-art MARL algorithms can be matched by a simple modification of
the ϵ-greedy policy. This modification depends on the ratio of available joint actions to the number of agents.
We also improve the training aspect of the replay buffer to decorrelate experiences based on recurrent rollouts
rather than episodes. The improved algorithm is not only easy to implement, but also aligns with state-of-the-
art methods without adding significant complexity. Our approach outperforms existing algorithms in four of
seven scenarios across three distinct environments while remaining competitive in the other three.
INDEX TERMS Exploration, multi-agent reinforcement learning, value-based methods.
I. INTRODUCTION
Multi-Agent Reinforcement Learning (MARL) is an emerg-
ing field in artificial intelligence that aims to develop robust
strategies for training multiple agents to learn and adapt their
policies collaboratively [1], [2]. MARL methods are designed
for both antagonistic and cooperative problems. Both formu-
lations require taking into account the actions of other agents,
and most practical applications also work under conditions of
partial observability, where information about the goals and
actions of other agents is not fully known.
Notable examples of MARL success include its application
to complex tasks such as autonomous driving, where multiple
vehicles must coordinate to navigate, as demonstrated by the
Nocturne framework [3]. Examples of important applications
of the multi-agent paradigm include IMP-MARL, which pro-
vides a platform for evaluating the scalability of cooperative
MARL methods responsible for scheduling inspections and
repairs of specific system components to minimize mainte-
nance costs [4]. Another example, MATE, addresses target
coverage control challenges in real-world scenarios by pre-
senting an asymmetric cooperative-competitive game with
two sets of learning agents, cameras, and targets, each with
opposing goals [5]. These successes highlight MARL’s poten-
tial for solving real-world problems that require coordinated
action among multiple agents.
Despite these successes, even for tasks that are close to
real-world applications, MARL algorithms are often tested in
game-like environments. For example, the StarCraft Multi-
agent Challenge (SMAC) and SMACv2 simulators are based
on the strategy game StarCraft II1, in which teams of agents
work together to defeat opposing groups [6], [7]. Similarly,
research in Google Football demonstrates the applicability of
MARL to complex, dynamic tasks [8]. Although relatively
new, MARL methods often leverage basic techniques from
1StarCraft and StarCraft II are trademarks of Blizzard Entertainment™.
single-agent deep reinforcement learning tasks due to their proven success.

TABLE 1. List of notations.

Notation                Description
S                       Set of all the environment states
s                       State of the environment
A                       Set of all agents
a                       Agent
n                       Number of agents
U                       Set of all actions
u^a                     Action of an agent a
u                       Joint action of the agents
P(s′|s, u)              Transition function
r(s, u)                 Reward function
γ                       Discount factor
R_t                     Discounted return at step t
Z                       Set of all individual observations
O(s, a)                 Observation function
τ^a                     Action-observation history of an agent a
τ                       Joint action-observation history of the agents
π^a(u^a|τ^a)            Policy of an agent a
Q^π(s_t, u_t)           Action-value function
θ                       Network parameters
Q_tot(τ_t, u_t, θ)      Joint action-value function
L(θ)                    TD error
D                       Replay buffer
D                       Replay buffer size
B                       Batch of transitions
B                       Batch size
One of the key challenges in designing MARL algorithms
is to account for stochasticity in the set of experiences from
the environment on which the agent is trained. In addition
to changes in the agent’s own policy, which updates the
distribution of observations it receives, the policies of other
agents change dynamically. This leads to difficulties in adapt-
ing classical statistical learning methods, including neural
networks, to MARL problems. Another core challenge in
MARL is the exploration problem, which becomes increas-
ingly difficult in complex scenarios that require sophisticated
cooperation among multiple agents. Agents risk falling into
local optima, preventing them from acquiring the complex
strategies needed to solve the problem. While the common
approach to exploration in value-based MARL is the ϵ-greedy
strategy [9], a variety of novel methods, such as MAVEN [10]
and SMMAE [11], focus on coordinated exploration.
In our paper, we focus on the popular value-based method
called QMIX [12]. This method has become the theoretical
and technical foundation for many MARL approaches [11],
[13]–[16]. First, we investigate how the exploration strategy
of value-based methods can be improved and propose a sim-
ple approach based on the number of available joint actions.
Second, we examine the implementation of experience ex-
ploitation, which is often not well-articulated in the literature.
We address how to effectively sample data from the replay
buffer for methods using recurrent networks. These networks
are vital in MARL, as they allow agents to retain historical
information essential for decision-making in partially observ-
able environments typical of multi-agent systems.
To summarize, we make the following contributions:
Contribution 1:
We propose a novel exploration strategy for value-based
methods, which leverages the number of available joint ac-
tions, improving the exploration-exploitation trade-off.
Contribution 2:
We analyze the under-explored area of experience exploita-
tion in value-based MARL methods, specifically focusing on
how to effectively sample data from the replay buffer when
recurrent networks are used.
Contribution 3:
We extensively study our proposed modifications on two
benchmarks, SMAC and POGEMA, demonstrating that our
approach achieves comparable or even superior results to
state-of-the-art value-based MARL methods.
The paper is organized as follows: Section II outlines the background of the multi-agent reinforcement learning field, Section III reviews the related literature, Section IV describes the methodology, and Sections V and VI detail the experimental setup and results.
II. BACKGROUND
This paper addresses the problem of multi-agent cooperative tasks, formalized as a decentralized partially observable Markov decision process (Dec-POMDP) given by the tuple G = ⟨S, A, U, P, r, Z, O, n, γ⟩. A state s ∈ S describes the complete state of the environment at the current moment. At each timestep, each agent a ∈ A ≡ {1, ..., n} chooses an action u^a ∈ U; the chosen actions of all agents form a joint action u ∈ U ≡ U^n. These actions cause the environment to transition to a new state according to the transition function P(s′|s, u) : S × U × S → [0, 1]. Rewards are given according to the reward function r(s, u) : S × U → R, which is shared by all agents, and γ ∈ [0, 1) is a discount factor.

At each timestep, each agent a receives an individual observation z^a ∈ Z according to the observation function O(s, a) : S × A → Z. Each agent maintains an action-observation history τ^a ∈ T ≡ (Z × U)*, on which the agent's policy π^a(u^a|τ^a) : T × U → [0, 1] is conditioned. The joint policy π is associated with an action-value function

$$Q^{\pi}(s_t,\mathbf{u}_t) = \mathbb{E}_{s_{t+1:\infty},\,\mathbf{u}_{t+1:\infty}}\left[R_t \mid s_t,\mathbf{u}_t\right], \quad \text{where} \quad R_t = \sum_{i=0}^{\infty}\gamma^{i} r_{t+i}$$

is the discounted return. The training objective is to find the optimal action-value function.
DQN [17] is a popular algorithm for single-agent tasks, which learns the agent's action-value function. For multi-agent tasks, we learn the joint action-value function Q_tot(τ_t, u_t, θ), where τ ∈ T is a joint action-observation history and θ are the network parameters. During learning, a replay buffer D consisting of tuples (τ_t, u_t, r_t, τ_{t+1}) is utilized. The network parameters θ are learned by minimizing the TD error:
$$\mathcal{L}(\theta) = \mathbb{E}_{(\tau_t,\mathbf{u}_t,r_t,\tau_{t+1})\sim\mathcal{D}}\left[\left(r_t + \gamma \max_{\mathbf{u}_{t+1}} Q_{tot}(\tau_{t+1},\mathbf{u}_{t+1},\theta^{-}) - Q_{tot}(\tau_t,\mathbf{u}_t,\theta)\right)^{2}\right], \qquad (1)$$

where θ⁻ are the parameters of the target network, which are periodically updated with θ.
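For illustration, a minimal PyTorch-style sketch of how the loss in (1) can be computed for a batch of transitions is given below. The tensors q_tot, q_tot_target_next, rewards, and terminated are assumed to be produced elsewhere (by the agent and mixing networks and by the sampled batch); the function is a simplified sketch rather than the PyMARL implementation, and it additionally masks the bootstrap term at terminal steps.

import torch

def qmix_td_loss(q_tot, q_tot_target_next, rewards, terminated, gamma=0.99):
    """TD loss of Eq. (1) for a batch of transitions.

    q_tot:             Q_tot(tau_t, u_t) of the chosen joint actions, shape [B]
    q_tot_target_next: max_u Q_tot(tau_{t+1}, u) from the target network, shape [B]
    rewards:           shared team reward r_t, shape [B]
    terminated:        1.0 if the episode ended at step t+1, else 0.0, shape [B]
    """
    targets = rewards + gamma * (1.0 - terminated) * q_tot_target_next
    td_error = q_tot - targets.detach()  # no gradient flows into the target values
    return (td_error ** 2).mean()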
One of the core issues of cooperative MARL is that the simultaneous learning of multiple agents induces non-stationarity of the environment, which makes fully decentralized learning of multiple agents unstable. To address this issue and improve training stability under non-stationary conditions, the centralized training with decentralized execution (CTDE) paradigm was introduced. In this approach, execution is decentralized, which means that each agent a chooses actions according to its local action-observation history τ^a. In contrast, training is centralized, and during training the learning algorithm has access to the state s of the environment. In order to allow each agent a to participate in decentralized execution, it is important to ensure that:
$$\arg\max_{\mathbf{u}} Q^{\pi}(s,\mathbf{u}) = \left(\arg\max_{u^1} Q_1(\tau^1,u^1),\;\ldots,\;\arg\max_{u^n} Q_n(\tau^n,u^n)\right) \qquad (2)$$
One of the popular methods to solve the problem is
QMIX [12]. This method is a variant of DQN [17] for mul-
tiagent tasks, and is based on the ideas of VDN [18]. For
each agent, QMIX uses a DRQN network to calculate the
individual value function Qa(τa,ua). These networks receive
the current observation as input at each timestep. QMIX
then employs a mixing network, which takes the outputs of
the agents’ networks as input and produces the total value
Qtot (τ,u). The weights of the mixing network are generated
by a hypernetwork based on the current state and are non-
negative. The use of non-negative weights in the mixing
network ensures that $\frac{\partial Q_{tot}}{\partial Q_a} \geq 0$ for all a ∈ A, which in turn guarantees condition (2).
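To make this constraint concrete, the sketch below shows one way a QMIX-style mixing network can obtain non-negative weights from hypernetworks by taking the absolute value of their outputs. The layer sizes and the exact hypernetwork architecture are illustrative assumptions rather than the original implementation; only the abs() applied to the generated weights matters for the monotonicity guarantee.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Sketch of a QMIX-style mixing network: hypernetworks map the global
    state to the mixer weights, and abs() keeps those weights non-negative,
    which guarantees dQ_tot / dQ_a >= 0 for every agent a."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents] individual Q_a values; state: [batch, state_dim]
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot, shape [batch]

Because every path from an individual Q_a to Q_tot passes only through these non-negative weights and monotonically increasing activations, the argmax of Q_tot decomposes into the per-agent argmaxes of (2).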
III. RELATED WORK
The exploration-exploitation dilemma in MARL is closely
related to similar challenges in deep RL. Many techniques
originally developed for single-agent settings have been
adapted for MARL. For instance, curiosity-driven explo-
ration, a method that enhances the exploration process, has
been effectively integrated into MARL to manage complex-
ities arising from multiple interacting agents. Additionally,
tools such as replay buffers and recurrent neural networks are
employed to better handle and utilize data collected during
agent interactions. However, while these adaptations improve
exploration and data utilization, they do not directly address
the non-stationarity problem inherent in MARL. Below, we
provide an overview of such techniques in single-agent RL
and their application in MARL, highlighting their strengths
and limitations in the multi-agent context.
A. EXPLORATION IN MARL.
The exploration problem is a well-studied topic in reinforce-
ment learning. Bootstrapped DQN [19] learns several sepa-
rate Q-value functions, and at the beginning of each episode
samples one of these functions. Then, the agent follows the
greedy policy for that sampled function. This way, the method
allows the agent to use temporally-extended exploration dur-
ing the whole episode. ϵz-greedy [20] modifies the ϵ-greedy method: instead of sampling single actions, it samples options of actions, which the agent follows for a number of steps sampled according to the distribution z. Another
approach is to use an intrinsic reward to direct the explo-
ration process. ICM [21] adopts an inverse model to extract
features out of inputs that ignore uncontrollable aspects of
the environment, and then uses the prediction error of these
features as an intrinsic reward. VIME [22] uses Bayesian
neural networks to approximate environment dynamics and
then maximizes the information gain about the agent’s belief
of environment dynamics. VDM [23] models the stochas-
ticity in dynamics to enhance predictions and computes the
intrinsic reward using the environment's state-action transition probabilities. RND [24] computes the exploration bonus
based on state novelty, which is estimated by distilling a
fixed randomly initialized network into another one. State
marginal matching (SMM) technique [25] learns a policy to match its state marginal distribution with a target state distribution. NGU [26] computes an intrinsic reward based on two
components: exploration bonus for lifelong novelty, which
is computed using RND, and episodic novelty bonus. To
compute the episodic novelty bonus, it uses episodic memory,
which contains all the visited states in the current episode.
Then, it encourages the agent to visit as many different states
as possible during a single episode. Agent57 [27], being
based on NGU, also learns a family of policies with different
degrees of exploration and exploitation. It uses an adaptive
meta-controller to choose among these policies, which makes it possible to control the intensity of exploration during the training process. Instead of exploring novel states, SMiRL [28] tries to minimize surprise from new states, thus developing
behavior that decreases entropy. Such an approach allows
the learning agent to develop meaningful skills in unstable
environments, where unexpected events happen on their own.
Generally, multi-agent methods adapt existing single-agent exploration approaches. LIIR [29] uses an
individual intrinsic reward for each agent, which allows the
agents to be stimulated differently. The parameters of intrinsic
rewards are learned using the centralized critic to maximize
the team reward. EMC [30] utilizes a curiosity module, which
is trained to predict individual Q-values of agents. These
prediction errors are used as additional intrinsic rewards.
Wang et al. [31] introduce two methods that are based on measuring the interactions between agents to compute intrinsic rewards: EITI uses the mutual information between
agents’ trajectories, and EDTI quantifies the influence of
an agent on expected returns of other agents. MAVEN [10]
uses a latent variable, which is generated by a hierarchical
policy, to perform coordinated exploration in different modes.
Then, MAVEN maximizes the mutual information between the latent variable and the observed trajectories to achieve diverse behavior.
CMAE [32] utilizes restricted space exploration and shared
goals. It first explores goals from a low-dimensional restricted
space and then trains exploration policies to reach these goals,
which represent under-explored states. This method showed
significant improvement in sparse-reward environments. SM-
MAE [11] enhances exploration in two ways. First, it introduces an intrinsic reward based on SMM. Second, it uses adaptive exploration, basing each agent's probability of choosing random actions on the correlation between agents: it predicts each agent's actions from the other agents' observations to measure this correlation and increases the probability of random actions when the correlation is too high.
B. EXPERIENCE EXPLOITATION IN MARL
While classic algorithms replay whole episode sequences during training, this can create a number of practical issues because of varying episode lengths and correlated states within trajectories. To address this, R2D2 [33] trains on fixed-length sequences of transitions, which overlap by half of their length and never cross episode boundaries. Although this approach overcomes some of the issues created by learning on whole episodes, it introduces the need to properly initialize hidden recurrent states during training. R2D2 proposes two strategies for this: storing the recurrent state in the replay buffer, and using a "burn-in" phase during training, i.e., using the first half of the training sequence only to initialize the recurrent states and applying the training objective to the second half of the sequence.
A number of methods also utilize prioritized experience replay (PER) [34]. Ape-X [35] suggests using the absolute TD error as the experience priority. R2D2 [33] and R2D3 [36] use a mixture of the maximum absolute TD error and the mean TD error over the sequence.
Most works focused on modifying experience replay consider the single-agent domain, though some adapt the concept to multi-agent tasks. MAC-PO [37] uses weights for a weighted error and samples training transitions from a uniform distribution. [38] adopts PER for multi-agent tasks, but without recurrence, setting priorities for each transition. A number of methods, such as QMIX [12] and its derivatives [10], [11], [13], uniformly sample full episodes for training. To the best of our knowledge, there are no works that consider modifying experience replay to use fixed-length sequences for training MARL approaches, with or without prioritization.
IV. METHOD
This section outlines the key methodological advancements
introduced in our research to enhance exploration and train-
ing efficiency in reinforcement learning. The scheme of the
proposed approach is sketched in Figure 1. We first present
a novel modification to the traditional ϵ-greedy policy, where
the exploration probability is dynamically adjusted based on
the count of available joint actions.
Following this, we describe our enhanced replay buffer
strategy designed to improve the training process’s efficiency
and stability. Instead of relying on full episodes, we utilize
overlapping sequences of fixed length, allowing for more
effective learning across episode boundaries.
a: Modification of ϵ-greedy policy.
The ϵ-greedy policy uses a constant probability ϵ of choosing random actions, which makes it hard to adapt the degree of exploration to the current situation in the environment. In some environments, the number of actions available to the agents may vary, and the number of agents may change during the episode as well. This varying number of available joint actions calls for dynamically adapting the extent of exploration in order to properly explore the environment states.

In contrast to the ϵ-greedy policy, where a constant value of ϵ is used, we compute the exploration probability ϵ from the count of available joint actions. We assume that the more joint actions U_t are available, the more intense the exploration required to find the optimal strategy. Following this reasoning, we compute the value of ϵ_t as

$$\epsilon_t = \tanh\!\left(\alpha \cdot \sqrt{\log |U_t|}\right), \qquad (3)$$

where |U_t| is the count of available joint actions at step t and α is a constant hyperparameter. We also set minimum and maximum boundaries ϵ_min and ϵ_max, so that ϵ_t always stays within reasonable limits.
Scaling the value of ϵ_t with the count of available joint actions makes it possible to adapt the exploration intensity to the current state. In some environments, such as SMAC [6], the number of actions available to the agents can vary greatly.
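As a minimal sketch of Equation (3) with the clipping described above, assuming a natural logarithm (the base is not specified in the text) and the default values from Table 2:

import math

def adaptive_epsilon(num_joint_actions, alpha=0.04, eps_min=0.05, eps_max=0.1):
    # epsilon_t = tanh(alpha * sqrt(log |U_t|)), clipped to [eps_min, eps_max]
    eps = math.tanh(alpha * math.sqrt(math.log(max(num_joint_actions, 1))))
    return min(max(eps, eps_min), eps_max)

For instance, with α = 0.04 and |U_t| = 18^10, the unclipped value is tanh(0.04 · √(10 ln 18)) ≈ 0.21, which is then clipped to ϵ_max.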
b: Replay buffer enhancement.
The replay buffer D stores a fixed number D of environment steps. Instead of training on full episodes, we train on sequences of steps of fixed length m, in order to reduce the dependency of the learning process on episode length. These sequences are not restricted by episode boundaries and may contain steps from different episodes. At the beginning of each training iteration, the recurrent state is initialized to zero. The first half of each sequence is used only to initialize the recurrent state, and the training objective is applied to the second half of the sequence. If a sequence contains parts of different episodes, the recurrent state is zero-initialized again at the step where the episodes switch. This approach helps decorrelate experiences during training, resulting in more diverse and representative training data and improving the stability and generalization of the learning process. In addition, to further reduce the dependency on environment characteristics, we run a training iteration after a fixed number of rollout time-steps has been performed.
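The sketch below illustrates the burn-in unrolling and the zero re-initialization of the recurrent state at episode boundaries. The agent_net interface (an init_hidden method and an (observation, hidden) → (Q-values, hidden) call) is an assumed, simplified one rather than a specific PyMARL class, and the shapes are illustrative.

import torch

def unroll_with_burn_in(agent_net, obs_seq, done_seq, burn_in):
    """Unroll a recurrent agent network over one fixed-length sequence.

    obs_seq:  [T, batch, obs_dim] observations of a single agent
    done_seq: [T, batch] 1.0 at steps where an episode ends inside the sequence
    burn_in:  number of leading steps used only to warm up the hidden state
    Returns Q-values for the training half of the sequence.
    """
    hidden = agent_net.init_hidden(obs_seq.size(1))  # zero-initialized hidden state
    q_values = []
    for t in range(obs_seq.size(0)):
        q_t, hidden = agent_net(obs_seq[t], hidden)
        # zero the recurrent state again when the sequence crosses an episode boundary
        hidden = hidden * (1.0 - done_seq[t]).unsqueeze(-1)
        if t >= burn_in:
            q_values.append(q_t)
    return torch.stack(q_values)  # [T - burn_in, batch, n_actions]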
FIGURE 1. The scheme of the modified QMIX approach presents three alternating phases of learning: (a) This phase highlights how data is presented in the Replay Buffer. For each state, the tuple consists of agents' observations. For clarity, additional information stored in the Replay Buffer (such as rewards, episode termination and truncation flags, selected actions, and action masking) is excluded. Unlike the original QMIX approach, our modifications involve sampling data slices across multiple episodes (rather than training on entire episodes). This forms a batch of a specific length, consisting of two parts: a warming-up sequence (used to better initialize the recurrent network's hidden state h_t) and a training sequence (used for actual training). (b) A batch of fixed size is sampled from the Replay Buffer and used to train the joint value function Q_tot via hyper-networks through the Mixing Network. This process also optimizes the agent networks using the global environment state, following the CTDE paradigm. Each agent has its own observation, but all agents share the same network weights. (c) This phase describes how the Replay Buffer is filled with new experiences. It also highlights our extension to the ϵ-greedy policy, where ϵ is adjusted based on the number of available joint actions |U_t| for the entire team population, as described in Equation (3).
The process of inserting data into the replay buffer is described in Algorithm 1. When we sample a new episode of length T, we need to store it in the replay buffer D. The maximum number of transitions stored in D is D, so if the size of D exceeds that limit, we remove the oldest transitions.
Algorithm 1: Inserting transitions into the replay buffer
Input: List of transitions D, buffer size D
Output: List of transitions D
1: Sample transition tuples ρ ← {(s_t, r_t, z^a_t, u^a_t, z^a_{t+1}) | a = 1, ..., n}, t = 0, ..., T−1
2: for each step t = 0, ..., T−1 do
3:   if size(D) = D then
4:     D ← D[1:]   // pop the oldest transition
5:   end if
6:   D ← concat(D, ρ_t)
7: end for
We sample training sequences following Algorithm 2. As we train the network on batches of size B, we uniformly choose B starting indices. Then, for each sampled index i, we put into the batch B a sequence that consists of m transitions, starting from index i up to index i + m. If the value i + m exceeds the current size of D, we select the transitions from i up to the last transition stored in D, and take the rest of the transitions starting from index 0 to fill the sequence up to size m.
Algorithm 2: Sampling transitions from the replay buffer
Input: List of transitions D, sequence size m, batch size B
Output: Batch of transitions B
1: B ← ()   // initialize the batch as an empty list
2: while size(B) < B do
3:   i ∼ U(0, size(D) − 1)   // uniformly sample the starting index of a sequence
4:   if i + m ≤ size(D) then
5:     b ← D[i : i + m]
6:   else
7:     b ← concat(D[i :], D[: i + m − size(D)])   // wrap around to fill the sequence up to length m
8:   end if
9:   B ← concat(B, b)
10: end while
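For concreteness, a compact Python sketch that corresponds to Algorithms 1 and 2 is shown below. A plain list is used for clarity; in practice a ring buffer or deque would avoid the O(n) eviction of the oldest transition.

import random

class SequenceReplayBuffer:
    """Transition-level buffer: stores steps from all episodes in one flat list
    and samples fixed-length sequences that may cross episode boundaries
    (Algorithms 1 and 2)."""

    def __init__(self, capacity):
        self.capacity = capacity   # D: maximum number of stored transitions
        self.storage = []          # flat list of transition tuples

    def insert_episode(self, transitions):
        # Algorithm 1: append new transitions, evicting the oldest ones if full.
        for rho_t in transitions:
            if len(self.storage) == self.capacity:
                self.storage.pop(0)
            self.storage.append(rho_t)

    def sample(self, batch_size, seq_len):
        # Algorithm 2: uniformly pick starting indices and cut sequences of
        # length seq_len, wrapping around the end of the buffer if needed.
        batch = []
        while len(batch) < batch_size:
            i = random.randint(0, len(self.storage) - 1)
            if i + seq_len <= len(self.storage):
                seq = self.storage[i:i + seq_len]
            else:
                overflow = i + seq_len - len(self.storage)
                seq = self.storage[i:] + self.storage[:overflow]
            batch.append(seq)
        return batch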
V. EXPERIMENTAL SETUP
In this section, we describe the environments used to evaluate the proposed modifications and the overall experimental setup. We consider three
environments: SMAC [6], SMACv2 [7] and POGEMA [39],
[40].
We use the QMIX [12] algorithm as the base method for
the proposed modifications. Our implementation is based on
PyMARL [6]. Following SMMAE [11], we replaced the RMSProp [41] optimizer with the Adam [42] optimizer, using its default hyper-parameters.

TABLE 2. The hyper-parameters of the proposed modifications and the base version of the QMIX algorithm.

Hyper-parameter                      Value
Sequence length                      128 steps
Length of the burn-in phase          64 steps
ϵ anneal time                        50000 steps
ϵ finish                             0.05
ϵ_min                                0.05
ϵ_max                                0.1
Learning rate                        0.001
α                                    0.04
γ                                    0.99
GRU hidden size                      64
Mixing network size                  32
Mixer's hypernet layers              2
Mixer's hypernet hidden size         64
Optimizer                            Adam
Improved Replay Buffer (w/ RB)
  Size for SMAC                      200000 steps
  Size for POGEMA                    800000 steps
  Target update interval             10000 steps
Default Replay Buffer (w/o RB)
  Size for SMAC                      5000 episodes
  Size for POGEMA                    1000 episodes
  Target update interval             200 episodes

At the beginning of training, we linearly anneal ϵ from 1.0
to 0.05 over 50,000 steps. For methods without the replay
buffer modification, the buffer size is set to 5,000 episodes,
and after sampling a new episode from the environment, we
select a batch of 32 episodes for training. For methods with
the replay buffer modification, for every 128 steps sampled
from the environment, we select a batch of 64 sequences for
training. Each sequence is 128 steps long, with 64 steps used
for the “burn-in” phase and the remaining 64 steps used for
training. In the POGEMA environment, the replay buffer size
is set to 1,000 episodes or 800,000 steps. The agents’ network
architecture is the same as that of QMIX, with a GRU [43]
recurrent layer having a 64-dimensional hidden state. The
target network is updated every 200 training episodes for
methods without the replay buffer modification, and every
10,000 steps for methods with the replay buffer modification.
The complete hyper-parameters setup is shown in Table 2.
The hyperparameters of QMIX, SMMAE, QPLEX [13]
and WQMIX [16] were selected based on their default im-
plementations. Hyperparameters specific to our approach, including the value of α, ϵ_min, ϵ_max, the sequence length, and the target update interval, were determined through preliminary experiments. We selected the value of α within the range of 0.01 to 0.05, the sequence length from 16 to 256, and the target update interval from 2,500 to 40,000; ϵ_min was set to 0.05, and ϵ_max was chosen from a range of 0.1 to 0.5.
FIGURE 2. Screenshot examples of SMAC scenarios from the StarCraft II game: (left) Corridor; (middle) 2c_vs_64zg; (right) MMM2. In all scenarios, the red
units are controlled by RL agents, while the blue units are controlled by the game’s built-in AI. The RL agents are trained jointly during learning but make
independent decisions during testing, following the centralized training, decentralized execution paradigm. In the Corridor scenario, the RL-controlled
zealots must coordinate their movement towards the lower left corner of the map, where a narrow corridor is located. This positioning allows them to
defeat a large number of zerg units. In the 2c_vs_64zg scenario, the RL agents control two Colossus, exploiting the units’ ability to traverse high ground to
defeat a large number of zergs. In the MMM2 scenario, a group of 2 Marauders, 7 Marines, and 1 Medivac, controlled by RL agents, attempt to defeat a
larger group consisting of 3 Marauders, 8 Marines, and 1 Medivac.
FIGURE 3. Illustration of a random POGEMA map with a size of 16 × 16 and a population of 16 agents. The agents are represented as colored filled circles, and their targets are shown as circles of the same color. Each agent has a single, unique target. The first image shows the task from the agent's perspective, where the rectangular area represents the field of view. We study a lifelong scenario, in which an agent, upon reaching a target, is immediately assigned a new one. POGEMA is a challenging environment that requires high generalization abilities from RL algorithms, as the placement of obstacles can vary. During testing, we used scenarios different from the training set.
A. SMAC AND SMACV2 ENVIRONMENTS
The SMAC [6] benchmark is focused on micromanagement tasks from the popular game StarCraft II, where each unit is controlled by a separate agent in order to defeat the opponent army controlled by the game's built-in scripted AI. The game is considered won if the agents manage to defeat every enemy unit within the time limit, and the quality metric is the win rate.
Initially, following SMMAE [11], we selected three scenarios
for the experiments on SMAC: 6h_vs_8z, 2c_vs_64zg, and
corridor. However, for 6h_vs_8z, we observed that the agent
learned to exploit the reward system. The enemies had shields
that regenerated over time, and under the standard settings in
SMAC, agents were rewarded for regeneration of enemies’
shields as if it were damage. As a result, it was more ad-
vantageous for agents to damage the shields and then retreat
out of the enemies' line of sight, which led to a low win rate. As this was a known issue that was not planned to be fixed2, we decided to replace the 6h_vs_8z map with the MMM2 map, which is also considered "super-hard". Figure 2 includes
2https://github.com/oxwhirl/smac/issues/72
screenshots of the selected maps. The version of StarCraft II
used for the evaluation is SC2.4.10 (B75689).
Among the chosen scenarios, the total number of actions available to each agent varies from 18 to 70, and the number of agents varies from 2 to 10. Given that, the maximum theoretical number of unique joint actions is 18^10, although in actual experiments many actions are often unavailable.
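The estimate above implies that |U_t| is computed as the product of the per-agent available-action counts. A small sketch under that assumption, using the per-agent action availability mask that SMAC exposes:

import numpy as np

def count_available_joint_actions(avail_actions):
    # avail_actions: binary mask of shape [n_agents, n_actions], 1 = available
    per_agent = avail_actions.sum(axis=1)
    # dead units may expose only a no-op action; guard against zero counts
    return int(np.prod(np.maximum(per_agent, 1), dtype=np.int64))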
While SMAC remains a popular benchmark, it has lim-
itations, such as fixed starting positions and unit types. It
has been shown [7] that certain methods can learn only se-
quences of actions, disregarding observations, yet still solve
some SMAC scenarios. To address these limitations, the
SMACv2 [7] environment was introduced. SMACv2 utilizes
procedural content generation, allowing agents trained with
these methods to achieve better generalization and solve a
wider range of scenarios.
We selected three SMACv2 scenarios for evaluation:
zerg_10_vs_11, protoss_10_vs_11, and terran_10_vs_11.
For these scenarios, we used the default parameters, including
unit type distributions and starting positions, as described in
the original SMACv2 publication.
FIGURE 4. Comparison of the mean test win rate of proposed modifications with QMIX and SMMAE algorithms on SMAC. QMIX-RB-P stands for QMIX
with replay buffer and exploration policy modifications; QMIX-RB stands for QMIX with replay buffer modification, and QMIX-P stands for QMIX with
exploration policy modification. Plots show the mean and 95% confidence interval across five runs. For the corridor scenario, algorithms with enhanced
exploration – QMIX-P, QMIX-RB-P and SMMAE – have better performance compared with QMIX; replay buffer modification also improves the results, and
the best performance is achieved by QMIX-RB-P. For the 2c_vs_64zg scenario, replay buffer modification leads to worse performance, and the other
algorithms have similar results. In the MMM2 scenario, QMIX-RB-P has the steepest learning curve, with similar final performance across algorithms.
B. POGEMA ENVIRONMENT
POGEMA [39] is a grid-based multi-agent pathfinding envi-
ronment where multiple agents must navigate to their respec-
tive goals, with each goal considered reached when an agent
steps on it. The task of decentralized multi-agent pathfinding
is particularly challenging, as highlighted by several special-
ized methods [44]–[48]. The example of POGEMA environ-
ment is shown in Figure 3.
We consider the LifeLong scenario, in which a new goal is assigned to an agent as soon as it accomplishes its current one. There are obstacles on the map, and agents cannot pass through a cell occupied by another agent, which necessitates cooperative behavior to maximize rewards. We use randomly generated maps for training, which means that the agents'
training goal is not to memorize the map, but to be able to
adapt to a new layout and find the way to the goal on an
unknown map. The quality metric used is throughput, i.e., the
ratio of the number of the accomplished goals (by all agents)
to the episode length.
For this experiment, we do not consider the exploration policy modification, as it depends on the current number of available actions, which is constant in the POGEMA environment; using the modified exploration policy would therefore still result in a constant value of ϵ.
VI. EXPERIMENTAL RESULTS
In this section, we present our experimental results on the SMAC [6], SMACv2 [7], and POGEMA [39] benchmarks. As we study the effectiveness of the QMIX [12] method with the proposed exploration policy and replay buffer modifications, we consider the version of QMIX with both modifications, QMIX with only the replay buffer modification, and QMIX with only the exploration policy modification. The source code of these methods is available online3.
3https://github.com/tolyan3212/re-qmix
A. COMPARISON ON SMAC
To study the impact of the modified exploration policy and the replay buffer modification, we run experiments on different SMAC [6] scenarios and compare the results with QMIX [12] and SMMAE [11]. Here, SMMAE is a specialized approach that enhances the exploration abilities of QMIX. We also conduct ablation experiments to study the impact of each modification separately: the QMIX version with the replay buffer modification is called QMIX-RB, the version with the exploration policy modification is called QMIX-P, and QMIX with both modifications is called QMIX-RB-P.
Following [11], in our experiments we use QMIX with the Adam optimizer with default hyperparameters. The ϵ anneal time is 50000 steps, and the exploration hyperparameter α is set to 0.04 for corridor and 0.02 for 2c_vs_64zg and MMM2.
The results of the experiments on SMAC are shown in Figure 4. On the super-hard corridor scenario, the poor results of QMIX compared to the other algorithms indicate that additional exploration significantly improves learning performance. SMMAE and QMIX with the adaptive ϵ achieve similar results, and QMIX with both modifications performs slightly better. On the hard 2c_vs_64zg scenario, the replay buffer modification results in worse learning performance, while QMIX, QMIX with the modified exploration policy, and SMMAE achieve similar results. We believe that the performance decrease of the replay buffer modification in this scenario may be due to inappropriate hyperparameters, as we did not adjust the parameters of the replay buffer modification for this specific scenario. On the super-hard MMM2 scenario, the algorithm with both the exploration policy and replay buffer modifications has a slightly steeper learning curve, though the final win rates of the algorithms are almost identical.
A comparison of the proposed algorithm with SMMAE, which uses additional attention-based and VAE modules to enhance exploration, shows that it is possible to achieve similar exploration effectiveness with a modification of the ϵ-greedy policy that is simple in terms of both computation and implementation.
FIGURE 5. Comparison of the mean test win rate of proposed modifications with state-of-the-art MARL algorithms QPLEX and WQMIX on SMAC.
QMIX-RB-P stands for QMIX with replay buffer and exploration policy modifications; QMIX-RB stands for QMIX with replay buffer modification, and
QMIX-P stands for QMIX with exploration policy modification. Plots show the mean and 95% confidence interval across five runs. The corridor scenario
proved to be too challenging for the QPLEX and WQMIX algorithms. In the 2c_vs_64zg scenario, CW-QMIX performed relatively close to QMIX-RB and QMIX-RB-P, while QMIX-P and QPLEX showed the best results. In the MMM2 scenario, QPLEX showed the worst results and CW-QMIX performed slightly worse than QMIX-RB and QMIX-P, while QMIX-RB-P and OW-QMIX achieved the best results.
FIGURE 6. Comparison of the mean test win rate of the proposed modifications with the QMIX and SMMAE algorithms on SMACv2. QMIX-RB-P
represents QMIX with replay buffer and exploration policy modifications, QMIX-RB represents QMIX with the replay buffer modification, and QMIX-P
represents QMIX with the exploration policy modification. The plots display the mean and 95% confidence interval across five runs. On zerg_10_vs_11,
algorithms with the replay buffer modification demonstrate significantly better performance compared to other algorithms. On protoss_10_vs_11, the
results of all algorithms differ insignificantly. On terran_10_vs_11, algorithms with the replay buffer modification show slightly better performance than
those without it, while the exploration policy modification slightly reduces performance; consequently, the QMIX-RB algorithm achieves the best results.
B. ADDITIONAL COMPARISON ON SMAC
We also compared the proposed modifications with state-of-the-art value-based MARL algorithms, namely QPLEX [13] and WQMIX [16], which extend the idea of QMIX in other ways rather than by enhancing its exploration.
The results are presented in Figure 5. The implementation of the QPLEX algorithm used in the experiments was provided by the repository4, and the implementation of WQMIX by the repository5.
According to the results, the corridor map proved to be
challenging for both QPLEX and WQMIX algorithms to
solve during the training period. QPLEX showed competitive
results on the 2c_vs_64zg map but underperformed on the
MMM2 scenario. OW-QMIX performed similarly to QMIX-
RB-P on MMM2 and had a steep learning curve on the
2c_vs_64zg map, although its final results on that map were
4https://github.com/wjh720/QPLEX
5https://github.com/hijkzzz/pymarl2
worse than those of QPLEX and QMIX-P. Overall, CW-
QMIX demonstrated slightly lower performance compared to
the other algorithms.
In summary, while algorithms such as QPLEX and
WQMIX may outperform QMIX with the proposed modifi-
cations on certain scenarios, their performance is less stable
across various scenarios, and these methods struggle to suc-
ceed in more challenging scenarios, such as the corridor.
C. COMPARISON ON SMACV2
To further investigate the impact of the proposed modifica-
tions, we conducted experiments in the SMACv2 environ-
ment, comparing our approach with QMIX and SMMAE.
For this purpose, we selected three SMACv2 scenarios:
zerg_10_vs_11, protoss_10_vs_11, and terran_10_vs_11, us-
ing the default configuration. As outlined in the previous
sections, QMIX with the replay buffer modification is ab-
breviated as QMIX-RB, QMIX with the exploration policy
modification as QMIX-P, and QMIX with both modifications
as QMIX-RB-P.
The evaluation results on SMACv2 are shown in Figure 6.
The selected scenarios are asymmetric, meaning the enemy
has additional units, placing the agents at a disadvantage.
Consequently, these scenarios are particularly challenging
for the algorithms. On zerg_10_vs_11, QMIX, QMIX-P, and
SMMAE perform relatively similarly, while the algorithms
with the replay buffer modification (QMIX-RB and QMIX-
RB-P) demonstrate significantly better performance. On pro-
toss_10_vs_11, the performance of all algorithms is relatively
similar. On terran_10_vs_11, the modified replay buffer
slightly improves results, while the modified exploration pol-
icy slightly reduces performance. SMMAE performs compa-
rably to QMIX, and the best performance on this scenario is
achieved by QMIX-RB.
In summary, algorithms with increased exploration, such
as SMMAE, QMIX-P, and QMIX-RB-P, without specific
parameter adjustments, perform the same or worse than their
counterparts using the plain epsilon-greedy policy (QMIX
and QMIX-RB). This suggests that the exploration enforce-
ment provided by SMMAE and the modified exploration pol-
icy is too aggressive for the SMACv2 environment, where the
default epsilon-greedy policy provides sufficient exploration.
Conversely, the replay buffer modification improves perfor-
mance in the zerg_10_vs_11 and terran_10_vs_11 scenarios,
while having minimal impact on protoss_10_vs_11.
D. COMPARISON ON POGEMA
To further study the impact of the replay buffer modifica-
tion, we conduct experiments in the POGEMA environment with a large episode length of 1000 steps and a batch size of 64. The goal of this experiment is to simulate a situation in which episodes are very long and contain an amount of information that is hard to process during training.
FIGURE 7. Comparison of the mean average throughput on POGEMA with
large episodes. QMIX-RB stands for QMIX with replay buffer modification;
QMIX-ET stands for QMIX with early training start. Plots show the mean
and 95% confidence interval across five runs. QMIX-RB starts quickly
improving its performance from the beginning of training, achieving the
best results. QMIX-ET has a steeper learning curve compared to QMIX, but
it still takes a lot of time to achieve competitive results. Please note that
the average throughput can exceed 1.0.
We compared QMIX with the proposed replay buffer modification against two other versions of QMIX. The first is a standard QMIX implementation, in which training starts once enough episodes have been sampled to form a full batch of the given size. The second version has no restriction on the start of training, so training starts right after the first episode is sampled, and the batch size grows with each newly sampled episode up to 64. The results of this comparison are shown in Figure 7.

TABLE 3. The average throughput of the trained models evaluated on random POGEMA environment maps with varying agent counts. Throughput was averaged over 500 episodes for each method. QMIX-ET refers to QMIX with an early training start; QMIX-RB denotes QMIX with the replay buffer modification. The QMIX-RB algorithm demonstrated better performance across all scenarios.

                    Average Throughput ↑
Number of Agents    QMIX     QMIX-ET    QMIX-RB
4                   0.311    0.228      0.350
8                   0.492    0.425      0.612
16                  0.705    0.632      0.811
32                  0.724    0.684      1.103
64                  0.393    0.295      0.890
In this experiment, we use QMIX with standard parameters
in an environment featuring long episodes. This leads to
poor training performance because a large amount of data
is sampled for each batch. We attempted to address this
issue by removing the limitation on the number of sampled
episodes in the replay buffer, but the results were still worse
compared to the proposed QMIX-RB approach. In contrast, our
algorithm demonstrated strong performance without task-
specific parameter tuning, highlighting its reduced depen-
dency on environment characteristics such as episode length.
This suggests that the proposed replay buffer modification
simplifies hyperparameter selection, including batch size, by
mitigating episode length dependence.
We also evaluated the trained models on random maps
with varying agent counts to test their generalization ability.
The evaluation results are presented in Table 3. Initially,
as the number of agents increases, the average throughput
also increases because more agents are available to reach
their goals. However, when the number of agents becomes
too high, the map becomes overcrowded, making it more
difficult for agents to reach their goals, leading to a decrease
in throughput. Across all scenarios, QMIX-RB demonstrates
better performance, indicating that it has learned to generalize
more effectively.
VII. CONCLUSION AND LIMITATIONS
In this paper, we have investigated the impact of a modified
exploration policy and replay buffer modification in the con-
text of cooperative MARL, using the SMAC, SMACv2, and POGEMA environments as benchmarks. Our results show that these
modifications can significantly improve the performance of
the QMIX algorithm without introducing significant com-
plexity. Across seven different scenarios in three distinct
environments, our proposed approach outperforms other al-
gorithms in four scenarios while remaining competitive in
the other three. Our enhancements provide a streamlined
alternative to complex MARL methods, achieving results
comparable to state-of-the-art methods with minimal changes
to the original algorithm, thereby simplifying the adaptation
process for diverse multi-agent environments.
A promising direction for future research is the study of
prioritized replay buffers for MARL, with our proposed mod-
ification serving as a potential first step in this area. Addition-
ally, we see large-scale MARL setups, in which agents train for over 1 billion steps in the environment, as an underexplored area within the community.
A potential limitation of our work is the reliance on the cen-
tralized training and decentralized execution paradigm, which
may be restrictive in settings where full state information is
unavailable or the state space is large. While our current ap-
proach does not explicitly use local views, it can be adapted to
decentralized settings by leveraging local agent actions, mak-
ing it flexible for broader applications. Additionally, our re-
play buffer modification can be used independently of CTDE
and is compatible with decentralized approaches like Inde-
pendent Q-Learning.
A further limitation is the lack of formal theoretical guar-
antees regarding the exploration component. This could be
addressed by framing the exploration process as a multi-
armed bandit problem, where the number of available arms
changes at each time step. While this theoretical perspective
could provide more rigorous guarantees, fully developing it
within the scope of this paper would be challenging.
REFERENCES
[1] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu,
‘‘The surprising effectiveness of ppo in cooperative multi-agent games,’’
Advances in Neural Information Processing Systems, vol. 35, pp. 24611–
24 624, 2022.
[2] A. Skrynnik, A. Yakovleva, V. Davydov, K. Yakovlev, and A. I.
Panov, ‘‘Hybrid Policy Learning for Multi-Agent Pathfinding,’’ IEEE
Access, vol. 9, pp. 126 034–126 047, 2021. [Online]. Available: https:
//ieeexplore.ieee.org/document/9532001/
[3] E. Vinitsky, N. Lichtlé, X. Yang, B. Amos, and J. Foerster, ‘‘Nocturne:
a scalable driving benchmark for bringing multi-agent learning one step
closer to the real world,’’ Advances in Neural Information Processing
Systems, vol. 35, pp. 3962–3974, 2022.
[4] P. Leroy, P. G. Morato, J. Pisane, A. Kolios, and D. Ernst, ‘‘Imp-marl: a
suite of environments for large-scale infrastructure management planning
via marl,’’ Advances in Neural Information Processing Systems, vol. 36,
2024.
[5] X. Pan, M. Liu, F. Zhong, Y. Yang, S.-C. Zhu, and Y. Wang, ‘‘Mate:
Benchmarking multi-agent reinforcement learning in distributed target
coverage control,’’ Advances in Neural Information Processing Systems,
vol. 35, pp. 27 862–27 879, 2022.
[6] M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G.
Rudner, C.-M. Hung, P. H. Torr, J. Foerster, and S. Whiteson, ''The StarCraft multi-agent challenge,'' arXiv preprint arXiv:1902.04043, 2019.
[7] B. Ellis, J. Cook, S. Moalla, M. Samvelyan, M. Sun, A. Mahajan, J. Foer-
ster, and S. Whiteson, ‘‘Smacv2: An improved benchmark for cooperative
multi-agent reinforcement learning,’’ Advances in Neural Information Pro-
cessing Systems, vol. 36, 2024.
[8] Y. Song, H. Jiang, H. Zhang, Z. Tian, W. Zhang, and J. Wang, ‘‘Boost-
ing studies of multi-agent reinforcement learning on google research
football environment: the past, present, and future,’’ arXiv preprint
arXiv:2309.12951, 2023.
[9] T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, P. Liu, and Z. Wang,
‘‘Exploration in deep reinforcement learning: a comprehensive survey,’’
arXiv preprint arXiv:2109.06668, 2021.
[10] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson, ‘‘Maven: Multi-
agent variational exploration,’’ Advances in neural information processing
systems, vol. 32, 2019.
[11] S. Zhang, J. Cao, L. Yuan, Y. Yu, and D.-C. Zhan, ‘‘Self-motivated multi-
agent exploration,’’ in Proceedings of the 2023 International Conference
on Autonomous Agents and Multiagent Systems, 2023, pp. 476–484.
[12] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and
S. Whiteson, ‘‘Monotonic value function factorisation for deep multi-agent
reinforcement learning,’’ The Journal of Machine Learning Research,
vol. 21, no. 1, pp. 7234–7284, 2020.
[13] J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, ‘‘Qplex: Duplex dueling
multi-agent q-learning,’’ in International Conference on Learning Repre-
sentations, 2020.
[14] H. M. R. U. Rehman, B.-W. On, D. D. Ningombam, S. Yi, and G. S.
Choi, ‘‘Qsod: Hybrid policy gradient for deep multi-agent reinforcement
learning,’’ IEEE Access, vol. 9, pp. 129 728–129 741, 2021.
[15] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, ‘‘Qtran: Learning
to factorize with transformation for cooperative multi-agent reinforcement
learning,’’ in International conference on machine learning. PMLR, 2019,
pp. 5887–5896.
[16] T. Rashid, G. Farquhar, B. Peng, and S. Whiteson, ‘‘Weighted qmix:
Expanding monotonic value function factorisation for deep multi-agent
reinforcement learning,’’ Advances in neural information processing sys-
tems, vol. 33, pp. 10 199–10 210, 2020.
[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Belle-
mare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al.,
‘‘Human-level control through deep reinforcement learning,’’ nature, vol.
518, no. 7540, pp. 529–533, 2015.
[18] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jader-
berg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., ‘‘Value-
decomposition networks for cooperative multi-agent learning based on
team reward,’’ in Proceedings of the 2018 International Conference on Au-
tonomous Agents and Multiagent Systems, vol. 3. ASSOC COMPUTING
MACHINERY, 2018, pp. 2085–2087.
[19] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, ‘‘Deep exploration via bootstrapped DQN,’’ Advances in Neural Information Processing Systems, vol. 29, 2016.
[20] W. Dabney, G. Ostrovski, and A. Barreto, ‘‘Temporally-extended ϵ-greedy exploration,’’ arXiv preprint arXiv:2006.01782, 2020.
[21] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, ‘‘Curiosity-driven exploration by self-supervised prediction,’’ in International Conference on Machine Learning. PMLR, 2017, pp. 2778–2787.
[22] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, ‘‘VIME: Variational information maximizing exploration,’’ Advances in Neural Information Processing Systems, vol. 29, 2016.
[23] C. Bai, P. Liu, K. Liu, L. Wang, Y. Zhao, L. Han, and Z. Wang, ‘‘Variational dynamic for self-supervised exploration in deep reinforcement learning,’’ IEEE Transactions on Neural Networks and Learning Systems, 2021.
[24] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, ‘‘Exploration by random network distillation,’’ in International Conference on Learning Representations, 2018.
[25] L. Lee, B. Eysenbach, E. Parisotto, E. Xing, S. Levine, and R. Salakhut-
dinov, ‘‘Efficient exploration via state marginal matching,’’ arXiv preprint
arXiv:1906.05274, 2019.
[26] A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski,
O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt et al., ‘‘Never give up:
Learning directed exploration strategies,’’ in International Conference on
Learning Representations, 2019.
[27] A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, Z. D. Guo, and C. Blundell, ‘‘Agent57: Outperforming the Atari human benchmark,’’ in International Conference on Machine Learning. PMLR, 2020, pp. 507–517.
[28] G. Berseth, D. Geng, C. M. Devin, N. Rhinehart, C. Finn, D. Jayaraman, and S. Levine, ‘‘SMiRL: Surprise minimizing reinforcement learning in unstable environments,’’ in International Conference on Learning Representations, 2020.
[29] Y. Du, L. Han, M. Fang, J. Liu, T. Dai, and D. Tao, ‘‘LIIR: Learning individual intrinsic reward in multi-agent reinforcement learning,’’ Advances in Neural Information Processing Systems, vol. 32, 2019.
[30] L. Zheng, J. Chen, J. Wang, J. He, Y. Hu, Y. Chen, C. Fan, Y. Gao, and
C. Zhang, ‘‘Episodic multi-agent reinforcement learning with curiosity-
driven exploration,’’ Advances in Neural Information Processing Systems,
vol. 34, pp. 3757–3769, 2021.
[31] T. Wang, J. Wang, Y. Wu, and C. Zhang, ‘‘Influence-based multi-agent
exploration,’’ in International Conference on Learning Representations,
2019.
[32] I.-J. Liu, U. Jain, R. A. Yeh, and A. Schwing, ‘‘Cooperative exploration for
multi-agent deep reinforcement learning,’’ in International Conference on
Machine Learning. PMLR, 2021, pp. 6826–6836.
[33] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, ‘‘Recurrent experience replay in distributed reinforcement learning,’’ in International Conference on Learning Representations, 2018.
[34] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, ‘‘Prioritized experience
replay,’’ arXiv preprint arXiv:1511.05952, 2015.
[35] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Has-
selt, and D. Silver, ‘‘Distributed prioritized experience replay,’’ arXiv
preprint arXiv:1803.00933, 2018.
[36] T. L. Paine, C. Gulcehre, B. Shahriari, M. Denil, M. Hoffman, H. Soyer,
R. Tanburn, S. Kapturowski, N. Rabinowitz, D. Williams et al., ‘‘Making
efficient use of demonstrations to solve hard exploration problems,’’ arXiv
preprint arXiv:1909.01387, 2019.
[37] Y. Mei, H. Zhou, T. Lan, G. Venkataramani, and P. Wei, ‘‘MAC-PO: Multi-agent experience replay via collective priority optimization,’’ arXiv preprint arXiv:2302.10418, 2023.
[38] Y. Wang and Z. Zhang, ‘‘Experience selection in multi-agent deep rein-
forcement learning,’’ in 2019 IEEE 31st International Conference on Tools
with Artificial Intelligence (ICTAI). IEEE, 2019, pp. 864–870.
[39] A. Skrynnik, A. Andreychuk, A. Borzilov, A. Chernyavskiy, K. Yakovlev, and A. Panov, ‘‘POGEMA: A benchmark platform for cooperative multi-agent navigation,’’ arXiv preprint arXiv:2407.14931, 2024.
[40] A. Skrynnik, A. Andreychuk, K. Yakovlev, and A. I. Panov, ‘‘POGEMA: Partially observable grid environment for multiple agents,’’ arXiv preprint arXiv:2206.10944, 2022.
[41] T. Tieleman and G. Hinton, ‘‘Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude,’’ COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, p. 26, 2012.
[42] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ arXiv preprint arXiv:1412.6980, 2014.
[43] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, ‘‘Empirical evaluation of
gated recurrent neural networks on sequence modeling,’’ arXiv preprint
arXiv:1412.3555, 2014.
[44] A. Andreychuk, K. Yakovlev, A. Panov, and A. Skrynnik, ‘‘MAPF-GPT: Imitation learning for multi-agent pathfinding at scale,’’ arXiv preprint arXiv:2409.00134, 2024.
[45] Y. Wang, B. Xiang, S. Huang, and G. Sartoretti, ‘‘SCRIMP: Scalable communication for reinforcement- and imitation-learning-based multi-agent pathfinding,’’ in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 9301–9308.
[46] A. Skrynnik, A. Andreychuk, K. Yakovlev, and A. Panov, ‘‘Decentralized Monte Carlo tree search for partially observable multi-agent pathfinding,’’ in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 531–17 540.
[47] A. Skrynnik, A. Andreychuk, K. Yakovlev, and A. I. Panov, ‘‘When to
switch: planning and learning for partially observable multi-agent pathfind-
ing,’’ IEEE Transactions on Neural Networks and Learning Systems, 2023.
[48] G. Sartoretti, J. Kerr, Y. Shi, G. Wagner, T. S. Kumar, S. Koenig, and H. Choset, ‘‘PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning,’’ IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2378–2385, 2019.
ANATOLII BORZILOV earned an M.S. degree in
Computer Science from MIREA – Russian Tech-
nological University, Moscow, Russia, in 2024.
That same year, he began his Ph.D. studies at
Moscow Institute of Physics and Technology,
Moscow, Russia.
Since 2023, he has been contributing as a Ju-
nior Researcher at the Cognitive Dynamic System
Laboratory at the Moscow Institute of Physics and
Technology, Moscow, Russia, focusing on rein-
forcement learning and multi-agent systems.
ALEXEY SKRYNNIK received his M.S. degree
in computer science from Rybinsk State Aviation
Technical University, Rybinsk, Russia, in 2017. He
defended his Ph.D. thesis in the field of artificial
intelligence and machine learning in 2023.
Since 2021, he has been working as a senior
researcher at the AIRI institute in the Cognitive AI
Systems laboratory. His current research focuses
on reinforcement learning, learning and planning,
and multi-agent systems.
ALEKSANDR I. PANOV earned an M.S. in
Computer Science from the Moscow Institute of
Physics and Technology, Moscow, Russia, in 2011,
and a Ph.D. in Theoretical Computer Science from
the Institute for Systems Analysis, Moscow, Rus-
sia, in 2015.
Since 2010, he has been a research fellow with
the Federal Research Center ‘‘Computer Science
and Control’’ of the Russian Academy of Sciences.
Since 2018, he has headed the Cognitive Dynamic
System Laboratory at the Moscow Institute of Physics and Technology,
Moscow, Russia. He has authored three books and more than 100 research papers.
In 2021, he joined the research group on Neurosymbolic Integration at the
Artificial Intelligence Research Institute. His academic focus areas include
behavior planning, reinforcement learning, embodied AI, and cognitive
robotics.