Available online at www.sciencedirect.com
Procedia Computer Science 177 (2020) 324–329
www.elsevier.com/locate/procedia
https://doi.org/10.1016/j.procs.2020.10.043

The 11th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2020)
November 2-5, 2020, Madeira, Portugal
Ubiquitous Distributed Deep Reinforcement Learning at the Edge:
Analyzing Byzantine Agents in Discrete Action Spaces
Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, Tomi Westerlund
Turku Intelligent Embedded and Robotic Systems Lab, University of Turku, Finland
Corresponding author: Wenshuai Zhao. E-mail address: wezhao@utu.fi
Abstract
The integration of edge computing in next-generation mobile networks is bringing low-latency and high-bandwidth ubiquitous
connectivity to a myriad of cyber-physical systems. This will further boost the increasing intelligence that is being embedded at
the edge in various types of autonomous systems, where collaborative machine learning has the potential to play a significant role.
This paper discusses some of the challenges in multi-agent distributed deep reinforcement learning that can occur in the presence
of byzantine or malfunctioning agents. As the simulation-to-reality gap gets bridged, the probability of malfunctions or errors must
be taken into account. We show how wrong discrete actions can significantly affect the collaborative learning effort. In particular,
we analyze the effect of having a fraction of agents that might perform the wrong action with a given probability. We study the
ability of the system to converge towards a common working policy through the collaborative learning process based on the number
of experiences from each of the agents to be aggregated for each policy update, together with the fraction of wrong actions from
agents experiencing malfunctions. Our experiments are carried out in a simulation environment using the Atari testbed for the
discrete action spaces, and advantage actor-critic (A2C) for the distributed multi-agent training.
©2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Reinforcement Learning; Edge Computing; Multi-Agent Systems; Collaborative Learning; RL; Deep RL; Adversarial RL;
1. Introduction
The edge computing paradigm is bringing higher degrees of intelligence to connected cyber-physical systems across multiple domains. This intelligence is in turn enabled by lightweight deep learning (DL) models deployed at the edge for real-time computation. Among the multiple DL approaches, reinforcement learning (RL) has been increasingly adopted in various types of cyber-physical systems over the past decade, and, in particular, multi-agent RL [1].
Fig. 1: Conceptual view of the system architecture proposed in this paper. Multiple agents collaborate towards learning a common task, with the deep neural network updates being synchronized at the edge layer (synchronous experience upload of data and rewards, common policy download, and periodic weight updates between the edge and cloud backends). Each agent, however, individually explores its environment, gathering experience in a series of episodes and calculating the corresponding rewards.
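The synchronous aggregation loop of Fig. 1 can be summarized in a few lines of code. The following Python sketch is only an illustration of the data flow under the settings used later in the paper (16 workers, 5 steps per update); all names (edge_backend, agents, collect_experience, and so on) are hypothetical and do not correspond to the actual experiment code.

```python
# Minimal sketch of the synchronous edge-aggregation loop in Fig. 1.
# All objects and methods (edge_backend, agents, collect_experience, ...)
# are hypothetical placeholders, not the code used in the experiments.

NUM_AGENTS = 16        # parallel workers, as in the experiments of Section 4
STEPS_PER_UPDATE = 5   # environment steps each agent takes between updates

def training_loop(edge_backend, agents, total_steps=10_000_000):
    steps_done = 0
    while steps_done < total_steps:
        # 1. Every agent rolls out the same current policy locally.
        batches = [agent.collect_experience(STEPS_PER_UPDATE) for agent in agents]
        steps_done += NUM_AGENTS * STEPS_PER_UPDATE

        # 2. Experiences (observations, actions, rewards) are uploaded
        #    synchronously and aggregated at the edge backend.
        aggregated = edge_backend.aggregate(batches)

        # 3. A single A2C-style policy-gradient update is computed
        #    from the aggregated data (see Section 3).
        edge_backend.update_policy(aggregated)

        # 4. The updated common policy is downloaded by all agents.
        weights = edge_backend.get_weights()
        for agent in agents:
            agent.set_weights(weights)
```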
Deep reinforcement learning (DRL) algorithms are motivated by the way natural learning happens: through trial and error, learning from experiences based on the performance outcome of different actions. Among other fields, DRL algorithms have had success in robotic manipulation [2], but also in finding better solutions to the complex multi-dimensional problems involved in edge computing [3].
We are particularly interested in how edge computing can enable more efficient real-time distributed and collaborative multi-agent RL. When discussing multi-agent DRL, two different views emerge from the literature: those where multiple agents are utilized to improve the learning process (e.g., faster learning through parallelization [4], higher diversity by means of exploration of different environments [5], or increased robustness with redundancy [1]), and those where multiple agents are learning a policy emerging from an interactive behavior (e.g., formation control algorithms [6], or collision avoidance [7]).
This work explores a scenario where multiple agents are collaboratively learning towards the same task, which is often individual, as illustrated in Fig. 1. Deep reinforcement learning has been identified as one of the key areas that will see wider adoption within the Internet of Things (IoT) owing to the rise of edge computing and 5G-and-beyond connectivity [8]. Multiple application fields can benefit from this synergy in different domains, from robotic agents in the industrial IoT, to information fusion in wireless sensor networks, and including all types of connected autonomous systems and other cyber-physical systems.
Multiple challenges still exist in distributed multi-agent DRL. Among the most relevant ones within the scope of this paper are the development of novel techniques to increase robustness in the presence of adversarial agents or perturbed environments [9, 10, 11], as well as closing the simulation-to-reality gap [2, 12]. In relation to the former area, recent works have focused on exploring different types of noise or perturbations in the agents or environments to better understand how these potentially adversarial conditions affect the collaborative learning process. For example, Gu et al. study in [13] the effect of network delays and propose an asynchronous method for off-policy updates, while Yu et al. have studied the effect of adversarial conditions in the network connection between the agents [14]. We have seen a lack of research, however, on the analysis of adversarial conditions in discrete action spaces. These types of scenarios occur when agents need to make a decision from a finite and discrete set of actions.
In this paper, we study the effect of byzantine agents that perform the wrong action with certain probabilities and report initial results that let us understand the limitations of the state of the art in multi-agent RL in the presence of byzantine agents for discrete action spaces. In particular, we utilize the synchronous advantage actor-critic (A2C) algorithm on two of the standard Atari environments typically used for benchmarking DRL methods. This is, to the best of our knowledge, the first paper analyzing the effects, in terms of policy convergence, that different fractions of wrong actions in discrete action spaces have on collaborative multi-agent DRL. Our results show that in some environments with a total of 16 distributed agents the training process is highly sensitive to having a single agent acting in the wrong manner over a relatively small fraction of its actions, and unstable convergence appears with just a single agent having 2.5% of its actions wrong. In other environments the threshold is higher, with the network still converging to a working policy at over 10% of wrong actions in a single byzantine agent.
The remainder of this document is organized as follows. Section 2 presents related works in the area of adversarial
RL and applications combining RL with edge computing. Section 3 describes the basic theory behind A2C and the
simulation environments. Section 4 then presents our results on the convergence of the system when a fraction of the
actions is wrong, and Section 5 concludes the work and outlines our future work directions.
2. Related Works
Adversarial RL has attracted many researchers' interest in recent years. Multiple deep learning algorithms are known to be vulnerable to manipulation by perturbed inputs [9]. This problem also affects various reinforcement learning algorithms under different scenarios. In multi-agent environments, an attacker controlling one agent can induce adversarial observations that significantly degrade the victim agent's policy [10]. Ilahi et al. review emerging adversarial attacks in DRL-based systems and the potential countermeasures to defend against these attacks [15]. The authors classify the attacks as targeting (i) rewards, (ii) policies, (iii) observations, and (iv) the environment. In this paper, instead, we consider attacks targeting the agents' actions, which can happen in real-world applications when agents interact with their environment.
Similarly motivated by how to better transfer learning from simulations to the real world, multiple researchers have been working on simulation-to-reality transfer for specific applications in different environments [12, 2, 16]. In this paper, we analyze the effect of adversarial or byzantine agents in multi-agent reinforcement learning, and introduce a fraction of byzantine actions, which has not been studied before.
Other researchers have explored the influence of noisy rewards in RL. Wang et al. present a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and analyze the performance of different algorithms under their proposed framework, including PPO, DQN, and DDPG [11]. In this paper, we also explore perturbations of the DRL process, but we focus on analyzing discrete action spaces and a fraction of byzantine actions performed by a small number of byzantine agents.
Multiple works have been presented at the convergence of DRL and edge computing. However, rather than focusing on exploiting edge computing for distributed RL, most of the current literature exploits RL for edge services. For instance, Ning et al. apply DRL for more efficient offloading orchestration [3], while Wang et al. have applied DRL to optimize resource allocation at the edge [17]. In our work, however, we focus on analyzing some of the challenges that can appear when edge computing is exploited for distributed multi-agent DRL in real-world applications.
3. Methodology
This section describes the methods and simulation environments utilized for the analysis in Section 4: the advantage
actor-critic (A2C) algorithm for distributed DRL, and the simulation environment.
Actor-critic methods combine the advantages of value-based and policy-based methods, and have been regarded as the basis of many modern RL algorithms. In A2C, two neural networks represent the actor and the critic, where the actor controls the agent's behavior and the critic evaluates how good the taken action is. As policy gradients computed from raw returns tend to have high variability, an advantage function is employed to replace the raw return, leading to advantage actor-critic (A2C). The main scheme of the policy gradient updates is shown in (1):
$\theta_{new} \leftarrow \theta_{old} + \eta \nabla \bar{R}_{\theta}$   (1)
where $\theta$ denotes the parameters of the policy to be learned, $\eta$ is the learning rate, and $\nabla \bar{R}_{\theta}$ represents the policy gradient, given by (2):

$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \, \nabla \log p(a_t^n \mid s_t^n)$   (2)

where $N$ is the number of trajectories $\tau^n$ sampled under the policy, and $R(\tau^n)$ denotes the accumulated reward for each episode consisting of $T_n$ steps. Under the policy with weights $\theta$, an action $a_t^n$ in the state $s_t^n$ is chosen with probability $p(a_t^n \mid s_t^n)$.
In this policy gradient method, the accumulated reward $R(\tau^n)$ is calculated by sampling trajectories, which are only complete when the whole episode has finished, and hence might introduce high variability affecting the policy convergence. To avoid this, value estimation is introduced and merged into the policy gradient method. An advantage function is thus proposed to replace $R(\tau^n)$ according to (3), which is also the reason for the name A2C:

$R(\tau^n) = r_t^n + V_{\pi}(s_{t+1}^n) - V_{\pi}(s_t^n)$   (3)
where $r_t^n$ is the reward gained at step $t$, and $V_{\pi}$ denotes the value function estimating the accumulated reward that will be gained. Additionally, in the implementation of this A2C algorithm, multiple agents are employed to produce the trajectories in parallel. Compared to A3C [4], in which each agent updates the network individually and asynchronously, A2C collects the data from all agents and then updates the shared network. This is also illustrated in Fig. 1.
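For concreteness, the advantage-based policy-gradient update of (1)-(3) can be written in a few lines. The sketch below is a simplified PyTorch illustration under the notation above; a complete A2C loss additionally includes a value-function term and an entropy bonus, and the tensor names are our own, not those of the experiment code.

```python
import torch

def a2c_policy_loss(log_probs, rewards, values, next_values):
    """Simplified advantage actor-critic policy loss for one aggregated batch.

    log_probs:   log p(a_t^n | s_t^n) under the current policy, shape [B]
    rewards:     r_t^n, shape [B]
    values:      V_pi(s_t^n) from the critic, shape [B]
    next_values: V_pi(s_{t+1}^n) from the critic, shape [B]
    """
    # Advantage estimate from (3): A_t = r_t + V(s_{t+1}) - V(s_t)
    # (no discount factor, following the formulation in the text).
    advantage = rewards + next_values - values
    # Policy gradient of (2): maximize advantage-weighted log-probabilities,
    # i.e. minimize their negative mean. The advantage is detached so that
    # gradients flow only through the actor here.
    return -(advantage.detach() * log_probs).mean()

# The update of (1) then corresponds to a single optimizer step, e.g.:
#   loss = a2c_policy_loss(log_probs, rewards, values, next_values)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```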
In order to analyze the effect of byzantine actions, we choose two typical gym-wrapped Atari games as our simulation environments: PongNoFrameskip-v4 (Fig. 2b) and BreakoutNoFrameskip-v4 (Fig. 3b). Both of them take video as input, based on which the policy is trained to produce the corresponding discrete actions to obtain higher rewards. The action spaces for Pong and Breakout have cardinality 5 and 4, respectively. We set the corresponding byzantine agents to behave with the opposite actions (e.g., if the output action from the policy is action = 2 in Pong, then a wrong action by the byzantine agents will be action = 3). In this paper, we consider the effect of the presence of byzantine agents in terms of their number and the frequency of the wrong actions they perform. The patterns we observe in the experiments can be further utilized to detect byzantine agents in distributed multi-agent DRL scenarios.
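One straightforward way to inject such byzantine behavior is a thin wrapper around the Gym environment that overrides the policy's chosen action with a given probability. The sketch below is only an illustration of this idea; the opposite-action mapping shown is a hypothetical example following the Pong case above, and it is not necessarily the exact mechanism used in our experiments.

```python
import random
import gym

class ByzantineActionWrapper(gym.ActionWrapper):
    """Replaces the policy's action with an 'opposite' action with probability p.

    opposite: dict mapping each discrete action to a wrong counterpart,
              e.g. {2: 3, 3: 2} to swap the paddle movements in Pong
              (hypothetical mapping for illustration).
    p:        probability that a given step is byzantine.
    """

    def __init__(self, env, opposite, p=0.1):
        super().__init__(env)
        self.opposite = opposite
        self.p = p

    def action(self, act):
        if random.random() < self.p:
            return self.opposite.get(act, act)  # wrong (byzantine) action
        return act                              # intended action

# Example: one byzantine worker in Pong performing 5% wrong actions.
# env = ByzantineActionWrapper(gym.make("PongNoFrameskip-v4"),
#                              opposite={2: 3, 3: 2}, p=0.05)
```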
4. Simulations and Results
In this section, we describe the settings of our experiments and present the main conclusions of our analysis. A total of 16 agents, or workers, are employed to produce trajectory data in both the Pong and Breakout environments. The experiences, actions and rewards from the different agents are then aggregated synchronously to calculate the policy gradients and update the policy towards a more optimal one. In the experiments, we analyze how byzantine agents that unknowingly perform wrong actions affect the collaborative learning effort. The reference training without byzantine agents for the Pong and Breakout environments is shown in Fig. 2a and Fig. 3a, respectively.
In the Pong environment, we first set a single agent continuously behaving wrongly, out of the total of 16 agents working in parallel (Fig. 2c). Compared to the reference training, we observe that the policy is unable to improve in order to obtain better rewards. Therefore, a single byzantine agent representing as little as 6.25% of the total is enough to completely disable the ability of the system to converge towards a working policy. We have consequently focused on analyzing the maximum fraction of wrong actions that byzantine agents can perform while still ensuring convergence of the system. Moreover, in order to test whether it is the total fraction that matters or the number of agents affected, we have considered the same total fraction of byzantine actions in different settings.
In the training, the policy is updated only after the agents perform a series of steps, collecting a certain amount of interaction data. In particular, each agent performs 5 steps between updates of the policy. This leads to 80 steps between updates across the 16 agents, and we set the total number of steps to $10^7$ for the complete training process. The number of episodes depends on the performance of the agents (the better the performance, the longer the episodes are).
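The arithmetic behind the byzantine fractions reported in the figure captions follows directly from these settings. The short snippet below simply reproduces that bookkeeping for a few of the configurations of Figures 2 and 3; the helper function is ours and only illustrates how the percentages are obtained.

```python
# Worked arithmetic for the byzantine fractions in the captions of Fig. 2 and 3.
# The helper is only illustrative; the percentages refer to the fraction of a
# byzantine agent's own actions that are wrong.

STEPS_PER_AGENT_UPDATE = 5                                # steps per agent between updates
STEPS_PER_POLICY_UPDATE = 16 * STEPS_PER_AGENT_UPDATE     # = 80 steps system-wide
TOTAL_STEPS = 10_000_000                                  # 10^7 steps for the full training

def byzantine_fraction(step_fraction, update_fraction):
    """Fraction of wrong actions, given the fraction of wrong steps within an
    affected update and the fraction of updates that are affected."""
    return step_fraction * update_fraction

print(byzantine_fraction(1.0, 1 / 5))    # 0.2   -> "1/5 updates (20%)"
print(byzantine_fraction(1.0, 1 / 10))   # 0.1   -> "1/10 updates (10%)"
print(byzantine_fraction(0.5, 1 / 10))   # 0.05  -> "1/2 steps of 1/10 updates (5%)"
print(byzantine_fraction(0.5, 1 / 20))   # 0.025 -> "1/2 steps of 1/20 updates (2.5%)"
```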
With this, we set different fractions of byzantine actions depending on (i) the number of agents, (ii) the number of wrong actions in between updates, and (iii) the fraction of updates affected by byzantine actions. The results for the Pong environment are shown in Figures 2d through 2h. From these, we conclude that 20% of byzantine actions is enough to prevent the system from converging, while the system is able to converge with slight instabilities in the presence of 10% of byzantine actions (Figures 2e, 2g and 2h). Finally, with just 5% of byzantine actions (Fig. 2f) the convergence is similar to the reference.
Fig. 2: Experiments in the PongNoFrameskip-v4 environment (reward per episode). (a) Reference training (no byzantine agents). (b) PongNoFrameskip-v4 environment. (c) One agent with continuous byzantine actions. (d) One agent, byzantine actions in 1/5 updates (20% of the total). (e) One agent, byzantine actions in 1/10 updates (10% of the total). (f) One agent, byzantine actions in 1/2 steps of 1/10 updates (5% of the total). (g) Two agents, byzantine actions in 1/2 steps of 1/10 updates (2 × 5% of the total). (h) Four agents, byzantine actions in 1/2 steps of 1/20 updates (4 × 2.5% of the total).
In addition, we conduct similar experiments on another classical Atari game, BreakoutNoFrameskip-v4. The results with byzantine actions are shown in Figures 3c through 3h. In this environment, 10% of byzantine actions is enough to deter convergence (Figures 3c, 3f, and 3h). Only by reducing the frequency to 2.5% does the training achieve acceptable convergence (Fig. 3e). A general conclusion is also that the total fraction of byzantine actions is what matters the most, and not how they are introduced in the system.
5. Conclusion
Applying distributed multi-agent deep reinforcement learning in real-world scenarios requires further study of how adversarial conditions affect collaborative learning efforts. We have done this as an initial step towards distributed DRL applications falling under the umbrella of opportunities that the edge computing paradigm and next-generation mobile networks are bringing to the IoT and a myriad of cyber-physical systems. In this work, we have explored the performance of the state-of-the-art A2C algorithm in two different simulation environments and reported initial results analyzing the ability of the systems to converge in the presence of byzantine agents. In both environments, the agents were collaboratively learning a policy for discrete action spaces. Our study has focused on analyzing how different fractions of wrong actions performed unknowingly by byzantine agents affect the collaborative learning effort. These results will serve to inform future research on defending against such adversities. In future work, we will focus on building methods to detect byzantine agents and therefore provide a more robust collaborative learning framework.
Fig. 3: Experiments in the BreakoutNoFrameskip-v4 environment (reward per episode). (a) Reference training (no byzantine agents). (b) BreakoutNoFrameskip-v4 environment. (c) One agent, byzantine actions in 1/10 updates (10% of the total). (d) One agent, byzantine actions in 1/2 steps of 1/10 updates (5% of the total). (e) One agent, byzantine actions in 1/2 steps of 1/20 updates (2.5% of the total). (f) One agent, byzantine actions in 1/2 steps of 1/5 updates (10% of the total). (g) Two agents, byzantine actions in 1/2 steps of 1/10 updates (2 × 5% of the total). (h) Four agents, byzantine actions in 1/2 steps of 1/20 updates (4 × 2.5% of the total).
References
[1] T. T. Nguyen et al. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 2020.
[2] J. Matas et al. Sim-to-real reinforcement learning for deformable object manipulation. arXiv, 2018.
[3] Z. Ning et al. Deep reinforcement learning for vehicular edge computing: An intelligent offloading system. ACM TIST, 10(6), 2019.
[4] V. Mnih et al. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
[5] R. Raileanu et al. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.
[6] R. Conde et al. Time-varying formation controllers for unmanned aerial vehicles using deep reinforcement learning. arXiv, 2017.
[7] P. Long et al. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In ICRA. IEEE, 2018.
[8] J. Peña Queralta et al. Enhancing autonomy with blockchain and multi-access edge computing in distributed robotic systems. In The Fifth International Conference on Fog and Mobile Edge Computing (FMEC). IEEE, 2020.
[9] V. Behzadan et al. Vulnerability of deep reinforcement learning to policy induction attacks. In MLDM, 2017.
[10] A. Gleave et al. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
[11] J. Wang et al. Reinforcement learning with perturbed rewards. In AAAI, pages 6202–6209, 2020.
[12] B. Balaji et al. DeepRacer: Educational autonomous racing platform for experimentation with sim2real reinforcement learning. arXiv preprint arXiv:1911.01562, 2019.
[13] S. Gu et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA. IEEE, 2017.
[14] Y. Yu et al. Multi-agent deep reinforcement learning multiple access for heterogeneous wireless networks with imperfect channels. arXiv preprint arXiv:2003.11210, 2020.
[15] I. Ilahi et al. Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv preprint arXiv:2001.09684, 2020.
[16] K. Arndt et al. Meta reinforcement learning for sim-to-real domain adaptation. arXiv, 2019.
[17] J. Wang et al. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach. IEEE Transactions on Emerging Topics in Computing, 2019.
... We have chosen for this study, however, the more recent NVIDIA Isaac Sim platform owing to the high-quality visuals but also tools that enable seamless generation of randomized environments for synthetic data acquisition. Randomization and the ability to alter the environments has been shown to be a key parameter to collaborative learning approaches [15], [20]. An additional advantage is a common ecosystem of tools with the embedded NVIDIA Jetson platforms, the state-of-theart in embedded computing for robots that need discrete GPUs for DL inference. ...
... Multi-robot collaboration and DL for robotics have both played an increasingly important role in multiple robotic applications [4]. However, most of the work to date in learning from real-world experiences, sim-to-real transfer, and continuous learning, has been dedicated to reinforcement learning [3,11] and robotic manipulation [12]. Within the possibilities to achieve collaborative learning, one of the most straightforward approaches is cloud-based centralized learning [13], with a server where data is aggregated and training occurs at once or in batches, but in an offline manner. ...
Article
Full-text available
The role of deep learning (DL) in robotics has significantly deepened over the last decade. Intelligent robotic systems today are highly connected systems that rely on DL for a variety of perception, control and other tasks. At the same time, autonomous robots are being increasingly deployed as part of fleets, with collaboration among robots becoming a more relevant factor. From the perspective of collaborative learning, federated learning (FL) enables continuous training of models in a distributed, privacy-preserving way. This paper focuses on vision-based obstacle avoidance for mobile robot navigation. On this basis, we explore the potential of FL for distributed systems of mobile robots enabling continuous learning via the engagement of robots in both simulated and real-world scenarios. We extend previous works by studying the performance of different image classifiers for FL, compared to centralized, cloud-based learning with a priori aggregated data. We also introduce an approach to continuous learning from mobile robots with extended sensor suites able to provide automatically labelled data while they are completing other tasks. We show that higher accuracies can be achieved by training the models in both simulation and reality, enabling continuous updates to deployed models.
... Multi-robot collaboration and DL for robotics have both played an increasingly important role in multiple robotic applications [4]. However, most of the work to date in learning from real-world experiences, sim-to-real transfer, and continuous learning, has been dedicated to reinforcement learning [3], [11] and robotic manipulation [12]. Within the possibilities to achieve collaborative learning, one of the most straightforward approaches is cloud-based centralized learning [13], with a server where data is aggregated and training occurs at once or in batches, but in an offline manner. ...
Preprint
Full-text available
The role of deep learning (DL) in robotics has significantly deepened over the last decade. Intelligent robotic systems today are highly connected systems that rely on DL for a variety of perception, control, and other tasks. At the same time, autonomous robots are being increasingly deployed as part of fleets, with collaboration among robots becoming a more relevant factor. From the perspective of collaborative learning, federated learning (FL) enables continuous training of models in a distributed, privacy-preserving way. This paper focuses on vision-based obstacle avoidance for mobile robot navigation. On this basis, we explore the potential of FL for distributed systems of mobile robots enabling continuous learning via the engagement of robots in both simulated and real-world scenarios. We extend previous works by studying the performance of different image classifiers for FL, compared to centralized, cloud-based learning with a priori aggregated data. We also introduce an approach to continuous learning from mobile robots with extended sensor suites able to provide automatically labeled data while they are completing other tasks. We show that higher accuracies can be achieved by training the models in both simulation and reality, enabling continuous updates to deployed models.
... We have chosen for this study, however, the more recent NVIDIA Isaac Sim platform owing to the high-quality visuals but also tools that enable seamless generation of randomized environments for synthetic data acquisition. Randomization and the ability to alter the environments has been shown to be a key parameter to collaborative learning approaches [15], [20]. An additional advantage is a common ecosystem of tools with the embedded NVIDIA Jetson platforms, the state-of-theart in embedded computing for robots that need discrete GPUs for DL inference. ...
Preprint
Deep learning methods have revolutionized mobile robotics, from advanced perception models for an enhanced situational awareness to novel control approaches through reinforcement learning. This paper explores the potential of federated learning for distributed systems of mobile robots enabling collaboration on the Internet of Robotic Things. To demonstrate the effectiveness of such an approach, we deploy wheeled robots in different indoor environments. We analyze the performance of a federated learning approach and compare it to a traditional centralized training process with a priori aggregated data. We show the benefits of collaborative learning across heterogeneous environments and the potential for sim-to-real knowledge transfer. Our results demonstrate significant performance benefits of FL and sim-to-real transfer for vision-based navigation, in addition to the inherent privacy-preserving nature of FL by keeping computation at the edge. This is, to the best of our knowledge, the first work to leverage FL for vision-based navigation that also tests results in real-world settings.
... Active research areas in TIERS include multi-robot coordination [1], [2], [3], [4], [5], swarm design [6], [7], [8], [9], UWB-based localization [10], [11], [12], [13], [14], [15], localization and navigation in unstructured environments [16], [17], [18], lightweight AI at the edge [19], [20], [21], [22], [23], distributed ledger technologies at the edge [24], [25], [26], [27], [28], [29], edge architectures [30], [31], [32], [33], [34], [35], offloading for mobile robots [36], [37], [38], [39], [40], [41], [42], LPWAN networks [43], [44], [45], [46], sensor fusion algorithms [47], [48], [49], and reinforcement and federated learning for multi-robot systems [50], [51], [52], [53]. ...
... The main objective of a multi-agent RL is to obtain the localized policies and maximize the global reward for knowledge sharing on the premise of increased system complexity and computation [38]. In multirobot systems, distributed RL can be leveraged to expose different robots to different environments, or to learn more robust policies in the presence of disturbances [39,40]. ...
Preprint
Autonomous systems are becoming inherently ubiquitous with the advancements of computing and communication solutions enabling low-latency offloading and real-time collaboration of distributed devices. Decentralized technologies with blockchain and distributed ledger technologies (DLTs) are playing a key role. At the same time, advances in deep learning (DL) have significantly raised the degree of autonomy and level of intelligence of robotic and autonomous systems. While these technological revolutions have been taking place, concerns over data security and end-user privacy have become an inescapable research consideration. Federated learning (FL) is a promising solution to privacy-preserving DL at the edge, with an inherently distributed nature: it learns on isolated data islands and communicates only model updates. However, FL by itself does not provide the levels of security and robustness required by today's standards in distributed autonomous systems. This survey covers applications of FL to autonomous robots, analyzes the role of DLT and FL for these systems, and introduces the key background concepts and considerations in current research.
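Relating to the multi-agent RL context cited above, and close in spirit to the distributed advantage actor-critic setting of the present paper, the following minimal sketch shows how experience batches from several agents can be folded into a single policy-gradient update. It assumes a PyTorch actor-critic model returning action logits and state values, with advantages and returns already computed; all names and the 0.5 value-loss weight are illustrative choices.

import torch

def shared_a2c_update(model, optimizer, agent_batches):
    # Aggregate experience from several agents into one actor-critic update.
    # Each batch is a tuple (observations, actions, advantages, returns).
    optimizer.zero_grad()
    total_loss = torch.tensor(0.0)
    for obs, actions, advantages, returns in agent_batches:
        logits, values = model(obs)
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        policy_loss = -(chosen * advantages).mean()                 # policy gradient term
        value_loss = (values.squeeze(-1) - returns).pow(2).mean()   # critic term
        total_loss = total_loss + policy_loss + 0.5 * value_loss
    total_loss.backward()
    optimizer.step()

Faulty or byzantine agents enter precisely here: corrupted actions or advantages in one agent's batch contaminate the shared gradient.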
Article
Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors), and is therefore not credible. In addition, for applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors by receiving corrupted rewards. In this paper, we consider noisy RL problems with perturbed rewards, which can be approximated with a confusion matrix. We develop a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed. Our solution framework builds on existing RL/DRL algorithms and addresses, for the first time, the biased noisy reward setting without any assumptions on the true distribution (e.g., the zero-mean Gaussian noise assumed in previous works). The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies trained with our estimated surrogate reward achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm obtains 84.6% and 80.8% improvements in average score over five Atari games, with error rates of 10% and 30%, respectively.
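The surrogate-reward construction can be illustrated numerically. Assuming a finite set of reward levels and an estimated confusion matrix C, with C[i, j] the probability of observing level j when the true level is i, unbiased surrogate rewards follow from solving the linear system C * r_hat = r, so that the expected surrogate reward under the noise model recovers the true reward. The sketch below uses made-up numbers and variable names, not the paper's code.

import numpy as np

# Two discrete reward levels (a toy binary-reward example).
reward_levels = np.array([0.0, 1.0])

# Estimated confusion matrix: C[i, j] = P(observe level j | true level i).
C = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Surrogate rewards satisfy C @ r_hat = reward_levels, so the expected
# surrogate reward under the noise model equals the true reward.
r_hat = np.linalg.solve(C, reward_levels)
print(r_hat)  # approximately [-0.167, 1.5] for the matrix above

def surrogate(observed_level_index):
    # Replace each observed (possibly corrupted) reward by its surrogate value.
    return r_hat[observed_level_index]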
Article
Reinforcement learning (RL) algorithms have been around for decades and employed to solve various sequential decision-making problems. These algorithms, however, have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to drive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This article addresses an important aspect of deep RL related to situations that require multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multiagent deep RL (MADRL) is presented, including nonstationarity, partial observability, continuous state and action spaces, multiagent training schemes, and multiagent transfer learning. The merits and demerits of the reviewed methods will be analyzed and discussed with their corresponding applications explored. It is envisaged that this review provides insights about various MADRL methods and can lead to the future development of more robust and highly useful multiagent learning methods for solving real-world problems.
Conference Paper
This conceptual paper discusses how different aspects involving the autonomous operation of robots and vehicles will change when they have access to next-generation mobile networks. 5G and beyond connectivity is bringing together a myriad of technologies and industries under its umbrella. High-bandwidth, low-latency edge computing services through network slicing have the potential to support novel application scenarios in different domains including robotics, autonomous vehicles, and the Internet of Things. In particular, multi-tenant applications at the edge of the network will boost the development of autonomous robots and vehicles offering computational resources and intelligence through reliable offloading services. The integration of more distributed network architectures with distributed robotic systems can increase the degree of intelligence and level of autonomy of connected units. We argue that the last piece to put together a services framework with third-party integration will be next-generation low-latency blockchain networks. Blockchains will enable a transparent and secure way of providing services and managing resources at the Multi-Access Edge Computing (MEC) layer. We overview the state-of-the-art in MEC slicing, distributed robotic systems, and blockchain technology to define a framework for services at the MEC layer that will enhance the autonomous operations of connected robots and vehicles.
Conference Paper
Deep learning classifiers are known to be inherently vulnerable to manipulation by intentionally perturbed inputs, known as adversarial examples. In this work, we establish that reinforcement learning techniques based on Deep Q-Networks (DQNs) are also vulnerable to adversarial input perturbations, and verify the transferability of adversarial examples across different DQN models. Furthermore, we present a novel class of attacks based on this vulnerability that enable policy manipulation and induction in the learning process of DQNs. We propose an attack mechanism that exploits the transferability of adversarial examples to implement policy induction attacks on DQNs, and demonstrate its efficacy and impact through an experimental study of a game-learning scenario.
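A minimal sketch of the kind of input perturbation such attacks build on is given below, using the fast gradient sign method against a PyTorch Q-network. The attack described in the abstract additionally crafts perturbations towards an adversarial target policy and exploits transferability between models, which this untargeted sketch does not show; the function name, epsilon value, and the [0, 1] observation range are illustrative assumptions.

import torch
import torch.nn.functional as F

def fgsm_perturb(q_network, state, epsilon=0.01):
    # Craft an adversarial observation that degrades the greedy action choice
    # of an otherwise unmodified Q-network (untargeted FGSM sketch).
    state = state.clone().detach().requires_grad_(True)
    q_values = q_network(state)            # shape: [batch, num_actions]
    greedy_action = q_values.argmax(dim=1)
    # Increasing the loss of the currently greedy action pushes the policy
    # away from its original choice.
    loss = F.cross_entropy(q_values, greedy_action)
    loss.backward()
    adversarial_state = state + epsilon * state.grad.sign()
    return adversarial_state.clamp(0.0, 1.0).detach()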
Article
The development of smart vehicles brings drivers and passengers a comfortable and safe environment. Various emerging applications promise to enrich users' traveling experiences and daily life. However, executing computation-intensive applications on resource-constrained vehicles still faces huge challenges. In this paper, we construct an intelligent offloading system for vehicular edge computing by leveraging deep reinforcement learning. First, both the communication and computation states are modelled by finite Markov chains. Moreover, the task scheduling and resource allocation strategy is formulated as a joint optimization problem to maximize users' Quality of Experience (QoE). Due to its complexity, the original problem is further divided into two sub-optimization problems. A two-sided matching scheme and a deep reinforcement learning approach are developed to schedule offloading requests and allocate network resources, respectively. Performance evaluations illustrate the effectiveness and superiority of our constructed system.
Keywords: Vehicular system, intelligent offloading, deep reinforcement learning, edge computing.
Article
The development of mobile devices with improving communication and perceptual capabilities has brought about a proliferation of numerous complex and computation-intensive mobile applications. Mobile devices with limited resources face more severe capacity constraints than ever before. As a new concept of network architecture and an extension of cloud computing, Mobile Edge Computing (MEC) appears to be a promising solution to meet this emerging challenge. However, MEC also has some limitations, such as the high cost of infrastructure deployment and maintenance, as well as the severe pressure that the complex and changeable edge computing environment places on MEC servers. At this point, rationally allocating computing and network resources to satisfy the requirements of mobile devices under changeable MEC conditions has become a major challenge. To combat this issue, we propose a smart, Deep Reinforcement Learning based Resource Allocation (DRLRA) scheme, which can allocate computing and network resources adaptively, reduce the average service time, and balance the use of resources under a varying MEC environment. Experimental results show that the proposed DRLRA performs better than the traditional OSPF algorithm under changing MEC conditions.
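The last two abstracts share a common pattern: the offloading or allocation problem is cast as a sequential decision process and a value-based DRL agent picks an action per request. The following toy sketch illustrates that pattern with an epsilon-greedy decision over discrete allocation actions from a small Q-network; the state layout (channel quality, server load, task size), the network shape, and all names are our own assumptions rather than either paper's design.

import random
import torch
import torch.nn as nn

# Toy state: [channel_quality, server_load, task_size], all normalized to [0, 1].
NUM_ACTIONS = 4   # e.g. local execution or offloading to one of three edge servers

q_network = nn.Sequential(
    nn.Linear(3, 32),
    nn.ReLU(),
    nn.Linear(32, NUM_ACTIONS),
)

def choose_allocation(state, epsilon=0.1):
    # Epsilon-greedy action selection over the discrete allocation choices.
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        q_values = q_network(torch.tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

print(choose_allocation([0.8, 0.2, 0.5]))

Training such an agent would additionally require a reward signal, for example negative service latency, and an update loop, which the two cited papers design in their own ways.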