Deep Q Network (DQN), Double DQN, and Dueling DQN: A Step Towards General Artificial Intelligence

Author: Mohit Sewak

Cite as:
BibTeX:
@incollection{drl-8-dqn-ddqn,
title={Deep Q Network (DQN), Double DQN, and Dueling DQN},
author={Sewak, Mohit},
booktitle={Deep Reinforcement Learning},
pages={95--108},
year={2019},
publisher={Springer}
}
Plain Text:
M. Sewak, "Deep Q Network (DQN), Double DQN, and Dueling DQN",
Deep Reinforcement Learning, pp. 95-108, Springer, 2019.
Abstract
In this chapter, we will take our first step towards Deep Learning based Reinforcement Learning. We will discuss the very popular Deep Q Network (DQN) and its powerful variants, the Double DQN and the Dueling DQN. Extensive work has been done on these models, and they form the basis of some very popular applications such as AlphaGo. We will also introduce the concept of General AI in this chapter and discuss how these models have been instrumental in inspiring hopes of achieving General AI through Deep Reinforcement Learning applications.
General Artificial Intelligence
The Reinforcement Learning agents that we have studied so far may be considered to fall under the category of Artificial Intelligence agents. But is there something beyond Artificial Intelligence as well? In Chapter 1, while discussing what could be called real 'Intelligence', we stumbled upon the idea of 'Human-Like' behavior as a benchmark for evaluating the degree of 'Intelligence'. Let's spend a moment discussing what human, or human-like, intelligence is capable of.
Let us take up our theme of games again for this discussion so that all readers can relate. Since the '80s, many of us have grown up playing video games like 'Mario', then 'Contra', then popular FPS (First Person Shooter) games like 'Half-Life', followed by multiplayer shooters like 'Counter-Strike', and are now hooked on the 'Battle-Royale' genre of games like 'PUBG' and 'Fortnite'. It may take someone a day, or maybe a week, to gain a decent level of proficiency in any of these games, even while moving from one game to another, and at times while playing more than a single game simultaneously. As humans, even the avid game enthusiasts among us do a lot of things besides playing games, and we can gain increasing proficiency at all of them with the same 'mind' and the same 'Intelligence'. This concept, where a single architecture and model of intelligence can be used to learn different, even seemingly unrelated, problems is called 'General Artificial Intelligence'.
Until recently, Reinforcement Learning agents were hand-crafted and tuned to perform individual, specific tasks. For example, various scholars have experimented with innovative agents and mechanisms to get better at the game of 'Backgammon'. Recently, with OpenAI Gym and some other initiatives opening their platforms to Reinforcement Learning academicians and enthusiasts to work on standardized problems (in the form of exposed standard environments) and compare their results and enhancements with the community, there have been several papers and informal competitions in which researchers and academicians propose innovative algorithms and other enhancements that generate better scores/rewards in a specific Reinforcement Learning environment. So essentially, from one evolution of agents to another, Reinforcement Learning agents and the algorithms that power them keep getting better at a particular task. These specific tasks may range from solving a particular environment of the Gym, like playing 'Backgammon', to balancing the 'Cart-Pole', and others. But the concept of 'General Artificial Intelligence' has remained elusive.
But things are changing now with the evolution of 'Deep Reinforcement Learning'. As we discussed in an earlier chapter, 'Deep Learning' has the capability of intelligently extracting important features from the data on its own, without requiring human/SME involvement in hand-crafting domain-specific features. When we combine this ability with the self-acting capability of Reinforcement Learning, we come closer to realizing the idea of 'General Artificial Intelligence'.
An Introduction to Google's 'DeepMind' and 'AlphaGo'
Researchers at Google's 'DeepMind' ('DeepMind' was acquired by Google in 2014) developed the algorithm called the Deep Q Network that we will be discussing in detail in this chapter. They combined the Q Learning algorithm from Reinforcement Learning with the ideas of Deep Learning to create the Deep Q Network (DQN). A single DQN program could teach itself how to play 49 different games from the 'Atari' titles ('Atari' used to be a very popular gaming console in the era of the '80s and beyond, with a large catalogue of graphical game titles) and excel at most of them simultaneously, even beating the best human scores on most of these titles, as shown in the figure below.
Figure 1 - Normalized performance of DQN vs. human gamer, computed as 100 × (DQN score − random-play score) / (human score − random-play score), for games where DQN performed better than the human gamer. (Ref: DQN-Nature-Paper)
Algorithms similar to the DQN also powered DeepMind's famous 'AlphaGo' program. AlphaGo was the very first program to consistently and repeatedly defeat the best human adversaries at the game of 'Go'. For readers who are unfamiliar with 'Go' but understand 'Chess': if 'Chess' is considered a game that challenges human intelligence, planning, and strategizing capabilities to a significant level, then 'Go' is considered to take these challenges many notches higher. The number of possible board positions in 'Go' is greater than the number of atoms in the observable universe, and hence it requires the best of human intelligence, thinking, and planning capabilities to excel in this game.
With a Deep Reinforcement Learning agent consistently defeating even the best human adversaries in these games, and with many other similar Deep Reinforcement Learning algorithms consistently defeating their respective human adversaries in at least 49 other gaming instances, as claimed in multiple comparative studies using standardized games/environments, we could assume that the advancements in the area of 'Deep Reinforcement Learning' are leading us towards the concept of 'General Artificial Intelligence' as described in the previous section.
The DQN Algorithm
The term 'Deep' in 'Deep Q Networks' (DQN) refers to the use of deep 'Convolutional Neural Networks' (CNN) in the DQN. Convolutional Neural Networks are deep learning architectures inspired by the way the visual cortex area of the human brain works to understand the images that the sensors (eyes) are receiving. We mentioned in Chapter 1, while discussing state formulations, that for image/visual inputs the state could either be humanly abstracted or the agent could be made intelligent enough to make sense of these states itself. In the former case, a separate human-defined algorithm to understand the objects in an image, their specific instances, and the position of each instance is custom trained, and the agent is fed this simplified data as an input to form a simplified state to work on. In the latter case, we also discussed a way in which we could enable our Reinforcement Learning agent itself to simplify the state of raw image pixels and draw intelligence from it. We also briefly discussed the role of CNNs (Convolutional Neural Networks) there.
CNNs contain layers of convolutional neurons, and within each layer there are different kernels (functions) that cover the image in different strides. A 3 x N x N dimensional input image (here a 3 x N x N input means an image with 3 color channels, each of N x N pixels), when passed through a convolutional layer, may produce multiple convolution maps of lower dimension than the N x N input size of each channel, but every position in a resulting map uses the same kernel weights. Since the weights of a kernel remain the same across the layer, only a single weight vector per kernel needs to be optimized; hence, for bringing out the key features in an image, a CNN is more efficient than a DNN (MLP-based Deep Neural Network) counterpart delivering similar accuracy. But the output of a CNN is a multi-dimensional tensor, which is not effective for feeding into a subsequent classification or regression (value-estimation) model. Therefore, the last convolutional layer of a CNN is connected to one or more flat layers (which are similar to the hidden layers in a DNN) before being fed into (mostly) a 'SoftMax' activation layer for classification or (generally) a 'Linear' activation layer for regression. The 'SoftMax' activation layer produces the class probabilities for each class for which classification is required, and choosing the output class with the highest class probability (argmax) determines the best action.
The DQN contains a CNN as described above. The specific DQN that we mentioned in the earlier section on General Artificial Intelligence, which performed well on 49 Atari titles simultaneously, used an architecture having a CNN with 2 convolutional layers, followed by 2 fully-connected layers, terminating in an 18-class 'SoftMax' classification. These 18 classes represent the 18 actions possible from an Atari controller (Atari had a single 8-direction joystick and just one button for all the games) that the game input could act on. These 18 classes (as used in the specific DQN by DeepMind for Atari) are: Do-Nothing (i.e. don't do anything); 8 classes representing the 8 directions of the joystick (Move-Straight-Up, Move-Diagonal-Right-Up, Move-Straight-Right, Move-Diagonal-Right-Down, Move-Straight-Down, Move-Diagonal-Left-Down, Move-Straight-Left, Move-Diagonal-Left-Up); Press-Button (alone, without moving); and another 8 actions corresponding to simultaneously pressing the button and making one of the joystick movements.
At every instance at which the agent is required to act (such action instances may not correspond one to one with every frame, as we will discuss later), the agent chooses one of these actions (note that one of the actions is Do-Nothing as well). The figure below shows the illustrative architecture of the Deep Learning model culminating in the required 18 action classes.
Figure 2 - DQN CNN Schematic (Ref: DQN-Nature-Paper)
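To make the architecture concrete, below is a minimal sketch in Python (using PyTorch, which this book does not prescribe) of a DQN-style network: a small stack of convolutional layers followed by fully-connected layers ending in one output per action. The exact layer sizes, the 84 x 84 input resolution, and the 4-frame input stack are assumptions loosely based on the DQN papers rather than the exact published configuration; the greedy argmax at the end plays the role of picking the highest-scoring action class described above.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """A DQN-style CNN: convolutional feature extractor + flat layers + one output per action."""
    def __init__(self, n_actions: int = 18, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 4 x 84 x 84 -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # -> 32 x 9 x 9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                            # one score per Atari action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

net = DQNNetwork()
state = torch.zeros(1, 4, 84, 84)           # a dummy stacked-frame state
action = net(state).argmax(dim=1).item()    # greedy action selection
```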
The motivation of this book is to enable readers to build their own real-life RL agents. Since the Atari-based agent might not have the most suitable model for some other applications, we may have to change the CNN configuration and the structure of the output layer for the specific use case and domain that we are implementing it for.
Atari gives a 60 FPS video output. This means that every second the game generates and displays/sends 60 images, and this is the signal we could use as input to our agent to form its states. One drawback of using raw image pixels and working directly with all consecutive frames at such a high frame rate to train a Q-Learning network is that the training may not be very stable. Not only might the training take a long time to converge, but at times, instead of converging, the loss function may actually diverge or get stuck in a hunting loop. To overcome these challenges while working on high-frame-rate, high-dimension, correlated image data, the DQN had to implement the following enhancements to ensure decent convergence and practical applicability.
Experience Replay
It is important to understand the concept of the Experience Trail before we discuss the 'Experience Replay' enhancement. In Chapter 4, while discussing Q Learning, we referred to the quadruple of (state, action, reward, next-state) as an 'experience' data instance used to train the Q Learning's Action-Value/Q function. In the 'Experience Trail', the term 'experience' refers to exactly the same experience instance, that is, a tuple of (state, action, reward, next-state), or (s, a, r, s′) in abbreviated form. Now let us discuss the problem of convergence, which we briefly touched upon in the earlier section, in greater detail to understand why a 'trail' of such experience instances is required.
While using a graphical feed as input to our Reinforcement Learning agent, we receive numerous frames of raw pixels in quick succession. Since these frames arrive in sequence, there is very high correlation among consecutive input frames. The update that occurs to the Q function values during training is very sensitive to the number of times the algorithm encounters a particular experience instance. In the basic Q Learning algorithm, the action-values/Q function is updated at every step (we will see in a later sub-section that DQN also modifies this behavior for the same reason). Consecutively seeing similar experience instances very frequently results in the weights of the Q network being updated in a very specific direction. Such biases in training may lead to the formation of local 'ravines' in the loss function's parameter space. Such 'ravines' are very difficult to maneuver for simple optimization algorithms, and hence such biases from the ingestion of many similar experience instances slow down or hinder the optimization of the cost function. It may be difficult to optimize such a loss function under these challenges, and the training of such a Q network may warrant the use of very complex optimizers.
Therefore, with an 'Experience Trail', the experience tuples are not used directly, in the order in which they are generated by the source system (in our case the Atari emulator), to train the agent. Instead, all the experience instance tuples generated by the source system are collected in a memory buffer (usually of fixed size). This memory buffer is updated with new experience instances as they are received, as a queue, i.e. in a first-in, first-out order. So, as soon as the memory buffer reaches its limit, the oldest experience instances are deleted to make way for the latest ones. From this pool/buffer of experience instances, experience tuples are picked randomly to train the agent. This process is known as 'Experience Replay'.
'Experience Replay' not only solves the problems arising from the use of consecutive sequences of experiences for training the Q network, as we discussed earlier, but also limits the similar-frames problem, since only a few frames from any consecutive sequence are likely to be picked in a random draw.
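As a minimal sketch (assuming experiences are simple (s, a, r, s′) tuples), the experience trail with uniform replay can be implemented as a fixed-size queue from which training batches are drawn at random; the capacity and batch size below are illustrative values, not the chapter's.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience trail with uniform random replay."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 32):
        # A uniform random draw breaks the correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)
```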
Prioritized Experience Replay
Prioritized Experience Replay is an enhancement to the Experience Replay mechanism used in the base/original DQN algorithm that outperformed humans in 49 Atari games. A DQN with Prioritized Experience Replay was able to outperform the original DQN with uniform Experience Replay in 41 of the 49 games in which the original DQN outperformed human gamers.
In the basic 'Experience Replay' enhancement that we discussed earlier, all the experience instances are stored in the experience trail in the same order in which they are received, and the buffered experience instances are selected 'randomly' for training. As the name suggests, in Prioritized Experience Replay we would like to use some sort of priority in this replay process.
There are two modes of prioritizing experience instances from the experience trail. The first mode is to 'prioritize' which of the input experience instances received from the source system are stored in the experience trail, from where they could then be picked at random for 'replay'. Alternatively, the second mode is to buffer all the experience instances as they are generated from the source system and then to prioritize which specific experience instances are selected to be replayed from this unprioritized storage.
In the 'Prioritized Experience Replay' enhancement, we choose to prioritize using the second mode. Once we have settled on prioritized replaying from an unprioritized experience trail (storage), the second decision is to determine the specific criterion for prioritizing an experience instance for replay. For this, the 'Prioritized Experience Replay' algorithm uses the Temporal Difference (TD) error δ as the criterion to prioritize specific experience instances for replay in subsequent iterations of the training. So, unlike the original ('uniform') Experience Replay method, where every experience instance ((s, a, r, s′) tuple) has a uniform probability of being selected for training, the Prioritized Experience Replay variant gives relatively higher priority to samples that produced a larger TD error δ.
So, the priority of a given experience tuple is given as:
\[ p_i = \left| \delta_i \right| + e \] --- (1)
where e is a small added constant that avoids a zero priority for any available sample in the experience trail. One problem with the above formulation is that, though such a prioritization is good in the initial phases of training, later on, when the agent has predominantly and repeatedly learnt from some specific experiences, it develops biases towards those experience instances. This leads to over-fitting of the agent's model and associated nuances. To avoid this pitfall, the above equation is modified slightly and a stochastic formulation is applied to it to add some randomness and avoid a completely greedy selection. This is done as below:
\[ P_{(i)} = \frac{p_i^\alpha}{\sum_{k} p_k^\alpha} \] --- (2)
In equation (2) above, the process for determining the probability of sampling a particular experience instance can be controlled. The sampling probability of a particular experience instance could be generated by a process that ranges from purely random, to purely greedy, to anything in between. This control is determined by taking the priority of the transition as defined in the earlier equation, raising it to the power 'α', and normalizing it over all transition priorities, each likewise raised to the power 'α'. Here 'α' is a constant that determines the greediness of the sampling process and could be set to any value between 0 and 1. A process with α = 0 denotes no prioritization and gives the effect of uniform sampling, leading to results similar to those of the original unprioritized Experience Replay algorithm. Conversely, a process with α = 1 amounts to extreme prioritization of samples with large TD errors throughout the training, with the associated biases that we discussed earlier.
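A minimal sketch of this sampling scheme, following equations (1) and (2), is given below in Python/NumPy. Here `td_errors` is assumed to hold the TD error recorded for each stored experience, and the values of `e` and `alpha` are illustrative choices, not the chapter's.

```python
import numpy as np

def prioritized_sample(td_errors: np.ndarray, batch_size: int,
                       e: float = 0.01, alpha: float = 0.6) -> np.ndarray:
    """Return indices of experiences to replay, sampled per equations (1) and (2)."""
    priorities = np.abs(td_errors) + e        # equation (1): p_i = |delta_i| + e
    probs = priorities ** alpha               # alpha = 0 -> uniform, alpha = 1 -> fully greedy
    probs = probs / probs.sum()               # equation (2): normalize over all transitions
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

indices = prioritized_sample(np.array([0.5, 2.0, 0.1, 1.2]), batch_size=2)
```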
Skipping Frames
A further enhancement to address the bias caused by training with many frequent, consecutive, and similar frames, as discussed above, is that we do not pick all 60 frames generated per second for the purpose of training. In the DQN trained for 'Atari', 4 consecutive frames were combined to make up the data pertaining to one state. This also reduces the computational cost without losing much information. Assuming that the games are designed around human reaction times between key events, there would be a lot of correlation between frames if the feed were consumed at the full 60 fps. The choice of 4 frames out of every 60 is not a hard rule. In practical applications in one's own domain, this number could be adjusted depending upon the requirements of the specific use case, the input frequency, and the correlation among consecutive frames.
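A minimal sketch of this idea (with the grayscale preprocessing left as an assumed placeholder) keeps only a short history of frames and stacks them into one state tensor:

```python
import numpy as np
from collections import deque

frame_history = deque(maxlen=4)   # keep only the most recent 4 (preprocessed) frames

def build_state(new_frame: np.ndarray):
    """Append the latest frame; a state is the last 4 frames stacked together."""
    frame_history.append(new_frame)
    if len(frame_history) < 4:
        return None                              # not enough history yet
    return np.stack(frame_history, axis=0)       # shape: (4, height, width)
```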
Additional Target Q-Network
One major change that the Deep Q Network made over the basic Q Learning algorithm is the introduction of a new 'Target' Q Network. While discussing Q-Learning in Chapter 4, we referred to the term (r + γ max_a′ Q(s′, a′)) in the equation for the Q function update as the 'target'. Given below is the complete equation for reference.
\[ Q_{(s, a)} = (1-\alpha)\, Q_{(s, a)} + \alpha \left( r + \gamma \max_{a'} Q_{(s', a')} \right) \] --- (4.7)
So essentially, in this equation the Q function Q(s, a) is referenced twice, and each reference serves a different purpose. The first reference, i.e. (1 − α) Q(s, a), mainly retrieves the present state-action value so that it can be updated (the Q in: Q(s, a) = (1 − α) Q(s, a) + …), and the second obtains the 'target' value from the Q value of the next state-action (the Q in: r + γ max_a′ Q(s′, a′)). Though in the basic Q Learning algorithm both these Q functions/networks (or Q-tables, in the case of a tabular Q-Learning approach) were the same, it need not necessarily always be so.
In DQN, the target Q network is different from the one that is being continuously updated at every step. This is done to overcome the drawbacks of using the same Q network both for continuous updates and for reading the target values. These drawbacks stem mainly from two reasons. The first, as we highlighted earlier in the section on the basic DQN, relates to the delayed/sub-optimal convergence caused by too-frequent and highly correlated training data; it should be noted that if the targets for training come from the same network, they are bound to be correlated with it. The second reason is that it is not a good idea to use target values from the same function to correct its own update, because when we use a function to update its own estimates, it may lead to an 'unstable' target.
Thus, it is found that using two different Q networks for these two different purposes enhances the stability of the Q network. But if a target action-value is required for the training, and if this target value does not update at all after initialization (which, as we learnt, could even be all zeros in Q Learning since it is an off-policy algorithm), then the 'active' (actively updated/estimated) network could not be updated effectively. Therefore, the 'target' Q Network is synced and updated with the actively updated Q Network once every 'c' steps. For the 'Atari' problem, the value of 'c' was fixed at 1000 steps.
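A minimal sketch of this arrangement (in PyTorch, with a toy fully-connected network standing in for the CNN, and illustrative values of γ and c) is given below: the target value is always read from the frozen copy, which is overwritten with the online network's weights every c steps.

```python
import copy
import torch
import torch.nn as nn

GAMMA, C = 0.99, 1000
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 18))
target_net = copy.deepcopy(online_net)       # frozen copy used only for target values

def td_target(reward: torch.Tensor, next_state: torch.Tensor) -> torch.Tensor:
    """target = r + gamma * max_a' Q_target(s', a'), read from the target network."""
    with torch.no_grad():
        return reward + GAMMA * target_net(next_state).max(dim=1).values

def sync_if_due(step: int) -> None:
    """Copy the online network's weights into the target network every C steps."""
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```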
Clipping Rewards and Penalties
Although this is not a very significant change when considering training and deployment for a single application, when considered from the perspective of developing a system for 'General Artificial Intelligence', the mechanism for accumulating rewards and penalties needs to be balanced. Different games (and real-domain skills) may have different scoring systems. Some games may offer a relatively low absolute score for even a very challenging task, while others may be too generous in giving absolute rewards (scores). For example, in a game like 'Mario' it is easy to get scores in the range of hundreds of thousands of points, whereas in a game like 'Pong' the player gets just a single point for winning an entire rally.
Since Reinforcement Learning, and especially the idea of 'General Artificial Intelligence', is inspired by the human mind's ability to learn different skills, let's analyze our own constitution to understand this concept better. Humans, and for that matter most animals, learn different habits and stereotypes by a process called reinforcement, which is also the basis of the Reinforcement Learning that we have modeled for machines to become adept at different skills. Since Reinforcement Learning requires a reward to 'reinforce' any behavior, our mind should also work by receiving some reward that reinforces and helps learn a behavior. In humans, the sense of reward is achieved by the release of a chemical called 'dopamine', which reinforces the behavior that triggered it. If you are curious why you get so addicted to your mobile and want to click on every notification, social-media feed, and shopping app to the point that they start controlling you instead of you controlling them, you can blame the dopamine response. Similarly, addictions to substances ranging from drugs to sugary foods are governed by the reinforcement caused by the release of dopamine, which serves as the reward mechanism that makes us reinforce certain behaviors.
Since the body's dopamine-producing capability is limited, an automatic scaling and clipping effect is realized across the different activities we do. When this dopamine response system is altered externally/chemically, for example by the consumption of drugs, it leads to withdrawal from other activities that bring meaning to life, resulting in unstable behavior and outcomes.
To achieve a similar reward/penalty scaling and clipping effect in the DQN used for the 'Atari' games, all rewards across all games were fixed to +1 and all penalties to -1. Since rewards are key to reinforcement training and vary widely across applications, readers are encouraged to devise their own scaling and clipping techniques for their respective use cases and domains.
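A minimal sketch of this clipping is a one-line transformation of the raw game score change: every positive reward becomes +1 and every negative one -1 (a zero score change stays 0), regardless of magnitude.

```python
import numpy as np

def clip_reward(raw_reward: float) -> float:
    """Map any positive reward to +1.0 and any penalty to -1.0."""
    return float(np.sign(raw_reward))
```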
Double DQN
In situations that warrant the application of Deep Reinforcement Learning, the state space and state size are generally extremely large, and it may take a lot of time for the agent to learn sufficient information about the environment and ascertain which states/actions lead to the best instantaneous or total rewards. Under these conditions, the opportunities for exploration may be overwhelmed (especially with constant-epsilon algorithms), and the agent may get stuck exploiting the already explored and estimated state-action combinations that have relatively high estimated values, even if higher-valued but not-yet-explored options exist. This may lead to the 'overestimation of Q-Values' for some of these state-action combinations, resulting in sub-optimal training.
In the earlier section we discussed splitting the Q network into two different Q networks, one being the online/active network and the other being the target Q network whose values are used as a reference. We also discussed that the target Q network is not updated very frequently; instead it is updated only after a certain number of steps. The overestimation problem highlighted above may become even more significant if the actions (for the target) are selected on the basis of a Q network (the target Q network) whose values are not frequently updated (since these are copied from the active 'online' Q network only after every thousand or so steps).
We also discussed in the earlier section why it was important to split the Q network into the active and the target Q networks, and the benefits of a dedicated target Q network. So, we would like to continue using the 'Target' Q network, as it offers better and more stable target values for the update. To combine the best of both worlds, the 'Double DQN' algorithm proposes to select the action on the basis of the 'Online' Q network, but to use the value corresponding to this particular state-action from the 'Target' Q network.
So, in Double DQN, at every step the values of all possible actions in the given next state are read from the 'online' Q network, which is being updated continuously. Then an argmax is taken over these action values, and whichever action maximizes the value is selected. But to update the 'online' Q network, the corresponding (target) value of this selected state-action combination is taken from the 'target' Q network (which is updated intermittently). By doing so, we could simultaneously overcome the 'overestimation' problem of Q values while also avoiding instability in the target values.
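A minimal sketch of the Double DQN target (in PyTorch, again with toy stand-in networks) makes the split explicit: the online network picks the action, the target network supplies its value.

```python
import copy
import torch
import torch.nn as nn

GAMMA = 0.99
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 18))
target_net = copy.deepcopy(online_net)       # synced with online_net every c steps

def double_dqn_target(reward: torch.Tensor, next_state: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)        # selection: online net
        best_value = target_net(next_state).gather(1, best_action).squeeze(1)   # evaluation: target net
        return reward + GAMMA * best_value
```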
Dueling DQN
Until now, the Deep Learning models (here the term 'model' refers to its usage as in supervised learning, as opposed to the MDP model) that we covered were 'sequential' architectures (sequential architectures and sequential models may have different meanings in deep learning). In these models, all the neurons in any particular layer can be connected only to the neurons in the one layer before and the one layer after their own. In other words, no branches or loops exist in these model architectures.
Though both DQN and Double DQN had two Q networks, there was only one deep learning model, and the other (target) network's values were a periodic copy of the active (online) network's values. In Dueling DQN we have a non-sequential deep learning architecture in which, after the convolutional layers, the model branches into two different streams (sub-networks), each having its own fully-connected and output layers. The first of these two branches/networks corresponds to the Value function, which is used to 'estimate' the value of a given state, and has a single node in its output layer. The second branch/network is called the 'Advantage' network, and it computes the value of the 'advantage' of taking a particular action over the base value of being in the current state.
Figure 3 - Schematic - Dueling Q Network
But the Q function in Dueling DQN still represents the Q function of a typical Q Learning algorithm, and thus the Dueling DQN algorithm should conceptually work in the same way that a typical Q Learning algorithm works, by estimating the absolute action values or Q estimates. So somehow we need to estimate the action-values/Q estimates as well. Remember that the action-value is the absolute value of taking a given action in a given state. So, if we could combine (add) the output of the state's base value (from the first network/branch) and the incremental 'advantage' values of the actions from the second ('advantage') network/branch, then we could essentially estimate the action-values or Q values as required in Q Learning. This could be represented mathematically as below:
\[ Q_{(s, a; \theta, \alpha, \beta)} = V_{(s; \theta, \beta)} + \left( A_{(s, a; \theta, \alpha)} - \max_{a' \in |A|} A_{(s, a'; \theta, \alpha)} \right) \] --- (3)
In equation (3) above, the terms Q, V, s, a, and a′ have the same meaning as discussed earlier in this book. Additionally, the term A denotes the advantage value. θ represents the parameter vector of the convolutional layers, which are common to both the 'Value' network and the 'Advantage' network; α represents the parameter vector of the 'Advantage' network; and β represents the parameter vector of the 'State-Value' network. Since we have entered the domain of function approximators, the values of any network are denoted with respect to the parameters of the 'estimating' network, to distinguish between values/estimates of the same variable estimated by different estimating functions.
In simple terms, the equation means that the Q value for a given state-action combination (the subscripts θ, α, β of Q indicate that the Q estimates are computed by an estimating model which is a function of the three parameter sets θ, α, β) is equal to the value of that state, or the absolute utility of being in that state, as estimated by the state-value (V) network (the subscripts θ, β of V denote that the state values come from an estimation function with parameters θ, β), plus the incremental value or 'advantage' (the subscripts θ, α of A denote that the advantage is derived from an estimation function with parameters θ, α) of taking that action in that state. The last part of the equation provides the corrections necessary for 'identifiability'.
Let us spend a moment to understand the 'identifiability' part in greater detail. From the simple and intuitive explanation we discussed above, equation (3) could have been as simple as below:
\[ Q_{(s, a; \theta, \alpha, \beta)} = V_{(s; \theta, \beta)} + A_{(s, a; \theta, \alpha)} \] --- (4)
But the problem with this simple construct in equation (4) is that, though we could get the value of Q (the action value) provided the values of V and A are given, the converse is not true. That is, we could not 'uniquely' recover the values of V and A from a given value of Q. This is called 'unidentifiability'. The last part of the equation (in equation (3)) solves this 'unidentifiability' problem by providing 'forward-mapping'.
A better modification of equation (3) is provided below as equation (5). In equation (5), the last part of equation (3) is slightly modified to subtract the mean of the advantages instead of their maximum. Though subtracting a constant makes the values slightly off-target, it does not affect the learning much, as the value comparison between actions is still intact. Moreover, the equation in this form adds to the stability of the optimization.
\[ Q_{(s, a; \theta, \alpha, \beta)} = V_{(s; \theta, \beta)} + \left( A_{(s, a; \theta, \alpha)} - \frac{1}{|A|} \sum_{a'} A_{(s, a'; \theta, \alpha)} \right) \] --- (5)
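A minimal sketch of the dueling head implementing equation (5) (in PyTorch, with assumed layer sizes) takes the shared convolutional features, runs them through separate value and advantage streams, and combines the two with the mean-advantage correction:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling streams on top of shared features, combined as in equation (5)."""
    def __init__(self, in_features: int = 256, n_actions: int = 18):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                        # V(s): shape (batch, 1)
        a = self.advantage(features)                    # A(s, a): shape (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))    # Q(s, a) per equation (5)
```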
Summary
General Artificial Intelligence, or the idea of a single algorithm or system learning and excelling at multiple, seemingly different tasks simultaneously, has been the ultimate goal of Artificial Intelligence. General AI is a step towards enabling machines and agents to reach the human-level intelligence of adapting to different scenarios by learning new skills.
The DQN paper by DeepMind claims to make progress towards creating a system that could learn the essential skills required by 49 different Atari games, even surpassing the best human scores in many of these games.
Though DQN is very potent and could surpass human-level performance in many games, as demonstrated by its success on standardized Atari environments, it has its own shortcomings. There are many enhancements that could be employed to overcome them. The Double DQN and the Dueling DQN both use two different Q networks instead of the single Q network used in DQN, and each aims at overcoming the shortcomings of DQN in a slightly different manner.
The Dueling DQN also brings in the concept of advantage, or the incrementally higher utility of taking an action over the state's base value. The concept of advantage will be explored further in some of the other algorithms we will cover in this book.
References
DQN, deepmind.com, url: https://deepmind.com/research/dqn/, Accessed: Aug 2018.
AlphaGo, deepmind.com, url: https://deepmind.com/research/alphago/, Accessed: Aug 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin A. Riedmiller, "Playing Atari with Deep Reinforcement Learning", CoRR, arXiv: 1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg and Demis Hassabis, "Human-level control through deep reinforcement learning", Nature, vol. 518, pp. 529-533, doi: 10.1038/nature14236, 2015.
Tom Schaul, John Quan, Ioannis Antonoglou, David Silver, "Prioritized Experience Replay", CoRR, arXiv: 1511.05952, 2015.
Lin, Long-Ji, "Reinforcement Learning for Robots Using Neural Networks", PhD Thesis, Carnegie Mellon University, 1992.
Takuma Seno, "Welcome to deep reinforcement learning", towardsdatascience.com, url: https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b, Accessed: Aug 2018.
Arthur Juliani, "Simple reinforcement learning with tensorflow", medium.com, url: https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df, Accessed: Aug 2018.
Hado van Hasselt, Arthur Guez, David Silver, "Deep Reinforcement Learning with Double Q-learning", CoRR, arXiv: 1509.06461, 2015.
Ziyu Wang, Nando de Freitas, Marc Lanctot, "Dueling Network Architectures for Deep Reinforcement Learning", CoRR, arXiv: 1511.06581, 2015.