Deep Q Network (DQN), Double DQN, and Dueling DQN
A step towards General Artificial Intelligence
Abstract
In this chapter we take our first step towards Deep Learning based Reinforcement Learning. We discuss the very popular Deep Q Network (DQN) and its powerful variants, Double DQN and Dueling DQN. Extensive work has been done on these models, and they form the basis of some very popular applications such as AlphaGo. We also introduce the concept of General AI in this chapter and discuss how these models have been instrumental in inspiring hopes of achieving General AI through Deep Reinforcement Learning applications.
General Artificial Intelligence
Until now, the Reinforcement Learning agents that we studied may be considered to fall under the category of Artificial Intelligence agents. But is there something beyond Artificial Intelligence as well? In Chapter 1, while discussing what could be called real 'Intelligence', we stumbled upon the idea of 'Human-Like' behavior as a benchmark for evaluating the degree of 'Intelligence'. Let's spend a moment discussing what human, or human-like, intelligence is capable of.
Let us take up our theme of games again for this discussion, so that all readers can relate. Many who grew up from the '80s onward may have started with video games like 'Mario', then 'Contra', then explored popular FPS (First Person Shooter) games like 'Half-Life', followed by 'Arcade'-style games like 'Counter-Strike', and are now hooked on the more recent 'Battle-Royale' genre of games like 'PUBG' and 'Fortnite'. It may take someone a day, or maybe a week, to gain a decent level of proficiency in any of these games, even while moving from one game to another, and at times playing more than a single game simultaneously. As humans, even the avid game enthusiasts among us do a lot of things besides playing games, and we can gain increasingly better proficiency at all of them with the same 'mind' and the same 'Intelligence'. This concept, where a single architecture and model of intelligence can be used to learn different, even seemingly unrelated, problems is called 'General Artificial Intelligence'.
Until recently, Reinforcement Learning agents were hand-crafted and tuned to perform individual, specific tasks. For example, various scholars have experimented with innovative agents and mechanisms to get better at the game of 'Backgammon'. More recently, with AI Gym and some other initiatives opening their platforms to Reinforcement Learning academicians and enthusiasts to work on standardized problems (in the form of exposed standard environments) and compare their results and enhancements with the community, there have been several papers and informal competitions in which researchers propose innovative algorithms and other enhancements to generate better scores/rewards in a specific Reinforcement Learning environment. So essentially, from one evolution of agents to the next, the Reinforcement Learning agents and the algorithms that empower them keep getting better at a particular task. These specific tasks may range from solving a particular environment of the 'AI Gym', like playing 'Backgammon' or balancing the 'Cart-Pole', to others. But the concept of 'General Artificial Intelligence' has still remained elusive.
But things are changing now with the evolution of 'Deep Reinforcement Learning'. As we discussed in an earlier chapter, 'Deep Learning' has the capability of intelligently self-extracting important features from the data, without requiring human/SME involvement in hand-crafting domain-specific features. When we combine this ability with the self-acting capability of Reinforcement Learning, we come closer to realizing the idea of 'General Artificial Intelligence'.
An Introduction to 'Google DeepMind' and 'AlphaGo'
Researchers at Google's 'DeepMind' ('DeepMind' was acquired by Google some time back) developed the algorithm called the Deep Q Network that we will be discussing in detail in this chapter. They combined the Q Learning algorithm from Reinforcement Learning with ideas from Deep Learning to create Deep Q Networks (DQN). A single DQN program could teach itself how to play 49 different games from the 'Atari' titles ('Atari' was a very popular gaming console in the era of the '80s and beyond, with many game titles with graphical interfaces) and excel at most of them simultaneously, even beating the best human scores for most of these titles, as shown in the figure below.
Figure 1 - Normalized performance of DQN vs. human gamer, computed as 100 × (DQN score − random-play score) / (human score − random-play score), for games where DQN performed better than the human gamer. (Ref: DQN-Nature-Paper)
Algorithms similar to the DQN also powered DeepMind's famous 'AlphaGo' program. AlphaGo was the first program to consistently and repeatedly defeat the best human adversaries at the game of 'Go'. For readers unfamiliar with 'Go': if 'Chess' is considered a game that challenges human intelligence, planning, and strategizing capabilities to a significant level, then the game of 'Go' is considered to take these challenges many notches higher. The number of possible board configurations in the game of 'Go' is even greater than the number of atoms in the observable universe, and hence it requires the best of human intelligence, thinking, and planning capabilities to excel at this game.
With a Deep Reinforcement Learning agent consistently defeating even the best human adversaries in these games, and with many other similar Deep Reinforcement Learning algorithms consistently defeating their respective human adversaries in at least 49 other gaming instances, as claimed in multiple comparative studies using standardized games/environments, we could assume that the advancements in the area of 'Deep Reinforcement Learning' are leading us towards the concept of 'General Artificial Intelligence' as described in the previous section.
The DQN Algorithm
The term 'Deep' in 'Deep Q Networks' (DQN) refers to the use of deep 'Convolutional Neural Networks' (CNN) in the DQNs. Convolutional Neural Networks are deep learning architectures inspired by the way the visual cortex of the human brain works to understand the images that the sensors (eyes) receive. We mentioned in Chapter 1, while discussing state formulations, that for image/visual inputs the state could either be humanly abstracted or the agent could be made intelligent enough to make sense of these states itself. In the former case, a separate human-defined algorithm for understanding the objects in an image, their specific instances, and the position of each instance is custom trained, and the agent is fed this simplified data as an input to form a simplified state to work on. In the latter case, we also discussed a way to enable the Reinforcement Learning agent itself to simplify the state of raw image pixels and draw intelligence from it. We also briefly discussed the role of CNNs (Convolutional Neural Networks) there.
CNNs contain layers of convolutional neurons, and within each layer there are different kernels (functions) that cover the image in different strides. A 3 × N × N dimensional input image (here a 3 × N × N input means an image with 3 color channels, each of N × N pixels), when passed through a convolutional layer, may produce multiple convolution maps of lower dimension than the N × N input size of each channel, but each resulting map uses the same weights for its kernel. Since the weights of a kernel remain the same across a layer, only a single weight vector per kernel needs to be optimized; hence, for bringing out the key features in an image, a CNN is more efficient than a DNN (MLP based Deep Neural Network) counterpart delivering similar accuracy. But the output of a CNN is a multi-dimensional tensor, which is not effective for feeding into a subsequent classification or regression (value estimation) model. Therefore, the last convolutional layer of a CNN is connected to one or more flat layers (similar to the hidden layers of a DNN) before being fed into (mostly) a 'SoftMax' activation layer for classification or (generally) a 'Linear' activation layer for regression. The 'SoftMax' activation layer produces the class probabilities for each class for which classification is required, and choosing the output class with the highest class probability (argmax) determines the best action.
The DQN network contains a CNN as described above. The specific DQN that we mentioned in the earlier section on General Artificial Intelligence, which performed well on 49 Atari titles simultaneously, used an architecture having a CNN with 2 convolutional layers, followed by 2 fully-connected layers, terminating in an 18-class 'SoftMax' output. These 18 classes represent the 18 actions possible from an Atari controller (Atari had a single 8-direction joystick and just one button for all the games) that the game input could act on. These 18 classes (as used in the specific DQN by DeepMind for Atari) are: Do-Nothing (i.e., don't do anything); 8 classes representing the 8 directions of the joystick (Move-Straight-Up, Move-Diagonal-Right-Up, Move-Straight-Right, Move-Diagonal-Right-Down, Move-Straight-Down, Move-Diagonal-Left-Down, Move-Straight-Left, Move-Diagonal-Left-Up); Press-Button (alone, without moving); and another 8 actions corresponding to simultaneously pressing the button and making one of the joystick movements.
At every instance that the agent is required to act (such action instances may not correspond one-to-one with every step, as we will discuss later), the agent chooses one of these actions (note that one of them is Do-Nothing). The figure below shows an illustrative architecture of the Deep Learning model culminating in the required 18 action classes.
Figure 2- DQN CNN Schematic (Ref: DQN-Nature-Paper)
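To make the architecture described above more concrete, the following is a minimal PyTorch sketch of a DQN-style network with 2 convolutional layers, 2 fully-connected layers, and 18 outputs, one per Atari action. The filter counts, kernel sizes, 84 × 84 input resolution, and hidden size are illustrative assumptions rather than the exact published configuration, and the head here is a plain linear layer whose argmax plays the role of selecting the action.

import torch
import torch.nn as nn

class DQNNet(nn.Module):
    # Sketch of a DQN-style network: 2 conv layers, 2 fully-connected layers,
    # and 18 outputs (one per Atari controller action). Sizes are illustrative.
    def __init__(self, n_actions: int = 18, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4),  # 4 stacked 84x84 frames in
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),   # 84x84 -> 20x20 -> 9x9 feature maps
            nn.ReLU(),
            nn.Linear(256, n_actions),    # one estimate per action
        )

    def forward(self, x):
        return self.head(self.features(x))

# picking the best action for a single (stacked) state
net = DQNNet()
state = torch.zeros(1, 4, 84, 84)              # batch of one 4-frame state
best_action = net(state).argmax(dim=1).item()  # index in [0, 17]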
The motivation for this book is to enable the reader to build their own real-life RL agent. Since the Atari-based agent might not be the most suitable model for other applications, we may have to change the CNN configuration and the structure of the output layer for the specific use case and domain we are implementing it for.
Atari gives a 60 FPS video output, meaning that every second the game generates and displays/sends 60 images. This is the signal that we could use as input states for our agent. One drawback of using raw image pixels and working directly with all consecutive frames at such a high frame rate to train a Q-Learning network is that the training may not be very stable. Not only might the training take a lot of time to converge, but at times, instead of converging, the loss function may actually diverge or get stuck in a hunting loop. To overcome these challenges while working with high frame-rate, high-dimension, correlated image data, the DQN had to implement the following enhancements to ensure decent convergence and practical applicability.
Experience Replay
It is important to understand the concept of the 'Experience Trail' before we discuss the 'Experience Replay' enhancement. In Chapter 4, while discussing Q Learning, we referred to the quadruple (state, action, reward, next-state) as an 'experience' data instance used to train Q Learning's Action-Value/Q function. In the 'Experience Trail', the term 'experience' is exactly that same experience instance, i.e., a tuple (state, action, reward, next-state), or (s, a, r, s′) in abbreviated form. Now let us discuss the convergence problem we briefly touched upon in the earlier section in greater detail, to understand why a 'trail' of such experience instances is required.
When using a graphical feed as input to our Reinforcement Learning agent, we get numerous frames of raw pixels in quick succession. Also, since these frames arrive in sequence, there is very high correlation among consecutive input frames. The updates that occur to the Q function values during training are very sensitive to the number of times the algorithm encounters a particular experience instance. In the basic Q Learning algorithm, the action-values/Q function is updated at every step (we will see in a later sub-section that DQN changes this as well, for the same reason). Consecutively seeing similar experience instances very frequently causes the weights of the Q network to be updated in a very specific direction. Such biases in training may lead to the formation of local 'ravines' in the loss function's parameter space. Such 'ravines' are very difficult to maneuver for simple optimization algorithms, and hence such biases from ingesting multiple similar experience instances slow down or hinder the optimization of the cost function. It may be difficult to optimize such a loss function under these challenges, and training such a Q network may warrant the use of very complex optimizers.
Therefore, in the 'Experience Trail', the experience tuples are not used directly, in the order they are generated by the source system (in our case the Atari system), to train the agent. Instead, all experience instance tuples generated by the source system are collected in a memory buffer (mostly of fixed size). This memory buffer is updated with new experience instances as they are received, as a queue, i.e., in first-in, first-out order. So as soon as the memory buffer reaches its limit, the oldest experience instances are deleted to make way for the latest ones. From this pool/buffer of experience instances, the experience tuples are picked randomly to train the agent. This process is known as 'Experience Replay'.
'Experience Replay' not only solves the problems arising from the use of a concurrent sequence of experiences for training the Q network, as we discussed earlier, but also limits the similar-frames problem, as only a few frames from a concurrent sequence are likely to be picked in a random draw.
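A minimal sketch of such an experience-replay buffer is given below; the capacity and batch-size values are arbitrary illustrative choices, not values from the original DQN.

import random
from collections import deque

class ReplayBuffer:
    # Fixed-size FIFO store of (s, a, r, s') experience tuples,
    # sampled uniformly at random for training.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # a random draw breaks the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)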
Prioritized Experience Replay
'Prioritized Experience Replay' is an enhancement to the Experience Replay mechanism used in the base/original DQN algorithm that outperformed humans in 49 Atari games. A DQN with Prioritized Experience Replay was able to outperform the original DQN with 'uniform' Experience Replay in 41 of the 49 games where the original DQN had outperformed human gamers.
In the basic 'Experience Replay' enhancement that we discussed earlier, we learnt that 'all' experience instances are stored in the experience trail in the same order they are received, and that the buffered experience instances are selected 'randomly' for training. As the name suggests, in Prioritized Experience Replay we would like to introduce some sort of priority into this replay process.
There are two modes of prioritizing experience instances from the experience trail. The first mode is to 'prioritize' which input experience instances received from the source system are stored in the experience trail, from where they could then be picked at random for 'replay'. Alternatively, the second mode is to buffer all the experience instances as they are generated by the source system and then prioritize which specific experience instances are selected for replay from this unprioritized storage.
The 'Prioritized Experience Replay' enhancement chooses the second mode. Once we have determined the prioritization mode as prioritized replaying from an unprioritized experience trail (storage), the second decision is the specific criterion for prioritizing experience instances for replay. For this, the 'Prioritized Experience Replay' algorithm uses the Temporal Difference error 'δ' as the criterion to prioritize specific experience instances for replay in subsequent training iterations. So, unlike the original ('uniform') Experience Replay method, where every experience instance ((s, a, r, s′) tuple) has a uniform probability of being selected for training, the Prioritized Experience Replay variant gives relatively higher priority to samples that produced a larger TD error 'δ'.
So, the priority of a given experience tuple is given as:
\[ p_i = |\delta_i| + e \] --- (1)
Here e is a small constant added to avoid a zero selection probability for any available sample in the experience trail. One problem with this formulation is that, while such prioritization is good in the initial phases of training, later on, when the agent has predominantly learnt from some specific experiences repeatedly, it develops biases towards those experience instances. This leads to over-fitting of the agent's model and associated nuances. To avoid this pitfall, the above equation is modified slightly and a stochastic formulation is applied to it to add some randomness and avoid a completely greedy solution. This is done as below:
\[ P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}} \] --- (2)
Equation (2) lets us control how the probability of sampling a particular experience instance is determined. The sampling process can range from purely random, to purely greedy, to anything in between. This control is exercised through a single parameter: the sampling probability is the priority of a transition, as defined in the earlier equation, raised to the power 'α' and normalized over all transition priorities, each also raised to the power 'α'. Here 'α' is a constant that determines the greediness of the sampling process and can be set to any value between 0 and 1. A process with α = 0 implies no prioritization and has the effect of uniform sampling, leading to results similar to the original unprioritized Experience Replay algorithm. Conversely, a process with α = 1 amounts to extreme prioritized replay of the samples with large TD errors throughout training, with the associated biases we discussed earlier.
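A small NumPy sketch of this stochastic prioritization follows; the values chosen for α and for the small constant e (eps below), as well as the example TD errors, are illustrative assumptions.

import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-2):
    # p_i = |delta_i| + e                 ... equation (1)
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + eps
    # P(i) = p_i^alpha / sum_k p_k^alpha  ... equation (2)
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# drawing a mini-batch of indices from the (unprioritized) experience trail
td_errors = [0.5, 0.01, 2.0, 0.3]            # hypothetical TD errors of stored samples
probs = per_probabilities(td_errors)
batch = np.random.choice(len(td_errors), size=2, replace=False, p=probs)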
Skipping Frames
A further enhancement addressing the bias caused by training on frequent, multiple, consecutive, similar frames, as discussed above, is that we do not pick all 60 frames generated per second for training. In the DQN trained for 'Atari', 4 consecutive frames were combined to make the data pertaining to one state. This also reduces computational cost without losing much information. Assuming the games are designed around human-reactable times between key events, there would be a lot of correlation between successive frames if the feed is at 60 fps. The figure of 4 frames out of every 60 is not a hard rule; in practical applications in one's own domain, this number can be adjusted depending on the requirements of the specific use case, the input frequency, and the correlation among consecutive frames, as sketched below.
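As a rough sketch of this idea, the snippet below keeps only a few frames out of every 60 and stacks 4 of them into one state, following the simplified description above; the exact skip ratio is an assumption to be tuned per use case.

from collections import deque
import numpy as np

class FrameStacker:
    # Keep only every `skip`-th raw frame and return the last `stack`
    # retained frames as one state; intermediate frames are dropped.
    def __init__(self, skip=15, stack=4):        # roughly 4 kept frames out of every 60
        self.skip, self.stack = skip, stack
        self.count = 0
        self.frames = deque(maxlen=stack)

    def push(self, frame):
        self.count += 1
        if self.count % self.skip != 0:
            return None                          # drop highly correlated in-between frames
        self.frames.append(frame)
        if len(self.frames) < self.stack:
            return None                          # not yet a full state
        return np.stack(self.frames, axis=0)     # shape: (stack, H, W)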
Additional Target Q-Network
One major change that Deep Q Networks make over the basic Q Learning algorithm is the introduction of a new 'Target Q-Network'. While discussing Q-Learning in Chapter 4, we referred to the term \( r + \gamma \max_{a'} Q(s', a') \) in the equation for the Q Function update (equation [4].(7)) as the 'target'. The complete equation is given below for reference.
\[ Q(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \] --- (4.7)
So essentially, in this equation, the Q Function Q(s, a) is referenced twice, and each reference serves a different purpose. The first reference, i.e., (1 − α) Q(s, a), mainly retrieves the present state-action value so as to update it, and the second reference obtains the 'target' value from the Q value of the next state-action pair, i.e., r + γ max_a′ Q(s′, a′). Though in the basic Q Learning algorithm both of these Q Functions/Networks (or Q-Tables in the case of a tabular Q-Learning approach) were the same, it need not always be so.
In DQN, the 'target' Q network is different from the one that is being continuously updated at every step. This is done to overcome the drawbacks of using the same Q network both for continuous updates and for providing the target values. These drawbacks stem mainly from two reasons. The first, as we highlighted earlier in the section on the basic DQN, relates to delayed/sub-optimal convergence in the case of too-frequent, highly-correlated training data; note that if the targets for training come from the same network being trained, they are bound to be correlated with its estimates. The other reason is that it is not a good idea to use target values from the same function to correct its own update: when a function is used to update its own estimates, it may lead to an 'unstable', moving target.
Thus, it is found that using two different Q networks for these two different purposes enhances the stability of training. But if a target action-value is required for training, and this target never updates after initialization (which, as we learnt, could even be all zeros in the case of Q Learning, since it is an off-policy algorithm), then the 'active' (actively updated/estimated) network could not be trained effectively. Therefore, the 'target' Q network is synced with the actively updated Q network once every 'c' steps. For the 'Atari' problem, the value of 'c' was fixed to 1000 steps.
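A minimal sketch of this target-network mechanism is shown below. The tiny fully-connected network stands in for the convolutional Q network, and the discount factor is an assumed value; only the sync interval c = 1000 comes from the text above.

import copy
import torch
import torch.nn as nn

online_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 18))
target_net = copy.deepcopy(online_net)        # separate copy used only for target values
SYNC_EVERY = 1000                             # 'c' steps between syncs
gamma = 0.99                                  # assumed discount factor

def td_target(rewards, next_states):
    # r + gamma * max_a' Q_target(s', a'), with no gradients through the target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q

def maybe_sync(step):
    if step % SYNC_EVERY == 0:                # copy online weights into the target network
        target_net.load_state_dict(online_net.state_dict())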
Clipping Rewards and Penalties
Although this is not a very significant change when considering training and deployment for a single application, when viewed from the perspective of developing a system for 'General Artificial Intelligence', the mechanism for accumulating rewards and penalties needs to be balanced. Different games (and real-domain skills) may have different scoring systems. Some games may offer a relatively low absolute score even for a very challenging task, while others may be too generous in giving absolute rewards (scores). For example, in a game like 'Mario' it is easy to get scores in the range of hundreds of thousands of points, whereas in a game like 'Pong' the player gets just a single point for defending an entire game.
Since Reinforcement Learning, and especially the idea of 'General Artificial Intelligence', is inspired by the human mind's ability to learn different skills, let's analyze our own constitution to understand this concept better. Humans, and for that matter most animals, learn different habits and stereotyped behaviors by a process called reinforcement, which is also the basis of the Reinforcement Learning that we have modeled for machines to become adept at different skills. Since Reinforcement Learning requires a reward to 'reinforce' any behavior, our mind should also work on receiving some reward to reinforce and learn a behavior. In humans, the sense of reward is produced by the release of a chemical called 'dopamine', which reinforces the particular behavior that acted as a trigger for it. If you are curious why you get so addicted to your mobile and want to click on every notification, social media feed, and shopping app, to the point that they start controlling you instead of you controlling them, you can blame the dopamine response. Similarly, addictions to substances ranging from drugs to sugary foods are governed by the reinforcement caused by the release of dopamine, which serves as the reward mechanism that makes us reinforce certain behaviors.
Since the body's dopamine-producing capability is limited, an automatic scaling and clipping effect is realized across the different activities we do. When this dopamine response system is altered externally/chemically, for example by the consumption of drugs, it leads to withdrawal from the other activities that bring meaning to life, leading to unstable behavior and outcomes.
To achieve a similar reward/penalty scaling and clipping effect in the DQN used for the 'Atari' games, all rewards across all games were clipped to +1 and all penalties to −1. Since rewards are key to reinforcement training and vary widely across applications, readers are encouraged to devise their own scaling and clipping techniques for their respective use cases and domains.
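In code, the clipping used for the Atari setup amounts to nothing more than the following sketch:

def clip_reward(raw_reward):
    # every positive score becomes +1 and every negative score becomes -1,
    # so that wildly different scoring scales look alike to the agent
    if raw_reward > 0:
        return 1.0
    if raw_reward < 0:
        return -1.0
    return 0.0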
Double DQN
In situations like those that warrant Deep Reinforcement Learning, the state space and state size are generally extremely large, and it may take the agent a long time to learn sufficient information about the environment and ascertain which states/actions lead to the most optimal instantaneous or total rewards. Under these conditions, the exploration opportunities may be overwhelmed (especially in the case of constant-epsilon algorithms), and the agent may get stuck exploiting the already-explored and estimated state-action combinations that have relatively high values, even if these are not the highest values possible among the not-yet-explored ones. This may lead to 'overestimation of the Q-Values' for some of these state-action combinations, leading to sub-optimal training.
In the earlier section we discussed splitting the Q network into two different Q networks, one being the online/active network and the other being the target Q network whose values are used as a reference. We also discussed that the target Q network is not updated very frequently, but only after a certain number of steps. The overestimation problem highlighted above may become even more significant if actions are taken on the basis of a Q network (the target Q network) whose values are not frequently updated (since they are copied from the active 'online' Q network only after every thousand or so steps).
We also discussed in the earlier section why it was important to split the Q network into the active and target Q networks, and the benefits of a dedicated target Q network. So we would like to continue using the 'target' Q network, as it offers better and more stable target values for the update. To combine the best of both worlds, the 'Double DQN' algorithm proposes to select the action on the basis of the 'online' Q network, but to use the value of the corresponding state-action pair from the 'target' Q network.
So, in Double DQN, at every step the values of all possible actions in the given state are read from the 'online' Q network, which is being updated continuously. Then an argmax is taken over these action values, and the action that maximizes the value is selected. But to update the 'online' Q network, the (target) value of this selected state-action combination is taken from the 'target' Q network (which is updated intermittently). By doing so, the Double DQN algorithm simultaneously overcomes the 'overestimation' problem of Q values while also avoiding instability in the target values.
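The following sketch shows this Double DQN target computation in PyTorch; online_net and target_net are assumed to be Q networks such as the ones sketched earlier, and the discount factor is an assumed value.

import torch

def double_dqn_target(rewards, next_states, online_net, target_net, gamma=0.99):
    with torch.no_grad():
        # the ONLINE network selects the action (argmax over its Q estimates) ...
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... but the TARGET network supplies the value of that selected action
        chosen_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * chosen_q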
Dueling DQN
Until now, the Deep Learning models (here the term 'model' refers to its usage as in supervised learning models, as opposed to the MDP model) that we covered were 'sequential' architectures (sequential architectures and sequential models may have different meanings in deep learning). In these models, all the neurons in any particular layer could be connected only to the neurons in the one layer before and the one layer after their own. In other words, no branches or loops existed in these model architectures.
Though both DQN and Double DQN have two Q networks, there is only one deep learning model; the other (target) network's values are a periodic copy of the active (online) network's values. In Dueling DQN, we have a non-sequential deep learning architecture in which, after the convolutional layers, the model branches into two different streams (sub-networks), each with its own fully-connected and output layers. The first of these branches/networks corresponds to the Value function, which is used to 'estimate' the value of a given state, and has a single node in its output layer. The second branch/network is called the 'Advantage' network, and it computes the value of the 'advantage' of taking a particular action over the base value of being in the current state.
Figure 3 - Schematic - Dueling Q Network
But the Q function in Dueling DQN still represents the Q function of a typical Q Learning algorithm, and thus the Dueling DQN algorithm should work, conceptually, in the same way as typical Q Learning, by estimating the absolute action values or Q estimates. So somehow we need to estimate the action-value/Q estimates as well. Remember, the action-value is the absolute value of taking a given action in a given state. So, if we could combine (add) the output of the state's base value (from the first network/branch) and the incremental 'advantage' values of the actions (from the second, 'advantage' network/branch), then we could essentially estimate the action-values or Q values as required in Q Learning. This could be represented mathematically as below:
\[ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a' \in |\mathcal{A}|} A(s, a'; \theta, \alpha) \right) \] --- (3)
In equation (3) above, the terms Q, V, s, a, and a′ have the same meaning as discussed earlier in this book. Additionally, the term A denotes the advantage value. 'θ' represents the parameter vector of the convolutional layers, which are common to both the 'Value' network and the 'Advantage' network; 'α' represents the parameter vector of the 'Advantage' network; and 'β' represents the parameter vector of the 'State-Value' network. Since we have entered the domain of function approximators, the values of any network are denoted with respect to the parameters of the 'estimating' network, to distinguish between values/estimates of the same variable produced by different estimating functions.
In simple terms, the equation means that the Q value for a given state-action combination (the subscripts θ, α, β indicate that the Q estimate is computed by a model parameterized by θ, α, and β) is equal to the value of that state, or the absolute utility of being in that state, as estimated by the state-value (V) network (parameterized by θ and β), plus the incremental value or 'advantage' (estimated by the advantage network parameterized by θ and α) of taking that action in that state. The last part of the equation provides the necessary correction to ensure 'identifiability'.
Let us spend a moment to understand this 'identifiability' in greater detail. Equation (3), from the simple and intuitive explanation we discussed above, could have been as simple as:
\[ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha) \] --- (4)
But the problem with the simple construct in equation (4) above is that, though we could get the value of Q (the action value) given the values of V and A, the converse is not true; that is, we could not 'uniquely' recover the values of V and A from a given value of Q. This is called 'unidentifiability'. The last term of equation (3) solves this 'unidentifiability' problem by forcing the advantage of the selected (argmax) action to zero, so that V and A can be uniquely recovered.
A better modification of equation (3) is provided below as equation (5). In equation (5), the last term of equation (3) is slightly modified: the max over the advantages is replaced by their mean. Though subtracting this quantity makes the values slightly off-target, it does not affect learning much, as the relative comparison between actions is still intact. Moreover, the equation in this form adds to the stability of the optimization.
\[ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \] --- (5)
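A minimal PyTorch sketch of the two-stream dueling head implementing equation (5) is given below; the feature and hidden-layer sizes are illustrative assumptions, and the shared convolutional feature extractor is assumed to sit in front of this head.

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    # Two streams on top of shared (convolutional) features:
    # a state-value stream V(s) and an advantage stream A(s, a),
    # combined with the mean-advantage correction of equation (5).
    def __init__(self, feature_dim=256, n_actions=18):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))             # single node: V(s)
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                       nn.Linear(128, n_actions)) # one node per action: A(s, a)

    def forward(self, features):
        v = self.value(features)                        # shape (batch, 1)
        a = self.advantage(features)                    # shape (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))    # Q(s, a) as in equation (5)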
Summary
General Artificial Intelligence, the idea of a single algorithm or system learning and excelling at multiple, seemingly different tasks simultaneously, has been the ultimate goal of Artificial Intelligence. General AI is a step towards enabling machines and agents to reach the human-level intelligence of adapting to different scenarios by learning new skills.
The DQN paper by DeepMind claims to make progress towards creating a system that could learn the skills required by 49 different types of Atari games, even surpassing the best human scores in many of these games.
Though DQN is very potent and could surpass human-level performance in many games, as demonstrated by its success on standardized Atari environments, it has its own shortcomings, and many enhancements can be employed to overcome them. Double DQN and Dueling DQN both split the Q estimation into two parts (two networks, or two streams within one network) instead of relying on a single monolithic Q estimator as in DQN, and both aim at overcoming the shortcomings of DQN, though in slightly different manners.
Dueling DQN also brings in the concept of advantage, the incrementally higher utility of taking an action over the base state's absolute value. The concept of advantage will be explored further in some of the other algorithms we cover in this book.
References
DQN, deepmind.com, url: https://deepmind.com/research/dqn/, Accessed: Aug 2018.
AlphaGo, deepmind.com, url: https://deepmind.com/research/alphago/, Accessed: Aug 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin A. Riedmiller, "Playing Atari with Deep Reinforcement Learning", CoRR, arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg and Demis Hassabis, "Human-level control through deep reinforcement learning", Nature, vol. 518, pp. 529-533, doi: 10.1038/nature14236, 2015.
Tom Schaul, John Quan, Ioannis Antonoglou, David Silver, "Prioritized Experience Replay", CoRR, arXiv:1511.05952, 2015.
Lin, Long-Ji, "Reinforcement Learning for Robots Using Neural Networks", PhD Thesis, Carnegie Mellon University, 1992.
Takuma Seno, "Welcome to deep reinforcement learning", towardsdatascience.com, url: https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b, Accessed: Aug 2018.
Arthur Juliani, "Simple reinforcement learning with tensorflow", medium.com, url: https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df, Accessed: Aug 2018.
Hado van Hasselt, Arthur Guez, David Silver, "Deep Reinforcement Learning with Double Q-learning", CoRR, arXiv:1509.06461, 2015.
Ziyu Wang, Nando de Freitas, Marc Lanctot, "Dueling Network Architectures for Deep Reinforcement Learning", CoRR, arXiv:1511.06581, 2015.