Research Article
Ramp Metering for a Distant Downstream Bottleneck Using
Reinforcement Learning with Value Function Approximation
Yue Zhou,1 Kaan Ozbay,1 Pushkin Kachroo,2 and Fan Zuo1

1C2SMART Center, New York University, NYU Civil Engineering, 6 Metrotech Center, Brooklyn 11201, NY, USA
2Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, 4505 S. Maryland Pkwy, Las Vegas 89154-4026, NV, USA
Correspondence should be addressed to Yue Zhou; zhouyue30@msn.com
Received 13 July 2020; Revised 19 August 2020; Accepted 18 September 2020; Published 28 October 2020
Academic Editor: Ruimin Li
Copyright © 2020 Yue Zhou et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Ramp metering for a bottleneck located far downstream of the ramp is more challenging than for a bottleneck that is near the ramp. This is because, under the control of a conventional linear feedback-type ramp metering strategy, by the time metered traffic from the ramp arrives at the distant downstream bottleneck, the state of the bottleneck may have changed significantly from when it was sampled for computing the metering rate, owing to the considerable time this traffic takes to traverse the long distance between the ramp and the bottleneck. As a result of such time-delay effects, significant stability issues can arise. Previous studies have mainly resorted to compensating for the time-delay effects by incorporating predictors of traffic flow evolution into the control systems. This paper presents an alternative approach. The problem of ramp metering for a distant downstream bottleneck is formulated as a Q-learning problem, in which an intelligent ramp meter agent learns a nonlinear optimal ramp metering policy such that the capacity of the distant downstream bottleneck is fully utilized but not exceeded, so that congestion is not caused. The learned policy is in pure feedback form in that only the current state of the environment is needed to determine the optimal metering rate for the current time. No prediction is needed, as anticipation of traffic flow evolution has been instilled into the nonlinear feedback policy via learning. To deal with the intimidating computational cost associated with the multidimensional continuous state space, the value function of actions is approximated by an artificial neural network rather than a lookup table. The mechanism and development of the approximate value function, and how learning of its parameters is integrated into the Q-learning process, are explained in detail. Through experiments, the learned ramp metering policy has demonstrated effectiveness, benign stability, and some level of robustness to demand uncertainties.
1. Introduction
A genuine motivation behind ramp metering strategies is to
reduce the total time spent within the freeway network of
interest [1]. Minimization of the total time spent can be
shown to be equivalent to maximizing time-weighted dis-
charging flow from the network, i.e., encouraging early
discharge of flow [1]. This motivation, combined with the
knowledge of traffic flow theory, implies that the objective of
a ramp metering strategy is to maintain the flow rate into the
most restrictive bottleneck of the network to be close to the
capacity of the bottleneck, but not to exceed it, so that
congestion will not be caused. This objective can be achieved
by regulating the traffic density (or occupancy) of the
bottleneck to stay close to the critical density (or critical
occupancy) through metering the ramp flow. This is the
principle behind many conventional linear feedback-type
ramp metering strategies, e.g., [2–5]. For this kind of ramp metering strategy, the control target bottleneck is usually
near the ramp, and in most cases, the bottleneck is incurred
by the merging of the mainline and ramp traffic itself. In
some other cases, however, the control target bottleneck is
located far away from the metered ramp, for example, a lane-
drop that is a few kilometers downstream. In these latter
cases, conventional linear feedback-type ramp metering strategies can exhibit poor stability due to the long
distance between the ramp and the bottleneck. Specifically,
when metered traffic from the ramp arrives at the distant downstream bottleneck, the traffic density (or occupancy) of the bottleneck may have significantly changed from when it was sampled for computing the metering rate. To overcome
this issue, many previous studies have resorted to com-
pensating for the time-delay effects by incorporating pre-
dictors of traffic flow evolution into the control systems.
This study presents an alternative approach. The pro-
posed approach formulates the problem of ramp metering
for a distant downstream bottleneck as a Q-learning
problem, in which an intelligent ramp meter agent learns an
optimal ramp metering policy such that the capacity of the
distant downstream bottleneck can be fully utilized but not exceeded, which would cause congestion. To the best of our knowledge, this is the first such effort in the literature. The learned policy
is in pure feedback form in that only the current state of the
environment is needed to determine the optimal metering
rate for the current time. No prediction is needed, as an-
ticipation of traffic flow evolution has been instilled into the
learned nonlinear feedback policy. To deal with the intim-
idating computational cost associated with the multidi-
mensional continuous state space of the formulated
Q-learning problem, the value function of ramp metering
rates is approximated by an artificial neural network (ANN),
rather than a lookup table.
In the remainder of this paper, Section 2 reviews pre-
vious studies in ramp metering for distant downstream
bottlenecks and Q-learning applications in freeway control.
Section 3 develops the proposed approach, including for-
mulation of the Q-learning problem with value function
approximation and the algorithm to solve the problem.
Section 4 evaluates the proposed approach by experiments.
Section 5 concludes this study.
2. Literature Review
2.1. Ramp Metering for a Distant Downstream Bottleneck.
Compared with the richness of the literature in ramp
metering strategies for bottlenecks near ramps, studies in
ramp metering for distant downstream bottlenecks are
far fewer. These studies include [6–13]. In [6], the notable
ALINEA strategy, which is a linear “proportional” control
strategy, was extended by adding to it an “integral” term,
resulting in the so-called PI-ALINEA strategy. The authors theoretically proved the stability of the PI-ALINEA strategy. Later, Kan et al. [7] evaluated the performance of PI-ALINEA in controlling a distant downstream bottleneck by
simulation. The simulation model employed was METANET [14], a second-order discrete-time macroscopic model of traffic flow dynamics. The simulation evaluation showed
that PI-ALINEA outperformed ALINEA in terms of sta-
bility. In [8], to deal with the time-delay effects of ramp
metering for distant lane-drop bottlenecks, the authors
incorporated a Smith predictor [15] into ALINEA and
termed the resulting strategy SP-ALINEA. Through simulation, they showed that the stability region of SP-ALINEA is much broader than that of PI-ALINEA. The sim-
ulation model employed by de Souza and Jin [8] was
the cell transmission model (CTM) [16], a first-order dis-
crete-time macroscopic model of traffic flow dynamics.
Similar to [8], Frejo and De Schutter [9] added a feedforward
term to ALINEA to incorporate anticipated evolutions of the
bottleneck density in order to improve the performance of
ALINEA. The resulting strategy is termed FF-ALINEA.
Similar to [8, 9], Yu et al. [10] coupled a predictor to an
extremum-seeking controller for controlling a distant
downstream lane-drop bottleneck by metering upstream
mainline flow. In [12, 13], fuzzy theory was applied to a
proportional-integral-derivative- (PID-) type ramp meter-
ing controller to learn the PID gains in real time. The resulting controller has the capability of anticipation and hence
performs better in controlling a distant downstream bot-
tleneck than a controller with fixed gains. Stylianopoulou
et al. [11] proposed a linear-quadratic-integral (LQI) reg-
ulator-type ramp metering strategy for controlling a distant
downstream bottleneck. Unlike all the studies summarized above, which only take measurements near the bottleneck, the controller in [11] utilizes measurements spread along the whole stretch between the ramp and the downstream bottleneck, so it has a better sense of traffic flow evolution along the stretch and hence possesses better stability and robustness.
2.2. Q-Learning Applications in Freeway Control.
Application of Q-learning to freeway control has been
widely studied. However, to the best of our knowledge, no effort has been made to apply Q-learning to ramp metering for distant downstream bottlenecks. Nonetheless, this section summarizes previous studies on Q-learning applications to ramp metering (RM) control for nearby bottlenecks and to variable speed limit (VSL) control. These studies are summarized in Table 1. Although this summary may not be exhaustive, it should cover most previous studies in freeway control by Q-learning approaches.
Among these studies, [18–22, 27, 28, 32] were concerned
with ramp metering. [23, 30, 31, 33] studied variable speed
limits (VSL). Ramp metering and variable speed limits were
jointly applied by [29]. [17, 24–26] simultaneously used
ramp metering and variable message signs (VMS) for dy-
namic routing. Most of these studies aimed to achieve one of
the following three objectives: minimization of the total time
spent by vehicles [17, 19, 27, 28, 31, 33], maximization of
early discharge of flow [24–26], and minimization of de-
viations of the traffic density of the control target section
from the critical density [20, 23, 29, 30]. As discussed in
Section 1, these three objectives are equivalent.
By the type of the applied Q-learning method, these
studies can be classified into two categories. The first cat-
egory consists of those that used lookup table methods, i.e.,
[17, 18, 20–31]; the second category includes those that
employed value function approximation-based methods,
i.e., [31–33].
Lookup table methods, also known as tabular methods
[34], as suggested by the name, maintain a lookup table that
stores the values for all state-action pairs (known as
Q-values). The Q-learning process can be viewed as the
process of updating the lookup table. Lookup table methods
can only handle discrete state-action pairs. They may also
deal with the continuous state space; however, the contin-
uous state space needs to be approximated (discretized) first
so that any continuous state the learning agent encounters
can be mapped to a representative discrete state that is
indexed in the lookup table. Most of the studies in Table 1
belong to lookup table methods.

Table 1: Summary of Q-learning applications in freeway control.

| Work | Control method | Lookup table method or value function approximation method | State variables | Action | Reward | Simulation model |
|---|---|---|---|---|---|---|
| [17] | RM-VMS | Lookup table | Speed, density, flow diversion splits | Increment in metering rate, increment in flow diversion split | Total time spent | Macro (METANET) |
| [18] | RM | Lookup table | Density of bottleneck, ramp queue length, ramp demand, current metering rate | Whether to increase, decrease, or not change the current metering rate | Outflow, ramp queue length | Macro (METANET) |
| [19] | RM | Lookup table | Not clear | Metering rates | Total time spent | Micro (VISSIM) |
| [20] | RM | Lookup table | Number of vehicles in mainline, number of vehicles entered from the ramp, ramp signal of the last step | Red/green signal | Deviation of density from critical density | Macro (not clear) |
| [21] | RM | Lookup table | Number of vehicles in the area of interest | Red/green signal | Not clear | Macro (not clear) |
| [22] | RM | Lookup table | Mainline speeds, ramp queue lengths, ramp metering signal status | Red/green signal | Ramp queue length, mainline average speed | Micro (VISSIM) |
| [23] | VSL | Lookup table | Densities of mainline and ramp | Speed limits | Deviation of density from critical density | Macro (CTM) |
| [24–26] | RM-VMS | Lookup table with state-space approximation by the cerebellar model articulation controller | Average speeds, occupancies, status of VMS and ramp, incident presence/absence | Increments in red phase length, VMS for routing | Time-weighted exit flow | Micro (Paramics) |
| [27, 28] | RM | Lookup table with state-space approximation by k-nearest neighbors | Density, ramp flow | Direct red phase lengths | Total time spent | Micro (Paramics) |
| [29] | RM-VSL | Lookup table with state-space approximation by k-nearest neighbors | Densities, ramp flow, average speeds, speed differences | Direct red phase lengths | Deviation from critical density | Micro (AnyLogic) |
| [30] | VSL | Lookup table with state-space approximation by k-nearest neighbors | Densities and speeds | Speed limits | Deviations of densities from critical density, times to collision | Micro (MOTUS) |
| [31] | VSL | Value function approximation by the neural network; lookup table with state-space approximation by tile coding | Current and predicted densities and speeds | Speed limits | Total time spent | Macro (METANET) |
| [32] | RM | Value function approximation by the deep neural network | Densities, ramp queue lengths, off-ramp presence/absence | Metering rates | Number of discharged vehicles | Macro (CTM) |
| [33] | VSL | Value function approximation by the deep neural network | Lane-specific occupancies in mainline and ramp | Lane-specific speed limits | Total time spent, bottleneck speed, emergency brake, emissions | Micro (SUMO) |

Since state variables of
freeway control problems are usually continuous, e.g., traffic
densities and ramp queue lengths, those studies that have
applied lookup table methods all have involved some kind of
state-space approximation. The simplest state-space ap-
proximation method is aggregation, which divides a con-
tinuous state space into discrete intervals that do not overlap
with each other. Many studies in Table 1 are of this kind, i.e.,
[17, 18, 20–23]. Some other studies employed more so-
phisticated methods, e.g., k-nearest neighbors, to approxi-
mate continuous state spaces. These studies include [24–31].
It is important to note that state-space approximation is
not primarily a tool for reducing the computational cost of
reinforcement learning. For a multidimensional continuous
state-space problem, the lookup table after state-space ap-
proximation can still be very large. Admittedly, if the state-
space approximation is made very coarse, the table size (and hence the computational cost) can be decreased, but at the expense of undermining the effectiveness of the learned policy. Such a difficulty is inherent to lookup table methods because they aim at directly updating the value of each state-
action pair, hence cannot avoid the curse of dimensionality
of the state space [35].
The above difficulty can be circumvented by introducing
value function approximation. A value function approxi-
mation-based reinforcement learning method uses a pa-
rameterized function to replace the lookup table to serve as
the approximate value function [34]. Consequently, the
reinforcement learning process entails learning the un-
known parameters of the approximate value function in-
stead of learning the values of state-action pairs. Compared
with the number of state-action pairs of a lookup table for a
(discretized) multidimensional continuous state-space
problem, the number of unknown parameters of an ap-
proximate value function is usually far smaller, which makes the learning computationally affordable. Only three studies in Table 1, i.e., [31–33], applied value function approximation-based reinforcement learning methods. The
approximate value functions used by these three studies were
all artificial neural networks.
An outstanding feature of reinforcement learning that
distinguishes it from supervised and unsupervised learning is
that, for reinforcement learning, data from which the intel-
ligent agent learns an optimal policy are generated from
within the learning process itself. Specifically, the intelligent
agent learns through a great number of interactions with the
environment which are enabled by simulation. Hence, sim-
ulation models play an important role in reinforcement
learning. Among the studies summarized in this section,
[19, 22, 24, 30, 33] employed microscopic traffic simulation
models such as VISSIM, Paramics, and SUMO;
[17, 18, 20, 21, 23, 31, 32] used macroscopic dynamic traffic
flow models such as CTM [16] and METANET [14] as the
simulation tools.
3. A Q-Learning Problem with Value
Function Approximation
3.1. Multidimensional Continuous State Space. Consider the
freeway section depicted in Figure 1. A lane-drop bottleneck
exists far downstream of the metered ramp. e ramp meter
is supposed to regulate the traffic flow into the bottleneck by
metering the ramp inflow so that the bottleneck capacity can
be fully utilized but not exceeded. To this end, the objective of the ramp metering policy is to maintain the per-lane traffic density of the control target location close to a predetermined desired value, (λ2/λ1)ρcr, where λ1 and λ2 denote the number of
lanes before and after the lane-drop, respectively, and ρcr is
the per-lane critical density. As discussed before, due to the
long distance between the metered ramp and the down-
stream bottleneck, a conventional ramp metering strategy
that only senses and utilizes traffic conditions near the bottleneck can perform poorly due to the lack of anticipation capability. Therefore, one main requirement in designing
our reinforcement learning approach is that it needs to take
into account traffic densities measured along the long stretch
between the metered ramp and the downstream bottleneck
so that an anticipation capability can be built by learning.
Since the computational cost of Q-learning grows expo-
nentially with the increase of the dimension of the state
space, it would not be computationally cost-effective to take
into account measurements at too many places. As a result,
three representative places are selected. They are located at
the two ends and the middle of the stretch, respectively. Such
a treatment, on the one hand, enables the intelligent ramp
meter agent to learn to anticipate traffic flow evolution on
the stretch, and on the other hand, it limits the computa-
tional cost associated with learning. Note that the downstream end of the stretch happens to be the control target location, whose traffic density will be regulated to stay close to the desired value by ramp metering. Therefore, the
first three state variables of the proposed Q-learning
problem are traffic densities of the three representative
places, denoted by ρ1, ρ2, and ρ3, respectively. Note that
when the distance between the metered ramp and the
downstream bottleneck is sufficiently long and meanwhile
the traffic demand pattern is complicated enough in terms of
having frequent and large fluctuations, the resulting tem-
poral-spatial traffic flow pattern may be too complicated for
the three mainline sampling locations to effectively represent
the environment state for the purpose of learning. Under
such a circumstance, more sampling locations may be
needed. The question of which combinations of stretch length and traffic demand pattern would yield temporal-spatial traffic flow patterns complicated enough to make the three representative mainline sampling locations yield suboptimal solutions, and accordingly how many sampling locations should be used under such circumstances, is considered beyond the scope of this paper.
The fourth and last state variable is the estimated traffic demand on the ramp, denoted by Dramp. This state variable is needed because, to learn how much flow from the ramp should be released into the mainline, the intelligent ramp meter agent needs to know not only the traffic conditions at the representative mainline places but also the current (estimated) traffic demand on the ramp, so as to avoid picking a metering rate that is too high. The estimated traffic demand on the ramp over the current time
step is computed by (1), where Dramp(t) denotes the estimated traffic demand on the ramp (in vehicles per hour) for the current time step; lramp_queue(t) represents the queue length on the ramp at the current time step; Δt is the time step length (in seconds); and qramp_arrival(t − 1) represents the arrival flow rate at the ramp over the previous time step.

$$D_{\mathrm{ramp}}(t) \doteq \frac{l_{\mathrm{ramp\_queue}}(t)}{\Delta t / 3600} + q_{\mathrm{ramp\_arrival}}(t-1). \qquad (1)$$
The reason for using the arrival flow rate at the ramp over the previous time step rather than the current time step is the following practical consideration. The ramp metering rate for
the current time step needs to be computed at the end of the
previous time step (or, equivalently, at the beginning of the
current time step) so that it can be implemented over the
current time step; however, by that time, the actual arrival
flow rate over the current time step is unknown because it
has not yet happened. Therefore, the arrival flow rate at the ramp over the previous time step is used as a proxy for the arrival flow rate over the current time step. Such a treatment, which brings anticipation of the ramp condition into learning and thus may enhance the learning efficiency, appears to have first been used by Davarynejad et al. [18]. Note that
the queue length on the ramp of the current time step does
not need a proxy because it can be readily calculated at the
end of the previous time step.
To summarize, the state vector contains four continuous variables, i.e., $s \doteq (\rho_1, \rho_2, \rho_3, D_{\mathrm{ramp}})$, resulting in a four-dimensional continuous state space.
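To make the state construction concrete, the following Python sketch assembles the four-dimensional state vector, with equation (1) used for the ramp demand estimate. It is an illustrative sketch only: the function names (estimate_ramp_demand, build_state) are ours, and the density readings are assumed to be supplied by the simulation model rather than by any specific detector interface.

```python
def estimate_ramp_demand(queue_len_veh, dt_sec, prev_arrival_vph):
    """Equation (1): estimated ramp demand (veh/h) for the current time step."""
    return queue_len_veh / (dt_sec / 3600.0) + prev_arrival_vph

def build_state(rho1, rho2, rho3, queue_len_veh, dt_sec, prev_arrival_vph):
    """State vector s = (rho1, rho2, rho3, D_ramp); densities in veh/km/lane."""
    d_ramp = estimate_ramp_demand(queue_len_veh, dt_sec, prev_arrival_vph)
    return [rho1, rho2, rho3, d_ramp]

# Example: 10 queued vehicles, a 60 s time step, 600 veh/h arrivals in the previous step
s = build_state(rho1=15.0, rho2=18.0, rho3=12.5,
                queue_len_veh=10, dt_sec=60, prev_arrival_vph=600)
# D_ramp = 10 / (60/3600) + 600 = 1200 veh/h
```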
3.2. State-Dependent Action Space. The actions in the pro-
posed approach are composed of discrete ramp metering
rates, as in [29], ranging from the lowest allowable metering
rate, amin, to the highest allowable metering rate, amax. The
values of amin and amax and the number of discrete metering
rates are up to the user’s specification. In Section 4.1, an example of such a specification is given that is consistent
with the requirements of the so-called “full traffic cycle”
signal policy for ramp metering [36] so that the results can be
implemented by a traffic light. At any time step, the set of
admissible actions may not necessarily consist of all the
specified discrete metering rates; it is bounded from above
by the estimated traffic demand on the ramp introduced in
Section 3.1. Such a treatment can prevent the agent from
picking up a metering rate that is higher than the ramp traffic
demand, hence may enhance the learning efficiency. Thus,
the action space at any time step is state-dependent. To
emphasize this point, the action space in this paper is written
as A(s), as will be seen in the remainder of this paper.
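A minimal sketch of the state-dependent admissible action set A(s) is given below. The discrete rates follow the specification later given in Section 4.1, and the upper bound by the estimated ramp demand follows this section; the fallback to the lowest rate when the demand falls below amin is our own assumption, since the paper does not state how that corner case is handled.

```python
METERING_RATES = list(range(200, 1201, 100))  # {200, 300, ..., 1200} veh/h, as in Section 4.1

def admissible_actions(d_ramp_vph, rates=METERING_RATES):
    """A(s): discrete metering rates not exceeding the estimated ramp demand."""
    admissible = [a for a in rates if a <= d_ramp_vph]
    # Assumption: if demand is below the lowest rate, keep the lowest rate so A(s) is never empty
    return admissible if admissible else [rates[0]]

print(admissible_actions(650))  # -> [200, 300, 400, 500, 600]
```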
3.3. Reward. The rewards earned by the intelligent ramp
meter agent during learning should reflect the objective of
the ramp metering policy to be learned. As introduced in
Section 3.1, the objective of the ramp metering policy to be
learned is to maintain the traffic density of the control target
location, ρ3, to stay close to the desired value, (λ2/λ1)ρcr.
Therefore, the reward function can be defined as
$$r = k\left|\rho_3 - \frac{\lambda_2}{\lambda_1}\rho_{cr}\right|. \qquad (2)$$
In (2), r is the reward received by the agent for resulting in ρ3; k is a user-defined negative constant, serving as a scaling factor; the other notations have been defined earlier. The implication of this reward is straightforward: it penalizes
the traffic density of the control target location for deviating
from the desired value. Similar reward designs have been
applied by [20, 23, 29, 30]. In our approach, the reward is a
function of the state resulting from taking an action; but, in
general, depending on needs, the reward can be a function of
the states both before and after taking an action, as well as
the action itself [34].
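The reward of equation (2) is a one-line computation. In the sketch below, λ1 = 3, λ2 = 2, and ρcr = 20 veh/km/lane are the values used later in Section 4.1, and k = −1 is an arbitrary illustrative choice for the user-defined negative scaling factor.

```python
def reward(rho3, k=-1.0, lam1=3, lam2=2, rho_cr=20.0):
    """Equation (2): penalize the deviation of rho3 from the desired density (lam2/lam1)*rho_cr."""
    desired = (lam2 / lam1) * rho_cr       # 13.33 veh/km/lane in the Section 4.1 experiment
    return k * abs(rho3 - desired)

print(reward(rho3=18.0))  # about -4.67
```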
Note that although the reward defined by (2) is based on
the state of the current time step, reinforcement learning
aims to maximize the total of these rewards over the entire
control horizon. There also exist traffic flow optimization
methods which optimize performance measures that are
solely based on the current traffic state but repeat the op-
timization at every time step, e.g., [37, 38]. These two ap-
proaches are different.
3.4. Value Function Approximation by an Artificial Neural
Network. If a lookup table method were to be used, the four-dimensional continuous state space would need to be approximated (discretized) first. If, for example, the simple aggregation method is used to approximate the continuous state space, with the range of the traffic density aggregated into 40 intervals and the range of the estimated traffic demand on the ramp aggregated into 20 intervals, then there will be as many as 40^3 × 20, i.e., 1.28 million, discrete states. Then, if the action space consists of 20 metering rates, the dimension of the resulting lookup table will be 1.28 million × 20. This means that there will be a total of 25.6 million action values (i.e., Q-values) to learn, which is computationally very demanding. This motivates the introduction of value function approximation.
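For reference, the table-size arithmetic behind these figures is:

$$40^3 \times 20 = 1.28 \times 10^{6} \ \text{discrete states}, \qquad 1.28 \times 10^{6} \times 20 = 2.56 \times 10^{7} \ \text{Q-values}.$$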
[Figure 1: The formulated Q-learning problem having four state variables.]

We use an artificial neural network (ANN) to serve as the approximate value function. The role of this approximate value function in the Q-learning process is that, at each time step, it takes as inputs all the state variables, i.e., ρ1, ρ2, ρ3, and Dramp, based on which it computes the values for all the
available actions, as outputs. That is, the approximate value function maps the state vector to another vector, each element of which is the value of the pair of that state and a candidate action. In general, a value function approximated by an ANN is a nonlinear mapping:
$$\mathrm{ANN}: \mathbb{R}^{|S|} \longrightarrow \mathbb{R}^{|A|}. \qquad (3)$$

In (3), ANN represents the value function approximated by an ANN, and |S| and |A| denote the dimensions of the state space and action space, respectively.
3.4.1. State Encoding. In many cases, the state variables are
not directly fed into ANNs; they are first transformed into
some other variables called features [34, 39], which will then
be taken by ANNs. Such a transformation is known as state
encoding or feature extraction [34, 39]. As pointed out by
Bertsekas [39], state encoding can be instrumental in the
success of value function approximation, and with good
state encoding, an ANN need not be very complicated. The state encoding method used in this study is a simple tile
coding method [34], which is described as follows. For each
of the four continuous state variables, its value range is
divided into equal intervals that do not overlap with each
other; as a result, at any time step, the value of a state variable
will fall into one of the intervals that collectively cover the
value range of this state variable; the interval into which the
value of this state variable falls will be given value 1, while all
the others will be given value 0. Such a state encoding
treatment can give the ANN stronger stimuli than a treat-
ment that normalizes state variables to have continuous
values between 0 and 1. To emphasize the fact that the
feature vector is a function of the state vector, in this paper,
the feature vector is written as x(s), as can be seen in the
remainder of this paper.
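The interval-based one-hot (tile) coding described above can be sketched as follows. The bin counts match those used in Section 4.1 (40 density bins over [0, ρjam] per location and 20 demand bins, with the last demand bin covering (amax, ∞)); the helper names and the clamping of boundary values are our own choices.

```python
import numpy as np

def one_hot_interval(value, lower, upper, n_bins):
    """One-hot code of a scalar over n_bins equal, non-overlapping bins of [lower, upper]."""
    idx = int((value - lower) / (upper - lower) * n_bins)
    idx = min(max(idx, 0), n_bins - 1)       # clamp boundary / out-of-range values
    code = np.zeros(n_bins)
    code[idx] = 1.0
    return code

def encode_demand(d_ramp, a_max=1200.0, n_bins=20):
    """[0, a_max] split into n_bins - 1 equal bins; the last bin covers (a_max, inf)."""
    if d_ramp > a_max:
        idx = n_bins - 1
    else:
        idx = min(int(d_ramp / a_max * (n_bins - 1)), n_bins - 2)
    code = np.zeros(n_bins)
    code[idx] = 1.0
    return code

def encode_state(rho1, rho2, rho3, d_ramp, rho_jam=100.0):
    """Feature vector x(s): 40 bins per density plus 20 demand bins = 140 binary features."""
    parts = [one_hot_interval(r, 0.0, rho_jam, 40) for r in (rho1, rho2, rho3)]
    parts.append(encode_demand(d_ramp))
    return np.concatenate(parts)
```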
3.4.2. Structure of the Value Function Approximated by the
ANN. The feature vector, x(s), is then taken by the ANN. The ANN works in the following way. First, through a linear mapping specified by a weight matrix, W, it generates the so-called raw values [40]. Subsequently, each of these raw values is transformed by a nonlinear function, e.g., a sigmoid function, to obtain the so-called threshold values [40]. Such a nonlinear transformation is also known as activation [41]. Then, the threshold values are transformed again through a linear mapping specified by another weight matrix, V. Finally, a vector of coefficients, c, known as the bias coefficients [40], is added to the newly transformed values, yielding the outputs from the ANN, i.e., the vector of action values, q. Note that the dimension of c is equal to the number of actions. Therefore, the ANN is characterized by three sets of parameters, i.e., W, V, and c. In other words, the value function approximated by the ANN is parameterized by W, V, and c. The mapping from the input state vector to the output action-value vector can thus be written in compact form as
$$q = \mathrm{ANN}(x(s); W, V, c). \qquad (4)$$
The structure of the ANN described above is presented in Figure 2. The three sets of parameters, W, V, and c, are unknown and need to be learned through the Q-learning process. The algorithm used for achieving this is presented in
Section 3.5.
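A minimal numerical sketch of the forward pass in equation (4) and Figure 2 is given below, with the dimensions quoted in Section 3.4.3 (140 features, 420 hidden nodes, 20 outputs). The sigmoid activation follows the description above; the initialization scale and the example feature vector are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_actions = 140, 420, 20

# Parameters to be learned: W (features -> hidden), V (hidden -> actions), bias vector c
W = rng.normal(scale=0.01, size=(n_features, n_hidden))
V = rng.normal(scale=0.01, size=(n_hidden, n_actions))
c = np.zeros(n_actions)

def ann_q_values(x, W, V, c):
    """q = ANN(x(s); W, V, c): linear map, sigmoid activation, then linear map plus bias."""
    raw = x @ W                              # "raw values"
    hidden = 1.0 / (1.0 + np.exp(-raw))      # "threshold values" (sigmoid activation)
    return hidden @ V + c                    # one action value per candidate metering rate

x_example = np.zeros(n_features)
x_example[[5, 52, 91, 130]] = 1.0            # an arbitrary one-hot feature vector for illustration
q = ann_q_values(x_example, W, V, c)         # vector of 20 action values
```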
3.4.3. Benefit in Computational Cost. It is worth demon-
strating the benefit in computational cost brought by in-
troducing the ANN approximate value function. Recall that
we have estimated the computational cost of the lookup table
method in the beginning of Section 3.4. To enable a “fair”
comparison with the lookup table method, for the ANN
approximate value function, we also assume that the value
range of each traffic density variable is divided into 40 in-
tervals, and the value range of the estimated traffic demand
on the ramp is divided into 20 intervals. is implies that
there will be a total of 40 ×3+20, i.e., 140 state features. We
further assume that the number of hidden nodes is specified
as 3 times of the number of features, which has been found to
be sufficient to yield good learning outcomes in this study.
is implies that the dimension of the weight matrix Wwill
be 140 ×420. We still assume that there are 20 available
metering rates, as in the lookup table case. is implies that
the dimension of the weight matrix Vwill be 420 ×20, and
the dimension of the bias coefficient vector cwill be 20. As a
result, there will be a total of 67,220 unknown parameters to
learn. Compared with the 25.6 million action values (i.e.,
Q-values) to learn for the lookup table method, the benefit in
computational cost brought by the value function approx-
imation is tremendous.
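The parameter count quoted above breaks down as:

$$\underbrace{140 \times 420}_{W} + \underbrace{420 \times 20}_{V} + \underbrace{20}_{c} = 58{,}800 + 8{,}400 + 20 = 67{,}220.$$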
3.5. The Learning Algorithm. As shown above, thanks to the approximate value function, the computational cost of learning can be profoundly reduced. The price is that the learning algorithm is no longer as straightforward as that of lookup table methods. For a lookup table method, for any
encountered state-action pair, the new Q-value computed by
the so-called temporal difference (TD) rule is directly used to
replace the original Q-value in the lookup table. In general,
the TD rule of Q-learning is defined as [34]
$$Q_{\mathrm{new}}(s, a) = (1-\alpha)\, Q_{\mathrm{old}}(s, a) + \alpha \left( r(s, a, s') + \gamma \max_{b \in A(s')} Q(s', b) \right). \qquad (5)$$
In (5), s and s′ denote the states before and after taking the action, respectively; a and b denote actions; A is the state-dependent action space; r represents the reward received by the agent for moving from state s to state s′ by taking action a; α is the learning rate; and γ is the discounting factor. In our approach, the reward r depends only on the state after taking the action, as described in Section 3.3.
For a value function approximation-based method,
however, replacements of Q-values in a lookup table are no
longer applicable as there is no lookup table at all; instead, at each time step, the original and new Q-values are jointly used to update the parameters of the approximate value
function. In other words, unlike a lookup table method for
which a final lookup table filled by converged Q-values will
be the ultimate outcome of the learning process, a value
function approximation-based method uses Q-values as
training data to calibrate the parameters of the approximate
value function, and the Q-values will not be part of the
ultimate outcome of the learning process. This is a distinct
difference between the two kinds of methods. It is worth
noting that the calibration of the parameters of the ap-
proximate value function is itself a learning problem. Spe-
cifically, it is an incremental supervised learning problem. It
is incremental as information encapsulated in the datum
generated at each time step (i.e., the new Q-value) needs to
be absorbed by the parameters as soon as it becomes
available. It is supervised as the target output (i.e., the new
Q-value) for the approximate value function (i.e., the ANN
in this study) is specified at each time step. The ANN cal-
ibration method employed in this study is the so-called
incremental backpropagation algorithm [40].
The above process is formally presented in Algorithm 1, the pseudocode of the Q-learning algorithm with ANN value function approximation used in this study. There are two minor abuses of notation in Algorithm 1 for convenience of presentation. First, by argmax_{a∈A(s)} ANN(x(s); W, V, c), we mean the metering rate with the highest action value among all admissible metering rates under the current state s. Second, similarly, by max_{a∈A(s)} ANN(x(s); W, V, c), we mean the highest admissible action value under the current state s.
4. Assessments
4.1. Experiment Settings. This section evaluates the effectiveness of the proposed reinforcement learning approach. The layout of the experiment freeway section is illustrated in Figure 3. As shown in Figure 3, a lane-drop is located as far as 3500 meters downstream of the metered ramp. Before the lane-drop, the mainline has 3 lanes, and after it, the mainline has 2 lanes. The ramp has only one
lane.
The classical first-order discrete-time macroscopic model of traffic flow dynamics, the cell transmission model (CTM) [16], is employed as the simulation model. The free-flow speed is set as 120 km/h, the critical density is set as 20 veh/km/lane, and the jam density is set as 100 veh/km/lane. The flow-density fundamental diagram employed is triangular. Thus, the capacity of one lane is 120 × 20 = 2400 veh/h. Since the numbers of lanes before and after the lane-drop are 3 and 2, respectively, and the critical density is 20 veh/km/lane, the desired traffic density for the control target cell is (2/3) × 20 = 13.33 veh/km/lane.
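For readers wishing to reproduce the simulation environment, the sketch below encodes the triangular fundamental diagram and the sending/receiving (demand/supply) flows of a basic CTM with the parameter values stated above. The cell length and time step are not given in the paper, so the illustrative values in the closing comment are our assumptions; any choice must satisfy the CFL condition (free-flow speed × Δt ≤ cell length).

```python
V_FREE = 120.0                               # free-flow speed, km/h
RHO_CR = 20.0                                # critical density, veh/km/lane
RHO_JAM = 100.0                              # jam density, veh/km/lane
Q_MAX = V_FREE * RHO_CR                      # capacity = 2400 veh/h/lane
W_CONG = Q_MAX / (RHO_JAM - RHO_CR)          # congestion wave speed = 30 km/h

def sending_flow(rho, lanes):
    """Demand of a cell (veh/h): free-flow branch of the triangular diagram, capped at capacity."""
    return min(V_FREE * rho, Q_MAX) * lanes

def receiving_flow(rho, lanes):
    """Supply of a cell (veh/h): congested branch of the triangular diagram, capped at capacity."""
    return min(Q_MAX, W_CONG * (RHO_JAM - rho)) * lanes

# The flow from cell i to cell i+1 over one step is min(sending_flow(rho_i, lanes_i),
# receiving_flow(rho_{i+1}, lanes_{i+1})); densities are then updated by vehicle conservation
# with an assumed cell length (e.g., 0.5 km) and time step (e.g., 15 s).
```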
In general, it may not be possible to quantify the
threshold distance between the metered ramp and the downstream bottleneck beyond which a conventional linear feedback-type ramp metering controller fails, as this value may
vary from case to case, depending on factors including the
free-flow speed and design of the linear feedback controller.
For the specific experiment environment as described above,
we found that a proportional-integral (PI) controller, which
is a conventional linear feedback-type controller and can work well for nearby bottlenecks, is no longer stable if the
distance between the metered ramp and the downstream
lane-drop location exceeds 1000 meters.
Traffic demands of the mainline and ramp are given in
Figure 4. This demand profile is similar to what was used in
[18, 23, 29–31]. It is assumed in this study that the traffic flow
is composed of only passenger cars. Multiclass traffic flow
cases are not considered in this study. Note that, in order for
the problem to be meaningful, the mainline demand should
not exceed the mainline capacity after the lane-drop, for otherwise ramp metering cannot help in any way.
The method described in Section 3.4.1 is applied for state encoding.

[Figure 2: Structure of the artificial neural network that serves as the approximate value function.]

The value range of each of the three traffic density variables, [0, ρjam], is equally divided into 40 intervals. The
value range of the estimated traffic demand on the ramp is
divided into 20 intervals. Unlike the value range of any traffic
density variable which has an explicit fixed upper bound
(i.e., ρjam), it is not that straightforward to specify a proper
upper bound for the value range of the estimated traffic
demand on the ramp. We could specify a very large upper
bound to ensure that any estimated traffic demand on the
ramp will fall within the value range. However, this can cause the estimated traffic demand on the ramp to be much lower than the specified upper bound most of the time, which is inefficient. To handle this issue, it is worth recalling the purpose of state encoding: to facilitate the efficiency of learning by translating the state variables into some other variable(s) that are more representative under the
specific learning task. Here, the learning task is to determine
the ramp metering rate which is bounded by the highest
allowable value, amax, regardless of the traffic demand on the ramp.

[Figure 3: Layout of the freeway section used for assessment.]
Data: mainline and ramp traffic demands
Result: calibrated parameters of the artificial neural network
Initialization: set W, V, and c to small random numbers [40]
while episode reward not yet converged do
    Set the freeway network of interest as empty
    Initialize the state s
    while not the end of this episode do
        (1) Determine the ramp metering rate a according to the ε-greedy strategy: a ← argmax_{a∈A(s)} ANN(x(s); W, V, c), or a ← a random element in A(s)
        (2) Simulate to obtain the new state s′, with a implemented
        (3) Compute the reward r based on s′
        (4) Compute Q_old by the ANN: Q_old ← max_{a∈A(s)} ANN(x(s); W, V, c)
        (5) Compute Q_next by the ANN: Q_next ← max_{a∈A(s′)} ANN(x′(s′); W, V, c)
        (6) Compute Q_new by updating Q_old using the temporal difference rule: Q_new ← (1 − α) Q_old + α (r + γ Q_next)
        (7) Update the parameters of the ANN by the incremental backpropagation algorithm, using Q_old as the current output of the ANN and Q_new as the desired output [40]: W, V, c ← Backpropagation(Q_old, Q_new, W, V, c)
        (8) Update the state: s ← s′
    end
end

Algorithm 1: Pseudocode of the algorithm of Q-learning with the value function approximated by an artificial neural network.
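As a complement to Algorithm 1, the following Python sketch renders one training episode, reusing the illustrative helpers sketched earlier in Section 3 (encode_state, admissible_actions, reward, ann_q_values) and a hypothetical simulate_step(state, action) function standing in for the CTM simulator. The ANN is assumed to have one output per discrete metering rate, and the incremental backpropagation step is abbreviated to a single gradient step on the squared error between the greedy action value and the TD target, which is one common way to realize the cited procedure rather than the authors' exact MATLAB implementation.

```python
import numpy as np

def td_backprop_step(x, a_idx, q_target, W, V, c, lr=0.01):
    """One incremental gradient step pushing ANN(x)[a_idx] toward q_target (squared-error loss)."""
    raw = x @ W
    h = 1.0 / (1.0 + np.exp(-raw))           # hidden sigmoid activations
    q = h @ V + c
    err = q[a_idx] - q_target                # dLoss/dq[a_idx] for loss 0.5 * err**2
    dh = err * V[:, a_idx] * h * (1.0 - h)   # backpropagate through the sigmoid layer
    V[:, a_idx] -= lr * err * h
    c[a_idx] -= lr * err
    W -= lr * np.outer(x, dh)
    return W, V, c

def run_episode(simulate_step, initial_state, W, V, c,
                alpha=0.05, gamma=0.95, epsilon=0.1, n_steps=100):
    """One episode of Algorithm 1; simulate_step(state, action) -> (next_state, done) is assumed."""
    s = initial_state                        # s = [rho1, rho2, rho3, D_ramp]
    for _ in range(n_steps):
        x = encode_state(*s)
        actions = admissible_actions(s[3])               # A(s), bounded by the estimated ramp demand
        q = ann_q_values(x, W, V, c)[:len(actions)]      # admissible rates form a prefix of the rate list
        greedy = int(np.argmax(q))
        a_idx = np.random.randint(len(actions)) if np.random.rand() < epsilon else greedy
        s_next, done = simulate_step(s, actions[a_idx])
        r = reward(s_next[2])                            # reward from rho3 of the new state
        q_old = float(q[greedy])                         # max over A(s), as in step (4) of Algorithm 1
        x_next = encode_state(*s_next)
        next_actions = admissible_actions(s_next[3])
        q_next = float(np.max(ann_q_values(x_next, W, V, c)[:len(next_actions)]))
        q_new = (1.0 - alpha) * q_old + alpha * (r + gamma * q_next)   # TD rule (5)
        W, V, c = td_backprop_step(x, greedy, q_new, W, V, c)          # step (7), greedy output updated
        s = s_next
        if done:
            break
    return W, V, c
```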
[Figure 4: Traffic demands for the mainline and the ramp.]
Therefore, a reasonable way to discretize the value
range of the estimated traffic demand on the ramp is as
follows: the range [0, amax] is equally divided into 19 intervals, and the range (amax, ∞) accounts for the last interval.
The above state encoding treatment converts the four-dimensional state vector of continuous variables into a 140-dimensional (40 × 3 + 20 = 140) feature vector of binary variables.
In this experiment, the lowest allowable metering rate,
amin, is set as 200 veh/h, and the highest allowable metering
rate, amax, is set as 1200 veh/h. The range [amin, amax] is equally divided into 10 intervals, resulting in a total of 11 discrete metering rates: {200, 300, . . . , 1100, 1200} veh/h. This specification for the action space is determined following the so-called “full traffic cycle” signal policy for ramp metering [36] to ensure that the optimal metering rates learned through the proposed method can be implemented by a traffic light. Note that {200, 300, . . . , 1100, 1200} veh/h is the largest admissible action space. As introduced in Section 3.2, in the proposed approach, at any time step, the
admissible action space can be smaller than the largest set because it is constrained by the estimated traffic demand on the ramp.

[Figure 5: Comparison of traffic density time series of the control target cell (left column) and traffic density contours (right column), respectively, among the no control case (top row), the PI controller case (middle row), and the case of the proposed approach (bottom row).]
The hyperparameters used in the experiments are specified as follows. The number of hidden neurons is set as 3 times the number of features, i.e., 3 × 140 = 420. The determination of this number was based on a considerable amount of trial-and-error experimentation. If this number is set too large, the training time becomes excessively long; if it is set too small, the approximate value function cannot effectively discriminate state inputs. The learning rate, α, of TD updating rule (5) is set such that, for the first 0.1 million episode iterations, it is equal to 0.05, and it is equal to 0.01 afterwards. The discounting factor, γ, of TD updating rule (5) is set as 0.95. The exploration rate, ε, in the ε-greedy policy in Algorithm 1 is set to decay with the increase of the number of iterated episodes [34].
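The schedules above can be written compactly as follows. The α switch at 100,000 episodes and γ = 0.95 follow the text; the exponential form of the ε decay is only an illustrative placeholder, since the paper does not give the exact decay law.

```python
import math

GAMMA = 0.95  # discounting factor of TD rule (5)

def learning_rate(episode):
    """alpha: 0.05 for the first 100,000 episodes, 0.01 afterwards (Section 4.1)."""
    return 0.05 if episode < 100_000 else 0.01

def exploration_rate(episode, eps0=1.0, eps_min=0.01, decay=5e-6):
    """epsilon-greedy exploration decaying with the episode count (decay form is our assumption)."""
    return eps_min + (eps0 - eps_min) * math.exp(-decay * episode)
```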
4.2. Results. The experiment was coded and executed in MATLAB R2019a. Learning converged after about 0.7 million episode iterations. The left column of Figure 5
presents the resulting traffic density time series of the
control target cell for the case of no control, the case of a
PI controller (which is a conventional linear feedback-
type controller), and the case of the proposed rein-
forcement learning approach; the right column of Figure 5
illustrates the traffic density contours of the entire freeway
section for the three cases. The black dashed line in each
traffic density contour indicates the location of the lane-
drop; the origin of the y-axis of each traffic density
contour corresponds to the beginning location of the
concerned freeway section as depicted in Figure 3. From
Figure 5, it can be seen that, without any control measure, as traffic demands increase, the traffic density of the control target cell soon grows beyond the desired value, and hence congestion initiates at the bottleneck and propagates upstream. Under the PI ramp metering control, the traffic density of the control target cell can be maintained around the desired value on average, however, with severe oscillations that propagate upstream and influence the whole section. Under the ramp metering policy learned through the proposed reinforcement learning approach, the traffic density of the control target cell is kept close to the desired value with almost no fluctuations, and accordingly, the traffic density contour of the entire section is much smoother than in the case of the PI controller.
Figure 6 compares the ramp metering rates computed
by the PI controller (Figure 6(a)) and by the policy learned
through the proposed reinforcement learning approach
(Figure 6(b)). It indicates that the patterns of the two sets
of metering rates are quite different. Moreover, microscopically, the metering rates given by the learned policy are finely shredded in order to avoid the potential time-delay effects due to the long distance, thanks to the fact that it is a highly nonlinear feedback policy and takes in traffic conditions at multiple locations along the stretch. It
is these shredded metering rates that manage to stabilize
the traffic density of the control target cell around the
desired value with almost no fluctuations, as shown in
Figure 5. By contrast, the metering rates given by the PI
controller lack such subtle variations and instead oscillate constantly with large amplitudes, which results in quite
unstable traffic densities of the control target cell, as
shown in Figure 5.
4.3. Robustness. It is of interest to examine to what extent the learned ramp metering policy can tolerate uncertainties in traffic demands. To this end, the traffic demands are corrupted by
white noise. Figure 7 presents the results for the cases in
which the standard deviation of the white noise of the traffic
demands is 50, 100, 150, 200, and 250 veh/h, respectively. It
can be seen that the metering policy learned through the proposed approach performs satisfactorily up to a noise level of 200 veh/h; its performance starts to degrade as the demand noise grows larger.
[Figure 6: Comparison of ramp metering rates computed by the PI controller (a) and by the policy learned through the proposed approach (b).]
[Figure 7: Performance of the ramp metering policy learned through the proposed approach under traffic demands with white noise. From the top row to the bottom row, the standard deviation of the white noise is 50, 100, 150, 200, and 250 veh/h, respectively. The left column shows traffic demands, the middle column shows traffic density time series of the control target cell, and the right column shows the traffic density contours of the entire section.]
5. Conclusions
This paper proposes a reinforcement learning approach to learn an optimal ramp metering policy for controlling a
downstream bottleneck that is far away from the metered
ramp. An artificial neural network replaces the lookup
table in the ordinary Q-learning approach to serve as the
approximate value function. The state vector is chosen so
that a tradeoff between the capability to anticipate traffic
flow evolution and the computational cost is achieved.
The action space is state-dependent to enhance the
learning efficiency. A simple tile coding method is
employed to convert the continuous state vector to a
binary feature vector to give stronger stimuli to the ar-
tificial neural network. The experiment results indicate
that the ramp metering policy learned through the pro-
posed approach is able to yield clearly more stable results
than a conventional linear feedback-type controller.
Specifically, under the learned ramp metering policy, the
traffic density of the control target cell is successfully
maintained to stay close to the desired value with almost
no fluctuations. As a result, traffic flow evolution over the
entire freeway section is also smooth. In comparison,
under a conventional linear feedback-type ramp metering
strategy, the traffic density of the control target cell os-
cillates significantly around the desired value. Conse-
quently, traffic flow evolution over the entire freeway
section also suffers from significant instability. The
metering policy learned through the proposed approach
has also demonstrated some level of robustness in terms of
yielding satisfactory results under uncertain traffic
demands.
For the next step, we plan to extend the proposed
method so that it can manage queue length on the ramp
at the expense of trading off some mainline efficiency.
Another interesting direction is to replace the artificial
neural network approximate value function with a simpler linear approximate value function while employing
more sophisticated state encoding techniques to better
capture the interactions among the state variables so that
a sophisticated approximate value function such as
an ANN may be avoided. It will also be interesting to
examine the impact of the number of representative
mainline sampling locations, especially under circumstances of an excessively long distance between the ramp and the downstream bottleneck and complicated traffic demand patterns. Finally, we will also look
native to the action-value approximation approach in
this paper.
Data Availability
The data used to support the findings of this study are in-
cluded within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors thank Prof. Abhijit Gosavi of Missouri University of Science and Technology for his comments on the manuscript. This work was funded by the C2SMART
University Transportation Center under USDOT award no.
69A3551747124.
References
[1] M. Papageorgiou and A. Kotsialos, “Freeway ramp metering:
an overview,” IEEE Transactions on Intelligent Transportation
Systems, vol. 3, no. 4, pp. 271–281, 2002.
[2] M. Papageorgiou, H. Hadj-Salem, J.-M. Blosseville et al.,
“Alinea: a local feedback control law for on-ramp metering,”
Transportation Research Record, vol. 1320, no. 1, pp. 58–67,
1991.
[3] P. Kachroo and K. Kumar, “System dynamics and feedback
control design problem formulations for real time ramp
metering,” Journal of Integrated Design and Process Science,
vol. 4, no. 1, pp. 37–54, 2000.
[4] K. Ozbay, I. Yasar, and P. Kachroo, “Modeling and paramics
based evaluation of new local freeway ramp metering strategy
that takes into account ramp queues,” Transportation Research Record, vol. 1867, pp. 89–97, 2004.
[5] P. Kachroo and K. Ozbay, Feedback Ramp Metering in Intelligent Transportation Systems, Kluwer Academic, New York, NY, USA, 2003.
[6] Y. Wang, E. B. Kosmatopoulos, M. Papageorgiou, and
I. Papamichail, “Local ramp metering in the presence of a
distant downstream bottleneck: theoretical analysis and
simulation study,” IEEE Transactions on Intelligent Trans-
portation Systems, vol. 15, no. 5, pp. 2024–2039, 2014.
[7] Y. Kan, Y. Wang, M. Papageorgiou, and I. Papamichail, “Local
ramp metering with distant downstream bottlenecks: a
comparative study,” Transportation Research Part C: Emerging
Technologies, vol. 62, pp. 149–170, 2016.
[8] F. de Souza and W. Jin, “Integrating a smith predictor into
ramp metering control of freeways,” in Proceedings of the 2017
96th Transportation Research Board Annual Meeting, New
York, NY, USA, 2017.
[9] J. R. D. Frejo and B. De Schutter, “Feed-forward alinea: a ramp
metering control algorithm for nearby and distant bottle-
necks,” IEEE Transactions on Intelligent Transportation Sys-
tems, vol. 20, no. 7, pp. 2448–2458, 2018.
[10] H. Yu, S. Koga, T. Roux Oliveira, and M. Krstic, “Extremum
seeking for traffic congestion control with a downstream
bottleneck,” 2019.
[11] E. Stylianopoulou, M. Kontorinaki, M. Papageorgiou, and
I. Papamichail, “A linear-quadratic-integral regulator for local
ramp metering in the case of distant downstream bottle-
necks,” Transportation Letters, vol. 1, 2019.
[12] L. Zhao, Z. Li, Z. Ke, and M. Li, “Distant downstream
bottlenecks in local ramp metering: comparison of fuzzy self-
adaptive pid controller and pi-alinea,” in Proceedings of the
2019 19th COTA International Conference of Transportation
Professionals, pp. 2532–2542, New York, NY, USA, 2019.
[13] L. Zhao, Z. Li, Z. Ke, and M. Li, “Fuzzy self-adaptive pro-
portional-integral-derivative control strategy for ramp
metering at distance downstream bottlenecks,” IET Intelligent
Transport Systems, vol. 14, no. 4, pp. 250–256, 2020.
[14] M. Papageorgiou, J.-M. Blosseville, and H. Hadj-Salem,
“Modelling and real-time control of traffic flow on the
southern part of boulevard peripherique in paris: Part i:
Modelling,” Transportation Research Part A: General, vol. 24,
no. 5, pp. 345–359, 1990.
[15] C. Meyer, D. E. Seborg, and R. K. Wood, “A comparison of the
smith predictor and conventional feedback control,” Chem-
ical Engineering Science, vol. 31, no. 9, pp. 775–778, 1976.
[16] C. Daganzo, “The cell transmission model, part i: a simple dynamic representation of highway traffic,” Transportation Research Part B: Methodological, vol. 28, no. 4, pp. 269–287, 1994.
[17] K. Wen, S. Qu, and Y. Zhang, “A machine learning method for
dynamic traffic control and guidance on freeway networks,” in
Proceedings of the 2009 International Asia Conference on
Informatics in Control, Automation and Robotics, pp. 67–71,
New York, NY, USA, 2009.
[18] M. Davarynejad, A. Hegyi, J. Vrancken, and J. van den Berg,
“Motorway ramp-metering control with queuing consider-
ation using q-learning,” in Proceedings of the 2011 14th In-
ternational IEEE Conference on Intelligent Transportation
Systems (ITSC), New York, NY, USA, 2011.
[19] K. Veljanovska, Z. Gacovski, and S. Deskovski, “Intelligent
system for freeway ramp metering control,” in Proceedings of
the 2012 6th IEEE International Conference Intelligent Systems,
pp. 279–282, New York, NY, USA, 2012.
[20] F. Ahmed and W. Gomaa, “Freeway ramp-metering control
based on reinforcement learning,” in Proceedings of the 11th
IEEE International Conference on Control & Automation
(ICCA), pp. 1226–1231, New York, NY, USA, 2014.
[21] F. Ahmed and W. Gomaa, “Multi-agent reinforcement
learning control for ramp metering,” in Progress in Systems
Engineering, pp. 167–173, Springer, Berlin, Germany, 2015.
[22] E. Ivanjko, D. Koltovska Nečoska, M. Gregurić, M. Vujić, G. Jurković, and S. Mandžuka, “Ramp metering control based
on the q-learning algorithm,” Cybernetics and Information
Technologies, vol. 15, no. 5, pp. 88–97, 2015.
[23] Z. Li, P. Liu, C. Xu, H. Duan, and W. Wang, “Reinforcement
learning-based variable speed limit control strategy to reduce
traffic congestion at freeway recurrent bottlenecks,” IEEE
Transactions on Intelligent Transportation Systems, vol. 18,
no. 11, pp. 3204–3217, 2017.
[24] C. Jacob and B. Abdulhai, “Integrated traffic corridor control
using machine learning,” in Proceedings of the 2005 IEEE
International Conference on Systems, Man and Cybernetics,
vol. 4, pp. 3460–3465, New York, NY, USA, 2005.
[25] C. Jacob and B. Abdulhai, “Automated adaptive traffic cor-
ridor control using reinforcement learning: approach and case
studies,” Transportation Research Record, vol. 1, no. 1, pp. 1–8,
2006.
[26] C. Jacob and B. Abdulhai, “Machine learning for multi-
jurisdictional optimal traffic corridor control,” Transportation
Research Part A: Policy and Practice, vol. 44, no. 2, pp. 53–64,
2010.
[27] K. Rezaee, B. Abdulhai, and H. Abdelgawad, “Application of
reinforcement learning with continuous state space to ramp
metering in real-world conditions,” in Proceedings of the 2012
15th International IEEE Conference on Intelligent Trans-
portation Systems, New York, NY, USA, 2012.
[28] K. Rezaee, B. Abdulhai, and H. Abdelgawad, “Self-learning
adaptive ramp metering,” Transportation Research Record:
Journal of the Transportation Research Board, vol. 2396, no. 1,
pp. 10–18, 2013.
[29] T. Schmidt-Dumont and J. H. van Vuuren, “Decentralised
reinforcement learning for ramp metering and variable speed
limits on highways,” IEEE Transactions on Intelligent Trans-
portation Systems, vol. 14, no. 8, p. 1, 2015.
[30] C. Wang, J. Zhang, L. Xu, L. Li, and B. Ran, “A new solution
for freeway congestion: cooperative speed limit control using
distributed reinforcement learning,” IEEE Access, vol. 7,
pp. 41947–41957, 2019.
[31] E. Walraven, M. T. J. Spaan, and B. Bakker, “Traffic flow
optimization: a reinforcement learning approach,” Engi-
neering Applications of Artificial Intelligence, vol. 52, p. 203,
2016.
[32] F. Belletti, D. Haziza, G. Gomes, and A. M. Bayen, “Expert
level control of ramp metering based on multi-task deep
reinforcement learning,” IEEE Transactions on Intelligent
Transportation Systems, vol. 19, no. 4, pp. 1198–1207, 2017.
[33] Y. Wu, H. Tan, and B. Ran, “Differential variable speed limits
control for freeway recurrent bottlenecks via deep rein-
forcement learning,” 2018.
[34] R. S. Sutton and A. G. Barto, Reinforcement Learning: An
Introduction, MIT Press, New York, NY, USA, 2018.
[35] R. E. Bellman, Adaptive Control Processes: A Guided Tour,
Princeton University Press, New York, NY, USA, 2015.
[36] M. Papageorgiou and I. Papamichail, “Overview of traffic
signal operation policies for ramp metering,” Transportation
Research Record, vol. 2047, no. 1, pp. 28–36, 2008.
[37] J. Zhao, W. Ma, Y. Liu, and K. Han, “Optimal operation of
freeway weaving segment with combination of lane assign-
ment and on-ramp signal control,” Transportmetrica A:
Transport Science, vol. 12, no. 5, pp. 413–435, 2016.
[38] C. Zhang, N. R. Sabar, E. Chung, A. Bhaskar, and X. Guo,
“Optimisation of lane-changing advisory at the motorway
lane drop bottleneck,” Transportation Research Part C:
Emerging Technologies, vol. 106, pp. 303–316, 2019.
[39] D. P. Bertsekas, Reinforcement Learning and Optimal Control,
Athena Scientific, Belmont, MA, USA, 2019.
[40] A. Gosavi, Simulation-Based Optimization: Parametric Op-
timization Techniques and Reinforcement Learning, Springer,
Berlin, Germany, 2015.
[41] J.-A. Goulet, Probabilistic Machine Learning for Civil Engi-
neers, MIT Press, New York, NY, USA, 2020.