A sequential transit network design algorithm with optimal learning
under correlated beliefs
Gyugeun Yoon1, Joseph Y.J. Chow2*
1Department of Computational Data Science and Engineering, North Carolina A&T State
University, Greensboro, NC, USA
2C2SMART University Transportation Center, New York University Tandon School of
Engineering, Brooklyn, NY, USA
*Corresponding author email: joseph.chow@nyu.edu
Abstract
Mobility service route design requires potential demand information to adequately accommodate travel demand within the service region. Transit planners and operators can access various data sources, including household travel survey data and mobile device location logs. However, when implementing a mobility system with emerging technologies, estimating the demand level becomes harder because of greater uncertainty in user behavior. Therefore, this study proposes an
artificial intelligence-driven algorithm that combines sequential transit network design with
optimal learning. An operator gradually expands its route system to avoid risks from inconsistency
between designed routes and actual travel demand. At the same time, observed information is
archived to update the knowledge that the operator currently uses. Three learning policies are
compared within the algorithm: multi-armed bandit, knowledge gradient, and knowledge gradient
with correlated beliefs. For validation, a new route system is designed on an artificial network
based on public use microdata areas in New York City. Prior knowledge is reproduced from the
regional household travel survey data. The results suggest that exploration accounting for correlations can generally achieve better performance than greedy choices. In future work, the problem may incorporate more complexities such as demand elasticity with respect to travel time, no limit on the number of transfers, and expansion costs.
Keywords: Mobility Service, Sequential Transit Network Design, Reinforcement Learning,
Correlated Beliefs, Artificial Intelligence
1. Introduction
Mobility service route design depends on identifying the demand level in the potential service
region of a system. In conventional transportation planning, an operator would collect survey data
from travelers in the region and use that to forecast the demand for the whole route design. For
example, information about demand mostly comes from existing data sources such as survey results from samples of the population (e.g. the Regional Household Travel Survey (RHTS) conducted by the New York Metropolitan Transportation Council (NYMTC)) (NYMTC, 2010). Such forecast
models are possible if travelers are familiar with the costs and benefits of the proposed mode(s)
and how they perform relative to existing modes in the region. But for emerging transit
technologies, such as modular autonomous vehicle fleets (e.g. Guo et al., 2017; Caros and Chow,
2021), more elaborate feeder-trunk routes that make use of various shared mobility options (e.g.
Ma et al., 2019), or shared autonomous vehicle fleets that have little to no data to begin with, each
new city deployment requires starting from scratch (Allahviranloo and Chow, 2019). Whereas
regional travel surveys may be conducted every five or ten years, emergent technologies are
unfolding on a yearly time horizon. Because travelers have not experienced an emerging technology mode, their responses to hypothetical questions about it in stated preference surveys may not be as reliable (see Bunch et al., 1993; Watson et al., 2020).
Meanwhile, a mobility service system can accumulate its own data generated during the
operation. This data is not limited to basic operation logs, including vehicle trajectory and load profile data, but also covers the results of interactions with passengers, such as revealed preferences, ridership, average wait times, and ratings. In this way, the routes served by an operator become
important sensors for collecting data. An operator can use these data to design their route system
to fit the prevailing demand in a sequential or phased implementation (Yoon and Chow, 2020)
instead of implementing a single design at once. In this context, service design should be implemented with explicit consideration of the use of the routes as sensors, i.e. the placement of one route design versus another impacts not only the service provided to travelers but also the knowledge gained from the more reliable data on those operated routes. This explicit consideration
of route design as sensors is called transit service route design with “optimal learning.”
In optimal learning, sequential decisions are made such that the information gained from each
decision epoch is optimized to reduce the uncertainty for subsequent decisions. It is a type of
reinforcement learning (RL) (Powell, 2021). Motivated by the similarity between sequential
mobility route design and optimal learning, this study seeks a new approach to design mobility
service routes tailored to local demand patterns by adapting a learning-based scheme. While earlier
work on this problem from Yoon and Chow (2020) considered the use of contextual bandit
algorithms sequentially building out one complete route at a time, that method assumes that
knowledge gained from each route expansion is independent from other options. In a network
setting, this is clearly not the case. An alternative approach is the knowledge gradient (Frazier et
al., 2008), which does not require this assumption of independence between options. However,
implementation of knowledge gradients in network settings can be problematic because of the way
the covariance matrix between different route elements scales up (Ryzhov et al., 2012).
We propose several contributions to resolve this issue for transit route design. First, a system
design is proposed for integrating optimal learning with correlated stochastic variables for
sequential transit network design at the segment extension level. The design includes the flow of
data, required inputs and outputs, and integration of optimal learning methods into a route design
algorithm. Second, three different learning policies are compared in two experiments: a 5-by-5
grid network with artificially generated data and a realistic scenario based on public use microdata
from New York City (NYC). In the latter, two benchmarks are included in the comparison: a
greedy algorithm and the Chow-Regan (CR) reference policy (Chow and Sayarshad, 2016). These
computational experiments provide insights on the performance of sequential transit network
design with optimal learning regarding the inclusion of exploration.
This paper is structured as follows. First, existing works about transit route design and learning
policy are reviewed. Second, the problem is defined, and the proposed route design algorithmic
framework is presented. Third, numerical experiments are designed for illustrating and testing the
proposed algorithm under different learning methods, and results are discussed. Finally,
conclusions and prospective future advances from this research are presented.
2. Literature review
The review in this section examines the conventional and dynamic transit service route design
and explores some optimal learning policies potentially applicable to the proposed methodology.
2.1. Mobility service route design with fixed routes
2.1.1. Conventional approaches with static demand
Transit route design is one of the components of a line planning problem (LPP), where the
latter more broadly includes service frequency or timetable scheduling. Conventional approaches
establish a set of fixed routes that optimizes objective values based on the static demand
information. Routes are not assumed to change once implemented, which makes the system design
simpler and more efficient. This is highly applicable when the demand level is expected to be
stable, or the spatiotemporal scope is sufficiently narrow to perceive demand as constant.
Methodologies for generating routes for static line planning either build them based on the
network characteristics (e.g. Ngamchai and Lovell, 2003; Baaj and Mahmassani, 1995; Israeli and
Ceder, 1995; Chakroborty and Wivedi, 2002; Zhao and Zeng, 2008; Iliopoulou and Kepaptsoglou,
2019) or choose from a predefined set of “feasible routes”, routes that satisfy such criteria
regarding geometrical or operational attributes as total length or mandatory visit to certain nodes
(e.g. Ceder and Israeli, 1998; Pattnaik et al., 1998; Tom and Mohan, 2003; Fan and Machemehl,
2006, 2008; Cipriani et al., 2012a, 2012b; Schmid, 2014; Walteros et al., 2015; Owais et al., 2015).
Numerous methodologies used for generating route sets include route construction heuristics
(Ceder and Wilson, 1986), genetic algorithm (Chien et al., 2001), column-generation algorithm
(Borndörfer et al., 2007), and adaptive neighborhood search metaheuristics (Canca et al., 2017).
Several studies investigated route design with variable or elastic demand. For instance, if the
public transit demand is determined by a mode choice model with travel time attributes, such as a logit model, then ridership will depend on the service level provided. However, these studies assumed the demand matrix and the choice model are known to the planner, making ridership estimable (e.g. Fan and Machemehl, 2008; Lee and Vuchic, 2005; Yoo et al., 2010; Gallo et al., 2011; Zarrinmehr et al., 2016).
2.1.2. Sequential design
In contrast, dynamic demand involves uncertainty that prohibits the precise prediction of
demand. Due to its higher complexity, route design under uncertainty is not well studied. A few
studies tackle LPPs incorporating route design with two-stage stochastic programs or robust
optimization (An and Lo, 2015, 2016; Liang et al., 2019). One strategy for mitigating uncertainty
over a time horizon is to consider the buildout over multiple stages and to adapt subsequent stages
to prior outcomes (Chow and Regan, 2011). Staged development is a natural approach to transit
networks (Mohammed et al., 2006; Li et al., 2015; Sun et al., 2017; Yu et al., 2019). This notion
of flexibility leads to a sequential network design problem under uncertainty, a more complex
category of Markov decision processes where interdependent decisions are made dynamically over
multiple periods with information revealed over time (Chow and Sayarshad, 2016; Powell, 2007).
Approximate dynamic programming (ADP) algorithms and RL are typically used to optimize such
problems. Whereas ADP typically assumes the distribution of the uncertainty is known throughout the decision process and seeks actions that anticipate the future, RL includes uncertainty in the belief of the distribution and manages the trade-off between exploitation (optimizing the system under the current belief of the distribution) and exploration (optimizing the updating of that belief). There are methods that combine the anticipative actions of ADP with the optimal learning aspects of RL (e.g. Powell and Ryzhov, 2012a; Ryzhov et al., 2019).
learning aspects in RL (e.g. Powell and Ryzhov, 2012a; Ryzhov et al., 2019).
RL has been proposed for extending routes (Wei et al., 2020). Yoon and Chow (2020) proposed a similar approach at the route level that builds feasible routes in advance and includes each route as an option for sequential expansion. However, route enumeration can be an obstacle to optimizing the system. An alternative is to design a transit network sequentially, one segment at a time, assuming shorter epochs in which the system obtains feedback after each segment extension; this would avoid route set generation. No literature has considered the sequential line planning problem with segment-level decisions, let alone using RL.
2.2. Reinforcement learning in sequential planning for transportation systems
RL can help determine how to efficiently estimate the consequence of an action given the limited knowledge that we have. Each action corresponds to a reward that operators are interested
in: radio channel availability (Liu and Zhao, 2010), click-through rate of news articles (Li et al.,
2010), and reliability of chosen path (Zhou et al., 2019). The basic concept of RL consists of
exploring and exploiting. After the system accumulates knowledge from exploring various options,
it determines the best option to choose based on the combination of current reward and expected
future value. Different learning policies define their own evaluation measures.
Not surprisingly, some transportation domains have adopted the learning techniques to
dynamically reflect real data collected from previous operations and predict the best action that
systems may take in the next period. TABLE 1 highlights some transportation-related examples
that adopted RL. Zolfpour-Arokhlo et al. (2014) developed a route planning model using multi-
agent RL in a Malaysian intercity road network to reduce travel time between cities. Zhou et al.
(2019) applied a MAB algorithm to sequential departure time and path selection considering on-
time arrival reliability. Mean rewards of options incorporate early and late arrival times and
corresponding penalties. Huang et al. (2019) used knowledge gradient for allocating delivery
vehicles to sectors in a region while minimizing the expected operational cost. The learning
algorithm predicted probabilities of cost curves being the truth from a pool of curves. Römer et al.
(2019) implemented a contextual bandit process to control charging demands of EVs by adjusting
the price and recommending stations to users. Considering station load, charging price, or income
as features which affect driver behavior, they analyzed the effect of bandit algorithms on maximum
loads at stations and average rewards of drivers. Zhu and Modiano (2018) dealt with travel time
delays on the network where delays were collected only in total along the path and those of
individual links would be revealed if selected. Three alternative methods are discussed in more
detail.
TABLE 1 Examples of Learning Techniques Used in Transportation Domain

| Subject | Approach | Information (I) and Action (A) |
| --- | --- | --- |
| Intercity route planning (Zolfpour-Arokhlo et al., 2014) | Q-value based dynamic programming | I: travel time; A: route choice |
| Sequential reliable route selection (Zhou et al., 2019) | Multi-armed bandit | I: generalized travel time, reliability; A: route choice |
| Delivery vehicle allocation (Huang et al., 2019) | Knowledge gradient | I: operational cost curve; A: vehicle allocation to regions |
| Demand management of EV charging stations (Römer et al., 2019) | Contextual bandit | I: load on grid; A: charging price adjustment |
| Stochastic online shortest path routing (Zhu and Modiano, 2018) | Combinatorial bandit | I: end-to-end delay; A: path choice |
2.2.1. Multi-armed bandit
Multi-armed bandit (MAB) involves recommending from a set of fixed, independent options (i.e. arms) over multiple periods to minimize the regret resulting from the revealed rewards in each trial. Regret represents the difference between the maximum possible reward and the acquired reward, as shown in Eq. (1). When the arm chosen at time step \(t\) is labeled as \(I_t\), the observed reward is \(X_{I_t,t}\). Having \(K\) arms, the reward of arm \(i\) is assumed to follow an unknown distribution \(\nu_i\) with mean \(\mu_i\) (Bubeck and Cesa-Bianchi, 2012). In reality, measuring the regret directly is nearly impossible due to the unobservability of simultaneous rewards across all arms.

\[ R_n = n \max_{i=1,\dots,K} \mu_i - \mathbb{E}\left[ \sum_{t=1}^{n} X_{I_t,t} \right] \qquad (1) \]

If the total length of observations equals \(N\), an initialization period \(N_0\) is set to build initial knowledge about the distributions of rewards. During \(N_0\), choices of arms can be made randomly instead of following a certain policy. When \(t > N_0\), the agent chooses the arm that balances the highest expected reward in the current period against lowering the upper bound of possible regret across all periods. An algorithm called the "Upper Confidence Bound" (UCB) limits the growth of total regret to a logarithmic rate in the number of plays, reducing the uncertainty of the total amount of regret (Auer et al., 2002). The drawback of this approach is that every arm is assumed to be independent of the others.
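To make the UCB mechanics concrete, the following is a minimal Python sketch of the UCB1 policy of Auer et al. (2002); it is illustrative only (the arm means and noise level are assumed values, not data from this study), and the code released with this paper is in MATLAB.

```python
# Minimal UCB1 sketch (Auer et al., 2002); arm means and noise are assumed.
import numpy as np

def ucb1(true_means, n_total, noise_sd=1.0, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)          # number of times each arm was pulled
    means = np.zeros(k)           # sample mean reward of each arm
    for t in range(n_total):
        if t < k:                 # initialization: pull each arm once
            arm = t
        else:                     # UCB index: current mean + exploration bonus
            arm = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        reward = rng.normal(true_means[arm], noise_sd)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running average
    return means, counts

means, counts = ucb1(np.array([1.0, 1.2, 0.8]), n_total=500)
```

The exploration bonus shrinks as an arm is pulled more often, which is the same exploration-exploitation trade-off the segment-level policy in Section 3 faces when choosing which extension to observe.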
2.2.2. Knowledge gradient
Knowledge gradient (KG) is an optimal learning technique that sequentially decides on an action taking into account the potential knowledge that may be gained in shaping subsequent decisions (Powell and Ryzhov, 2012b). Let \(S^n\) be the state of knowledge after the \(n\)-th measurement, \(S^{n+1}(x)\) the state after choosing option \(x\) following the \(n\)-th measurement, and \(V^n(S^n)\) the value of being in state \(S^n\) after the \(n\)-th measurement. KG seeks the option that can improve the value the most by estimating the expectation of the value after the \((n+1)\)-th measurement in Eq. (2).

\[ \nu_x^{KG,n} = \mathbb{E}\left[ V^{n+1}\left(S^{n+1}(x)\right) - V^n(S^n) \,\middle|\, S^n \right] \qquad (2) \]

After estimating \(\nu_x^{KG,n}\) for all \(x\), the policy chooses the option that maximizes \(\nu_x^{KG,n}\) in Eq. (3). The derivation of \(\nu_x^{KG,n}\) is illustrated in Section 3.2.2.

\[ X^{KG}(S^n) = \arg\max_{x} \nu_x^{KG,n} \qquad (3) \]
2.2.3. Knowledge gradient with correlated belief
KG with correlated beliefs (KGCB) is similar to KG but considers correlation among options (Powell and Ryzhov, 2012b). The expected value of choosing option \(x\) after the \(n\)-th measurement is estimated by Eq. (4).

\[ \nu_x^{KG,n} = \mathbb{E}\left[ \max_{x'} \left( \theta_{x'}^n + \tilde{\sigma}_{x'}(\Sigma^n, x) Z^{n+1} \right) \,\middle|\, S^n, x^n = x \right] - \max_{x'} \theta_{x'}^n \qquad (4) \]

where \(\theta_{x'}^n\) is the belief about the mean of a measure at \(n\), \(\tilde{\sigma}_{x'}(\Sigma^n, x)\) is the change in our belief about option \(x'\), \(Z^{n+1}\) is a random variable regarding the difference between the observation and the belief, \(S^n\) is the state of knowledge, and \(x^n\) is the option chosen at \(n\). Specifically, the belief covariance matrix \(\Sigma^n\) enters \(\tilde{\sigma}\), providing information about the potential influence of choosing \(x\) on other options. Section 3.2.2 describes the detailed calculation.
2.3. Summary
Some transportation fields have actively adopted RL, especially in traffic flow, traffic signals,
and transit assignment (Abdulhai et al., 2003; Walraven et al., 2016; Cats and West, 2020). There
are some common points between sequential transit network design and optimal learning policies.
First, actions (segment extensions) correspond to rewards (additional demand covered). Second, a
series of observations is used to improve the accuracy of the expected reward estimation. Lastly,
an action at the current time step affects future ones. The following sections propose how to model
a sequential segment-level route design process with optimal learning.
3. Methodology
There is a need to clarify the concepts of "segment extension" and "route expansion." Both terms imply the physical growth of a system, but we distinguish segment extension and route expansion as two different levels, as shown in Figure 1. A segment extension appends a segment to a current route and elongates its length. In contrast, a route expansion adds a new route to an existing route set and improves the coverage of the system. When considering transfers, multiple routes can cover travel demand across more node pairs.
Figure 1. Concept of segment extension and route expansion.
Consider a system that is gradually expanded: we extend a route segment-by-segment and add
it to the existing system. Each time a segment is added, some time passes in which new information
is obtained from users served by that additional segment. The maximum length and number of
routes are design criteria which can be derived from the budget availability.
This approach can benefit operators in several ways. First, they can obtain the most recent
information to supplement the initial knowledge. Second, responses to dynamic demand can be
more prompt as the accumulated data affects the route design. Lastly, it can be more applicable
when the budget is split and sequentially expanded.
3.1. Problem Statement
Suppose an operator plans \(R\) candidate routes on a network \(G(N, A)\) with node set \(N\) and a set \(A\) of candidate bidirectional segments over a time horizon \(T\). A segment can be denoted as \((i, j)\) to identify both of its ends. The operator decides to sequentially expand routes and apply a learning policy \(\pi\) due to demand uncertainty. To solve the sequential segment-level network design problem, we propose to structure the time dimension into a set of segment extension trials \(m \in \{1, \dots, M\}\) embedded within each segment extension time step \(t \in \{1, \dots, T\}\), which in turn make up a larger set of route expansion epochs \(e \in \{1, \dots, R\}\). This is illustrated in Figure 2. For each \(t\), there is a set of alternative segments \(A_t\) that can be appended to the current route \(r_e\), and a set of corresponding nodes \(N_t\). In a segment choice stage, one segment in \(A_t\) is labeled as \(a_t^*\) according to Definition 1. After \(M\) trials, \(r_e\) is extended with \(a_t^*\) according to the evaluation, and the next time step is conducted. This repeats until the termination criteria are satisfied. In contrast, a one-time implementation initiates all routes at the same time based on the knowledge about existing demand.

Definition 1. The sequential segment-level transit network design problem (SSTNDP) selects a segment \(a_t^* \in A_t\) across a sequential set of time steps \(t \in \{1, \dots, T\}\) to maximize the cumulative value \(\sum_{t} \nu_{a_t^*}^{\pi,t}\), where \(\nu_a^{\pi,t}\) is the value of segment \(a\) after \(M\) trials under a learning policy \(\pi\), and consists of a stochastic reward \(\mu_a^t\) and an exploration term \(\eta_a^{\pi,t}\) representing the potential benefit of exploration by collecting additional information from \(a\), i.e. Eq. (5).

\[ a_t^* = \arg\max_{a \in A_t} \nu_a^{\pi,t} = \arg\max_{a \in A_t} \left( \mu_a^t + \eta_a^{\pi,t} \right) \qquad (5) \]
Figure 2. Hierarchy of problem stage and time dimension.
This study assumes that the values of segments depend on OD demand, usually the unit by which passenger travel data are collected. For example, appending segment \(a = (i, j)\) to route \(r_e\) can connect the nodes of \(r_e\) to \(j\), the new node corresponding to \(a\). A broader coverage improvement is expected if the route intersects with others, enabling transfers among routes. Moreover, during the expansion, the operator should also consider correlations between different OD pairs of demand if they are assumed to be correlated. Definition 2 presents the information about OD demand used in the problem.

Definition 2. The OD demand of two nodes is the total bidirectional demand between them, and it is considered the size of a potential passenger group. The true OD demand of node pair \((i, j)\) is denoted as \(d_{ij}\). The \(d_{ij}\)'s across all OD pairs are multivariate normal variates with \(\mathbf{d} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\mu}\) is the array of the means \(\mu_{ij}\), and \(\boldsymbol{\Sigma}\) is the true covariance matrix of \(\mathbf{d}\).
The goal of the operator is to find the optimal sequence of segments appended and routes expanded, achieving the maximum cumulative value of the route system. Let \(V^t\) be the total OD demand coverage at \(t\), i.e. the sum of the \(d_{ij}\)'s served by the current system. If \(\mu_a^t\) is the sum of the additional \(d_{ij}\)'s expected to be covered by appending \(a\), then \(V^{t+1}\) after choosing \(a_t^*\) is derivable as Eq. (6). Furthermore, this implies that \(V^{t+1}\) consists of past values of \(\mu_{a_t^*}^t\). This is possible since the newly covered demands for different \(t\)'s do not share common OD pairs and are mutually exclusive.

\[ V^{t+1} = V^t + \mu_{a_t^*}^t \qquad (6) \]
A segment-level extension allows multiple OD pairs to use the system as their travel mode. Thus, when considering demand as the primary measure, the aggregation of the node pair flows associated with segments is necessary. If a route attaches segment \(a\) and extends accordingly, it can cover more node pairs, and the number increases if there are routes intersecting with it. By defining \(P_a\) as the set of node pairs covered by the system after integrating \(a\), Definition 3 gives the mean and covariance of the options.

Definition 3. For OD demand \(d_{ij}\) where \((i, j) \in P_a\), the total OD demand after choosing \(a\) is given in Eq. (7).

\[ \mu_a^t = \sum_{(i,j) \in P_a} \mu_{ij} \qquad (7) \]

The covariance between options \(a\) and \(a'\) is shown in Eq. (8).

\[ \Sigma_{a a'}^t = \sum_{(i,j) \in P_a} \sum_{(k,l) \in P_{a'}} \mathrm{Cov}\left( d_{ij}, d_{kl} \right) \qquad (8) \]

Both are elements of the segment-level mean vector \(\boldsymbol{\mu}_A^t\) and covariance matrix \(\boldsymbol{\Sigma}_A^t\), respectively. For example, suppose Nodes 1, 2, and 3 in an existing route can access a new segment between Nodes 3 and 4. Then \(P_a = \{(1,2), (1,3), (2,3), (1,4), (2,4), (3,4)\}\). Thus, \(\mu_a^t\) is the sum of the six corresponding means, while \(\Sigma_{aa}^t\) is the sum of the covariances of the 6×6 combinations of the pairs in \(P_a\). At \(t+1\), however, \(\mu_a^{t+1}\) and \(\Sigma_{aa'}^{t+1}\) are renewed as \(P_a\) gains new elements through the expansion; due to this dependence, the segment-level quantities carry the superscript \(t\).
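As a concrete illustration of Definition 3, the following Python sketch aggregates OD-level means and covariances to segment-level quantities per Eqs. (7)-(8). The mapping `pairs_by_option` from each candidate segment to the indices of the OD pairs it would cover is an assumed input structure, not a name from the released code.

```python
# Sketch of Eqs. (7)-(8): aggregating OD-level beliefs to segment-level options.
import numpy as np

def aggregate_to_segments(mu_od, sigma_od, pairs_by_option):
    """mu_od: OD-level mean vector; sigma_od: OD-level covariance matrix;
    pairs_by_option: list of index arrays, one per candidate segment."""
    n = len(pairs_by_option)
    mu_seg = np.zeros(n)
    sigma_seg = np.zeros((n, n))
    for a, pa in enumerate(pairs_by_option):
        mu_seg[a] = mu_od[pa].sum()                            # Eq. (7)
        for b, pb in enumerate(pairs_by_option):
            sigma_seg[a, b] = sigma_od[np.ix_(pa, pb)].sum()   # Eq. (8)
    return mu_seg, sigma_seg
```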
Consequently, an online algorithm is used to solve the SSTNDP, which consists of a stochastic
process for demand, a learning policy to translate a segment selection into a stochastic reward, and
an optimization of the sequential segment selections to maximize the cumulative value. Section
3.2 presents a learning scheme needed to adopt optimal learning within sequential route expansion.
Section 3.3 then presents the proposed algorithm for solving the SSTNDP.
3.2. Preliminaries: learning scheme
3.2.1. Prior design
The learning policies depend on a set of priors that is consistently updated to make their choices on segments. For the stochastic process, prior knowledge about the relationship between an option and its corresponding reward is essential for choice and learning during sequential transit network design. Because the truths are hidden, priors replace them and become the references for the decision-making processes. In this problem, the prior represents the most recent information about OD demand in the given network that an operator can access. Since the OD demand truths \((\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and the segment-level truths \((\boldsymbol{\mu}_A^t, \boldsymbol{\Sigma}_A^t)\) are not revealed, there should be four priors substituting for them. First, \(\boldsymbol{\theta}^0\) and \(\mathbf{S}^0\) are the initial priors of the OD demand truths and are updated to \(\boldsymbol{\theta}^t\) and \(\mathbf{S}^t\) at every \(t\). Second, \(\boldsymbol{\theta}_A^t\) and \(\mathbf{S}_A^t\) at the segment level are prepared for the segment choice at \(t\) and updated to \(\boldsymbol{\theta}_A^{t+1}\) and \(\mathbf{S}_A^{t+1}\) after each extension. Operators cannot directly observe these segment-level priors, but the priors can be aggregated from the OD demand priors by the same process used for the truths. Operators observe \(d_{ij}\), update the corresponding elements in \(\boldsymbol{\theta}^t\) and \(\mathbf{S}^t\), and use them to synthesize \(\boldsymbol{\theta}_A^t\) and \(\mathbf{S}_A^t\). The calculation is the same as the one in Definition 3, except that the notation for \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) is replaced with \(\boldsymbol{\theta}\) and \(\mathbf{S}\). This indicates the possibility that uncertainties are involved during observations.
This indicates the possibility of involvement of uncertainties during observations.
If available, OD demand level priors can be recalled from external information sources or
collected from pilot operations. Nevertheless, the initial pilots cannot cover all possible cases due
to the lack of time and budget. Thus, a subset of options is observed and included in a prior. This
may yield incomplete priors and cause empty elements in mean vectors or covariance matrices.
Since they can disturb choices by distorting values of options, it requires appropriate assumptions
not to bring an incomplete prior into the choice-making processes and to alleviate the inefficiency
in searching for the optimal option. After the prior information is obtained, operators can designate
the first segment to serve, and determine the direction of extension of the route. Simultaneously,
they can construct the database about the local demand despite a lack of completeness.
3.2.2. Evaluation and choice
In sequential transit network design, revenue maximization is an important objective for operators, but it is subject to uncertainties. Optimal learning balances exploration of the unknown and exploitation of the known during the evaluation. Because the calculation of \(\nu_a^{\pi,t}\) is based on accumulated knowledge that is continuously updated, \(\nu_a^{\pi,t}\) from Eq. (5) can change during the observations for a single extension. The current route is extended with \(a_t^*\) at the final trial of the time step.

The \(\nu\)'s of KG and KGCB were introduced in the Literature Review, referring to Powell and Ryzhov (2012b). Because KG neglects interactions between options, its computational complexity is lower. Writing \(S_{aa}^t\) for the current belief variance of option \(a\) and \(\lambda_W\) for the variance of observations, KG includes the calculation of: 1) \(\tilde{\sigma}_a^t\), the reduction in the estimated uncertainty of \(a\) after the update, by Eq. (9); 2) \(\zeta_a^t\), the normalized influence of choosing \(a\), by Eq. (10); and 3) \(\nu_a^{KG,t}\), the knowledge gradient, by Eq. (11).

\[ \tilde{\sigma}_a^t = \sqrt{S_{aa}^t - S_{aa}^{t+1}} = \frac{S_{aa}^t}{\sqrt{S_{aa}^t + \lambda_W}} \qquad (9) \]

\[ \zeta_a^t = -\left| \frac{\theta_a^t - \max_{a' \neq a} \theta_{a'}^t}{\tilde{\sigma}_a^t} \right| \qquad (10) \]

\[ \nu_a^{KG,t} = \tilde{\sigma}_a^t \, f\!\left(\zeta_a^t\right) \qquad (11) \]

\[ f(z) = z\,\Phi(z) + \phi(z) \qquad (12) \]

Eq. (12) is a linear combination of \(\phi(z)\), the standard normal density function, and \(\Phi(z)\), the standard normal cumulative distribution function, as shown in Eqs. (13)-(14).

\[ \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \qquad (13) \]

\[ \Phi(z) = \int_{-\infty}^{z} \phi(u)\, du \qquad (14) \]
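A minimal sketch of the KG factor of Eqs. (9)-(12) follows, assuming Python with SciPy; the variable names are ours, not those of the released MATLAB code.

```python
# Sketch of Eqs. (9)-(12): independent-normal knowledge gradient values.
import numpy as np
from scipy.stats import norm

def kg_values(theta, var, lam_w):
    """theta: belief means; var: belief variances; lam_w: observation variance."""
    var_next = 1.0 / (1.0 / var + 1.0 / lam_w)   # posterior variance after one observation
    sigma_tilde = np.sqrt(var - var_next)        # Eq. (9)
    kg = np.empty_like(theta)
    for a in range(len(theta)):
        best_other = np.max(np.delete(theta, a))
        zeta = -abs((theta[a] - best_other) / sigma_tilde[a])              # Eq. (10)
        kg[a] = sigma_tilde[a] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))  # Eqs. (11)-(12)
    return kg
```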
On the other hand, KGCB requires a more complicated derivation. Recalling Eq. (4), \(\tilde{\sigma}(\mathbf{S}^t, a)\) and \(Z^{t+1}\) should be defined to derive the expected value in Eq. (15) for every option \(a\). Eq. (16) indicates the change in \(\boldsymbol{\theta}^t\) after measuring option \(a\), where \(e_a\) is the column vector with a one in the element corresponding to \(a\) and zeros elsewhere. \(Z^{t+1}\) in Eq. (17) is a random variable in the form of a standard normalization of the difference between \(W_a^{t+1}\) and \(\theta_a^t\), the observation and the belief of the measurement of \(a\); \(W_a^{t+1}\) is itself a random variable due to the uncertainty in observations. Eq. (18) shows the equivalence of the denominators in Eqs. (16)-(17).

\[ \boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^t + \tilde{\sigma}\left(\mathbf{S}^t, a\right) Z^{t+1} \qquad (15) \]

\[ \tilde{\sigma}\left(\mathbf{S}^t, a\right) = \frac{\mathbf{S}^t e_a}{\sqrt{\lambda_W + S_{aa}^t}} \qquad (16) \]

\[ Z^{t+1} = \frac{W_a^{t+1} - \theta_a^t}{\sqrt{\lambda_W + S_{aa}^t}} \qquad (17) \]

\[ \lambda_W + S_{aa}^t = \mathrm{Var}\left[ W_a^{t+1} - \theta_a^t \right] \qquad (18) \]

Eq. (15) is now a set of functions linear in the independent variable \(Z^{t+1}\), and Eq. (4) seeks the expected value of the maximum of these lines over all options. If the lines are sorted by their slopes in ascending order, the intersections \(c_i\) of consecutive lines can be identified; the line that dominates the others on \([c_i, c_{i+1}]\) represents the maximum within that section. After eliminating lines that have no section in which they dominate, the number of remaining lines is \(\tilde{M}\), and \(b_i\) is defined as the slope of the \(i\)-th remaining line with \(b_1 \le \cdots \le b_{\tilde{M}}\). Since some lines have no section in which they are better than the others, the intersections are relabeled with new subscripts after the elimination. After ordering the lines, it is possible to estimate Eq. (4) by Eq. (19).

\[ \nu_a^{KG,t} = \sum_{i=1}^{\tilde{M}-1} \left( b_{i+1} - b_i \right) f\left( -\left| c_i \right| \right) \qquad (19) \]

where \(f\) is the function in Eq. (12). Details of this derivation can be found in Frazier et al. (2009) and Powell and Ryzhov (2012b).
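The following sketch implements the line-envelope computation behind Eq. (19), following the procedure described in Frazier et al. (2009) and Powell and Ryzhov (2012b); it is a compact reimplementation for illustration, not the Castle Labs code used in the experiments.

```python
# Sketch of the KGCB value of measuring one option, Eqs. (15)-(19).
import numpy as np
from scipy.stats import norm

def f(z):                                            # Eq. (12)
    return z * norm.cdf(z) + norm.pdf(z)

def kgcb_value(theta, S, x, lam_w):
    """KG value of measuring option x given belief means theta and covariance S."""
    b = S[:, x] / np.sqrt(lam_w + S[x, x])           # Eq. (16): sigma-tilde
    a = theta.astype(float)                          # line intercepts
    order = np.lexsort((a, b))                       # sort lines a + b*z by slope
    a, b = a[order], b[order]
    keep = np.r_[b[1:] != b[:-1], True]              # equal slopes: keep max intercept
    a, b = a[keep], b[keep]
    env, cuts = [0], [-np.inf]                       # envelope line indices, breakpoints
    for i in range(1, len(a)):
        while True:
            j = env[-1]
            z = (a[j] - a[i]) / (b[i] - b[j])        # intersection of lines j and i
            if z <= cuts[-1]:
                env.pop(); cuts.pop()                # line j never dominates
            else:
                env.append(i); cuts.append(z)
                break
    bs = b[env]
    return float(np.sum((bs[1:] - bs[:-1]) * f(-np.abs(np.array(cuts[1:])))))  # Eq. (19)
```

The policy evaluates `kgcb_value` once per candidate and measures the arg max; because the slopes depend on a full column of the covariance matrix, measuring one option also sharpens the beliefs about correlated options.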
The \(\nu\) of MAB is given in Eq. (20). The stochastic factor is expressed with an upper confidence bound, which can achieve logarithmic regret as \(t\) increases (Auer et al., 2002; Lattimore, 2016).

\[ \nu_a^{MAB,t} = \theta_a^t + \sqrt{\frac{2 \ln t}{k_a^t}} \qquad (20) \]

where \(t\) is the number of time steps, and \(k_a^t\) is the number of times \(a\) has been chosen during the evaluation. More observations of a certain \(a\) lead to a smaller disturbance of its exploration bonus, increasing the predictability of \(\nu_a^{MAB,t}\).
3.2.3. Observation and knowledge update
An option returns a reward when chosen. Because the operator wants to attract as many customers as possible, the covered demand is considered the reward measure. Under uncertainty, expected rewards need to be estimated, and the same action applied twice may not produce the same result.
The model should update the archived knowledge from the observation of the choice result.
For route design, an expanded system can induce more passengers to the system, causing changes
to the previous knowledge about demand patterns. Consequently, the system should update the
knowledge after each expansion until the final route is implemented.
There are three types of knowledge that should be updated: the prior mean, prior variance, and prior covariance matrix. For an easier update, a precision \(\beta = 1/\sigma^2\), the inverse of the variance, replaces the variance; it captures how accurately values can be observed. If only some flows are assumed to be correlated, the priors should be separated accordingly. First, uncorrelated flows are considered random variables following independent normal distributions. Then the update is done with a simple Bayesian prior update procedure as in Eqs. (21)-(22) (Powell and Ryzhov, 2012b).

\[ \theta^{t+1} = \frac{\beta^t \theta^t + \beta_W W^{t+1}}{\beta^t + \beta_W} \qquad (21) \]

\[ \beta^{t+1} = \beta^t + \beta_W \qquad (22) \]

where \(\theta^t\) and \(\beta^t\) are the current beliefs on the mean and precision of a flow, \(\beta_W\) is the precision of observations, \(W^{t+1}\) is the observed flow, and \(\theta^{t+1}\) and \(\beta^{t+1}\) are the updated beliefs.
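A minimal sketch of this precision-form update, using the same symbols as Eqs. (21)-(22):

```python
# Sketch of Eqs. (21)-(22): Bayesian update for one uncorrelated flow.
def update_independent(theta, beta, w, beta_w):
    """theta, beta: prior mean and precision; w: observed flow;
    beta_w: observation precision."""
    theta_new = (beta * theta + beta_w * w) / (beta + beta_w)   # Eq. (21)
    beta_new = beta + beta_w                                    # Eq. (22)
    return theta_new, beta_new
```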
Correlated flows require more complicated update procedures because of the incomplete observations that can be conducted. Since no observation can be complete unless the operating route system covers all flows, the system can only collect information on a subset of flows. The omitted flows require additional consideration for correlated flows, introduced through \(\mathbf{O}^{t+1}\), a matrix indicating on its diagonal whether each flow is observed. Eqs. (23)-(25) are expressions for updating the correlated priors (Perkerson, 2020), where \(\mathbf{B}^t = (\mathbf{S}^t)^{-1}\) is the prior precision matrix.

\[ \boldsymbol{\theta}^{t+1} = \left(\mathbf{B}^{t+1}\right)^{-1} \left( \mathbf{B}^t \boldsymbol{\theta}^t + \beta_W \mathbf{O}^{t+1} \mathbf{W}^{t+1} \right) \qquad (23) \]

\[ \mathbf{B}^{t+1} = \mathbf{B}^t + \beta_W \mathbf{O}^{t+1} \qquad (24) \]

\[ \mathbf{S}^{t+1} = \left(\mathbf{B}^{t+1}\right)^{-1} \qquad (25) \]
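A sketch of Eqs. (23)-(25) in matrix form follows; it is our reading of the standard multivariate normal conjugate update with a diagonal observation indicator, not code reproduced from Perkerson (2020).

```python
# Sketch of Eqs. (23)-(25): updating correlated priors under partial observation.
import numpy as np

def update_correlated(theta, S, w, beta_w, observed):
    """theta, S: prior mean vector and covariance matrix; w: observation vector
    (entries for unobserved flows are ignored); beta_w: observation precision;
    observed: boolean mask flagging which flows were observed."""
    O = np.diag(observed.astype(float))      # observation indicator matrix
    B = np.linalg.inv(S)                     # prior precision matrix
    B_new = B + beta_w * O                   # Eq. (24)
    S_new = np.linalg.inv(B_new)             # Eq. (25)
    theta_new = S_new @ (B @ theta + beta_w * (O @ w))   # Eq. (23)
    return theta_new, S_new
```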
3.3. Proposed AI-based sequential segment-level transit network design algorithm
There are differences between route design and conventional optimal learning. First, route design does not repeat the evaluation on the same option set: after a route is extended by a link, the evaluation of the next extension is based on a different option set. Next, the numbers of observations and operations are relatively limited compared to other optimal learning applications. For instance, if a system aims to maximize the number of clicks on online news articles, it can collect readers' responses ubiquitously. However, the lower operational flexibility of mobility services prohibits operators from conducting massive experiments. Moreover, the rewards of the options available during the evaluation are aggregated OD flows, which requires the aggregation to be repeated for every extension. Considering these differences, the following new system design algorithm is proposed.
Figure 3. Proposed learning-based sequential segment-level network design algorithm.
Figure 3 describes the algorithm for AI-based segment-level extension using optimal learning. In this algorithm, the operator should prepare \(\boldsymbol{\theta}^0\) and \(\mathbf{S}^0\) from available sources such as pilot service operations or existing datasets on OD demand, since the truths \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) are not revealed. Once the initial terminal (or both ends after the first segment choice) of route \(r_e\) is designated, all segments that can be attached to the current \(r_e\) become options. \(\boldsymbol{\theta}^t\) and \(\mathbf{S}^t\) are aggregated to the segment level to form \(\boldsymbol{\theta}_A^t\) and \(\mathbf{S}_A^t\), which are used for the segment evaluation based on \(\pi\). The segment with the highest \(\nu_a^{\pi,t}\) is labeled as the optimal option at the \(m\)-th trial, and its associated flows are observed. After \(M\) trials are observed through Loop I, the operator moves on to the next extension and repeats Loop II until \(r_e\) is complete. Consequently, \(R\) iterations of Loop III yield the complete route system that the operator desires.
This study considers three policies as candidates for the learning policy in this proposed AI-based network design algorithm; they share common aspects of the algorithm, including data processing and input, feedback of interim results, and output production. The evaluation part of the proposed algorithm can be replaced if another learning policy is needed. Algorithm 1 explains the flow of the proposed methodology to solve the model in Eq. (5).
Algorithm 1. Proposed AI-based sequential segment-level transit network design algorithm
1. Prepare network and node pair flows
2. Input simulation settings: number of routes required \(R\), maximum route length \(L_{max}\), number of pilots \(P\), minimum pilot route length \(L_{pilot}\), number of observations per pilot \(M_{pilot}\), number of observations per extension \(M\), learning policy \(\pi\)
2.1. If available, import priors from existing knowledge (\(\boldsymbol{\theta}^0\), \(\mathbf{S}^0\)).
3. Conduct pilots
For \(p = 1\) to \(P\) do
3.1. Randomly choose two nodes as terminals.
3.2. Operate a route between the terminals, longer than \(L_{pilot}\).
3.3. Observe \(M_{pilot}\) operations on the chosen route.
3.4. From the observed \(d_{ij}\)'s, create \(\boldsymbol{\theta}^0\) and \(\mathbf{S}^0\) if not existent. Otherwise, update them.
End For
4. Implement segment-level extension
For \(e = 1\) to \(R\) do
4.1. Designate an initial node as a starting terminal of \(r_e\).
While \(|r_e| < L_{max}\) do
4.2. Identify available links from both ends of \(r_e\) as the option set \(A_t\).
4.3. Estimate \(\nu_a^{\pi,t}\) of the links.
For each \(a \in A_t\) do
4.3.1. Identify the node pairs \(P_a\) covered by the route system with \(a\) appended.
End For
4.3.2. Aggregate \(\boldsymbol{\theta}^t\) and \(\mathbf{S}^t\) of the node pairs to the segment level (\(\boldsymbol{\theta}_A^t\), \(\mathbf{S}_A^t\)).
For \(m = 1\) to \(M\) do
4.3.3. Input the priors to the optimal learning policy and find the option to observe.
4.3.4. Observe the \(d_{ij}\)'s corresponding to the chosen option and update the knowledge.
End For
4.4. Append \(a_t^*\) to \(r_e\) according to Eq. (5).
End While
End For
5. Yield route system and performance measures
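To show how the steps above fit together, here is a compact, hedged Python skeleton of Algorithm 1's nested loops; the released implementation is in MATLAB (see below), and `choose_terminal`, `candidate_segments`, `covered_pairs`, and `observe_flows` are hypothetical placeholders for steps 4.1-4.3.4, while `aggregate_to_segments` and `update_correlated` refer to the earlier sketches.

```python
# Hedged skeleton of Algorithm 1: Loop III over routes, Loop II over segment
# extensions, Loop I over observation trials. Helper functions are placeholders.
import numpy as np

def design_system(network, theta, S, R, L_max, M, beta_w, policy_values):
    routes = []
    for e in range(R):                                        # Loop III (step 4)
        route = [choose_terminal(network, theta)]             # step 4.1
        while len(route) < L_max:                             # Loop II
            options = candidate_segments(network, route)      # step 4.2
            pairs = [covered_pairs(a, routes, route) for a in options]   # 4.3.1
            for m in range(M):                                # Loop I
                mu_s, S_s = aggregate_to_segments(theta, S, pairs)       # 4.3.2
                x = int(np.argmax(policy_values(mu_s, S_s)))             # 4.3.3
                w, mask = observe_flows(options[x])                      # 4.3.4
                theta, S = update_correlated(theta, S, w, beta_w, mask)
            mu_s, _ = aggregate_to_segments(theta, S, pairs)
            route.append(options[int(np.argmax(mu_s))])       # step 4.4, Eq. (5)
        routes.append(route)
    return routes
```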
Algorithm 1 is coded in MATLAB and is made available at the following Zenodo link: Yoon
(2023). Note that it makes use of KGCB and KG codes provided by Castle Labs (2022a,b). The
impact of the network size is limited in Algorithm 1; it only affects the designation of the initial starting terminal. Instead, the complexity of this algorithm is determined by \(R\), \(L_{max}\), \(M\), and the mean size of the option set \(A_t\), which have low correlation with the network size. The more important issue is scalability in terms of the time frame and the algorithm loops. Planners need to consider trade-offs between computational efficiency and the reliability of accumulated data. One example timeframe is expanding the system every few weeks and completing routes every few months, which would be more computationally costly, and more reliable, than a system that considers expansions every few months and completes routes every few years. Detailed numbers will vary with the parameters.
4. Numerical Experiment
The purpose of the numerical experiment is to validate whether the proposed algorithm can
establish a reasonable route system and evaluate its performance under three different learning
policies. This is accomplished using two numerical experiments. First, a small grid network and travel demand within the region are generated. Second, we create an experimental network based on New York City Public Use Microdata Areas (NYC PUMAs). The result of the RHTS is assumed to be the source of the initial prior.
4.1. Simple Grid Networks
4.1.1. Problem and Network Illustrations
For the demonstration of the segment-level system expansion, randomly generated trips travel
on a 5-by-5 grid network with bidirectional links. In this region, an operator seeks a route set that serves the demand best, but the knowledge about trips is limited to the results of pilot services. Figures 4(a) and 4(b) represent the true flows and the observations from pilot services, respectively, where the thickness of a line is proportional to the amount of flow. This shows an example of the knowledge gap between the truth and the observed information.
Several conditions are given in this experiment: 1) the final system consists of three routes with four nodes each, 2) five pilot services are operated for 10 time periods, 3) 198 node pair flows are assumed to be correlated among the 300 available flows, 4) transfers between routes are allowed once, and 5) the standard deviation is assumed to be 10-30% of the true mean.
(a) True flow pattern
(b) Flows observed by pilot service
Figure 4. Example of knowledge gap between truth and observation.
4.1.2. Experiment Inputs
Two kinds of demand information are required in the experiment: truth and prior. All truths of the mean, standard deviation, and covariance are assumed, and priors are derived from pilot services conducted in the experiment.
First, the flow between nodes \(i\) and \(j\) is assumed to be proportional to the product of the trip production constant of \(i\), the trip attraction constant of \(j\), and the inverse of the square of the distance between them. Once the flows are calculated, the true standard deviations are generated to lie between 10% and 30% of the true means. The true covariance matrix \(\boldsymbol{\Sigma}\) is derived from the multiplication of matrices shown in Eq. (26).

\[ \boldsymbol{\Sigma} = \mathbf{D} \mathbf{C} \mathbf{D} \qquad (26) \]

where \(\mathbf{D}\) is a diagonal matrix with the true standard deviations on its diagonal, and \(\mathbf{C}\) is the correlation matrix generated by the constant correlation model (Hardin et al., 2013), satisfying the positive semidefinite (PSD) condition.
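A minimal sketch of Eq. (26) follows, assuming a single constant-correlation block with an illustrative \(\rho\); the generator of Hardin et al. (2013) additionally supports noise and multiple blocks.

```python
# Sketch of Eq. (26): Sigma = D * C * D with a constant-correlation matrix C.
import numpy as np

def make_covariance(sd, rho=0.3):
    """sd: vector of true standard deviations; rho: assumed common correlation."""
    k = len(sd)
    C = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)   # constant correlation; PSD for 0 <= rho < 1
    D = np.diag(sd)
    return D @ C @ D                                     # Eq. (26)
```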
Second, priors are derived from observations during the pilot services. Due to the incompleteness of the pilots, which only partially cover the region, observations are made on subsets of flows. If a flow is observed, its prior mean and standard deviation can be calculated; otherwise, both are set to zero, indicating the absence of observations. In the same manner, zeros are assigned to the covariance matrix elements of omitted observations. Nonetheless, the diagonal elements should not be zero, to maintain the PSD condition of the covariance matrix. Alternatively, we replace those zeros with 1% of the mean. The truth and prior data for this test instance can both be found at https://zenodo.org/badge/latestdoi/636483323.
4.1.3. Results
The purpose of this experiment is to illustrate the algorithm operated with different learning
policies. Figure 5 shows the generated route sets, which vary as different policies are applied. A link is overlapped in the results of MAB and KG because the algorithm chooses the first node of the first route, or the first link of the other routes, based on the remaining unserved flows of the corresponding nodes calculated from the priors, regardless of the current service availability. While the outputs of MAB and KG remain similar except for the last segment extension of the third route, the first route of KGCB initially takes another direction, resulting in the avoidance of overlapping links. Nevertheless, its system shares many common points with the two previous results, which implies the similarity of the evaluation processes of the policies, which depend on the given priors, especially the means.
The demand coverage rate, the percentage of covered demand relative to total demand, is highest when using KGCB, showing about a 3 percentage point difference. According to this result, KGCB is the right policy to implement for the given situation. However, a sequential route expansion result can be influenced by the initial network condition and the pilot service plan. To improve the reliability of the comparison, a more realistic case is investigated in the next experiment.
(a) MAB (20.6%)
(b) KG (20.8%)
(c) KGCB (23.9%)
Figure 5. Route set outcome from policies with demand coverage rate.
4.2. NYC PUMA-based Network
4.2.1. Problem and Network Illustrations
The operator plans the sequential implementation of 5 routes with 8 nodes each. During segment-level extensions, routes can intersect with others to allow a one-time transfer for riders, enhancing the coverage of the system. 30 pilot routes are planned, and each can observe flows 10 times. The minimum length of pilot routes is 6 nodes. The objective value is demand coverage, whose magnitude is very large, from several thousand to a million. This amount of demand may be unrealistic for 5 routes operated by vehicles like buses, but the numbers are kept under the assumption that actual ridership is proportional to total travel demand. To keep the problem simple, travel time considerations by riders, congestion effects on links, and vehicle capacity limitations are all assumed to be negligible.
The generated network consists of 55 nodes and 123 bidirectional links as shown in Figure 6.
For simplicity, link costs for system implementation are assumed to be equal, so that all options require the same cost during a segment extension. Since the number of possible node pairs is 1,485 (=55×54/2), the prior mean vector has 1,485 elements, and the covariance matrix becomes 1,485-by-1,485 if correlations exist between all flows. Among the 1,485 flows, 300 are assumed to be correlated. Therefore, the priors for correlated and uncorrelated flows are updated separately.
There are five boroughs in NYC: Manhattan, Brooklyn, Staten Island, Queens, and the Bronx. First, we assume flows may be correlated with each other if they connect the same pair of boroughs. In addition, Manhattan and the Bronx are integrated into one "megaborough" due to their high connectivity. With four zones, there are 10 possible types of flows, including inter- and intra-borough connections. However, 7 remain after flows connected to Staten Island are aggregated into one group to reflect its being an isolated island. Figure 7 shows the OD flow matrix of the RHTS and the overlapped groups, which are named "clusters" in this experiment.
Figure 6. Experimental network with NYC PUMA.
Figure 7. OD flow matrix of RHTS and clusters of correlated flows.
Any node can be a route terminal when geography is ignored. Nevertheless, since the methodology focuses on demand coverage, nodes located on the periphery are less likely to be chosen due to their disadvantage of limited directions of extension. Instead, nodes in the central area have higher chances of becoming an initial terminal.
4.2.2. Simulation inputs
As shown in the previous experiment, the truth and prior need to be defined. We use the RHTS data as the initial prior. After assuming the priors, synthetic truths are reverse-engineered from them. As a result, the truths in this experiment are artificially constructed under these assumptions.
Figure 8. Aggregation of 1622 Census tracts (left) to 55 PUMAs (right) by overlapping (middle).
First, the flow information from the RHTS becomes the prior mean. As the survey was conducted at the Census tract level, the result is aggregated to the PUMA level as shown in Figure 8. Meanwhile, the RHTS is a single observation, so other priors such as variances and covariances are impossible to derive from it. Thus, prior standard deviations of flows are initially simply assumed, for example, as 5% of the means, while covariances are derived from the results of pilot services as in the previous experiment.
Second, true standard deviations are selected from assumed ranges. For instance, a flow of 10,000 units may vary with a standard deviation between 5% and 19% of 10,000. The range changes as the amount of flow varies. In addition, scenarios are prepared for three different levels of flow variation: Low, Middle, and High. While the lower limit is the same for all levels, different upper limits are set for each. With the generated true standard deviations and prior means, a true mean vector is sampled using the inverse normal distribution, while the true covariance matrix is calculated from the true standard deviations. The truth and prior data for this case study are provided at https://zenodo.org/badge/latestdoi/636483323.
4.2.3. Results
Figure 9 shows cumulative curves of covered demand in 15 scenarios with 3 flow variation
levels. Every scenario is simulated 100 times. The learning policies are compared with other
benchmarks for validation.
First, yellow transparent curves result from a greedy policy that chooses the option with the
highest expected value without considering exploration.
Second, blue transparent curves represent the CR reference policy (Chow and Sayarshad, 2016)
that estimates the distribution of maximum value policy by assuming that randomly selected
policies are samples of a Weibull-distributed policy. The advantage of using this reference policy
over a pure oracle or competitive ratio (Karp, 1992) is that the policy is consistently defined to
represent the best value that can be obtained with a committed design without any online
reoptimization, one that is sensitive to both the underlying network characteristics and the
stochasticity of the uncertain environment. For example, for the same instance over a time horizon,
changing the underlying variance of a stochastic variable may yield the same average competitive
ratio when solved over multiple simulated runs since the average of the best performance would
not change. However, the CR policy would result in a different cumulative distribution for the
maximum value obtainable with the best committed design. This means that while changing
scenarios may make it hard to compare outputs with each other directly, they can be compared
relative to the consistently defined CR policy.
Plotted curves illustrate differences among policies in terms of their performance.
Figure 9. Cumulative curves of covered demand from different learning policies.
In most diagrams, the dark gray curve representing cumulative covered demand of KGCB
tends to be located at the right of the curves from the other two policies. Detailed numerical results
are reported in TABLE 2. \(t\)-tests are used to verify whether the differences in the distributions of covered demand from the learning policies are statistically significant. From observation, there are instances where MAB and KG underperform even the greedy policy, and neither outperforms any portion of the CR policy. Meanwhile, KGCB does outperform some percentiles of the CR policy in some scenarios. While the comparison of MAB with KG is not significant in 5 scenarios, the number of statistically significant differences is 14 for MAB-KGCB and 13 for KG-KGCB, out of the 15 scenarios. This means that the better performance of KGCB is not only graphically shown but also statistically supported.
It is expected that the gaps between KGCB and the other policies may become larger on a network with highly variable flows. However, a statistically significant trend between objective values and flow variation is not found. As such, we conclude that the performance of the learning policies is mainly affected by network attributes other than the level of flow variation.
TABLE 2 Descriptive Statistics of Simulated Covered Demand of Policies and \(t\)-test Results
(Descriptive statistics: mean and standard deviation in millions; two-tail \(t\)-test \(p\)-values.)

| Scenario | MAB mean (s.d.) | KG mean (s.d.) | KGCB mean (s.d.) | MAB vs KG (p) | MAB vs KGCB (p) | KG vs KGCB (p) |
| --- | --- | --- | --- | --- | --- | --- |
| #1L | 10.78 (0.49) | 10.80 (0.64) | 11.29 (0.26) | 0.854 | *** | *** |
| #1M | 11.27 (0.45) | 11.48 (0.43) | 11.76 (0.22) | 0.001 | *** | *** |
| #1H | 11.48 (0.55) | 11.59 (0.45) | 11.90 (0.14) | 0.133 | *** | *** |
| #2L | 10.78 (0.52) | 10.95 (0.42) | 11.37 (0.21) | 0.001 | *** | *** |
| #2M | 10.05 (0.58) | 10.24 (0.46) | 10.18 (0.41) | 0.011 | 0.057 | 0.376 |
| #2H | 11.68 (0.48) | 11.75 (0.52) | 12.08 (0.31) | 0.314 | *** | *** |
| #3L | 11.46 (0.46) | 11.61 (0.52) | 11.83 (0.27) | 0.024 | *** | 0.000 |
| #3M | 10.94 (0.46) | 11.04 (0.44) | 11.10 (0.45) | 0.151 | 0.016 | 0.318 |
| #3H | 10.85 (0.46) | 10.93 (0.44) | 11.12 (0.38) | 0.209 | *** | 0.001 |
| #4L | 10.41 (0.54) | 10.64 (0.35) | 10.95 (0.12) | 0.000 | *** | *** |
| #4M | 10.10 (0.58) | 10.25 (0.41) | 10.59 (0.29) | 0.028 | *** | *** |
| #4H | 10.35 (0.52) | 10.57 (0.34) | 10.70 (0.37) | 0.001 | *** | 0.007 |
| #5L | 9.73 (0.40) | 10.03 (0.45) | 10.40 (0.35) | *** | *** | *** |
| #5M | 9.85 (0.46) | 10.20 (0.37) | 10.30 (0.20) | *** | *** | 0.022 |
| #5H | 10.87 (0.55) | 11.12 (0.50) | 11.46 (0.33) | 0.001 | *** | *** |

Note: \(p\)-values greater than 0.05 are statistically insignificant. *** indicates very strong significance.
Figure 10. Gap between learning and reference policies at the 50th percentile.
Figure 10 plots the gap between the 50th percentile performance of the learning policies and the CR reference policy curves from Figure 9. There is no firm consistency in the order of performance, and incorporating learning into the system design even deteriorates performance in some scenarios. In some cases, KGCB outperforms the 50th percentile of the CR policy (though none of the policies outperform the 100th percentile). In general, KGCB yields the best output relative to the reference policy, and its ability to consider correlations among demand may support this advantage.
Figure 11. Link choice frequencies of learning policies (based on Scenario #5H) averaged over 100
simulations each.
Figure 11 shows the networks with links weighted proportionally to the number of times they were chosen during the 100 simulations under the three different learning policies in the proposed network design algorithm. Most parts of the networks are similar, but KGCB tends to focus more on the Queens neighborhoods and less on the Bronx areas. The results of chi-squared tests between the policies suggest that the differences in their link choice behaviors are statistically significant. The main reason may be the different ways they evaluate links and the extent of their exploration. Since KGCB uses more information than MAB and KG, it decreases the chance of exploring unknowns; namely, it may choose links with more confidence.
5. Conclusion
Considering the limitations of information about prevailing demand and its variability, designing a mobility route system with one-time planning under an assumed static demand pattern may yield lower system performance. This is especially the case for transit networks built on emerging technologies with limited user experience: automated vehicle fleets, electric buses, microtransit, semi-flexible routes, first-last mile hubs, etc. As an alternative, this study presents a new type of
transit network design problem based on using information in a sequential manner, the SSTNDP.
We further propose a new algorithm for extending existing routes and expanding new routes,
making use of optimal learning techniques to consider the difference between the existing
knowledge and actual data observed. The proposed approach sequentially extends the system by
choosing an option within a given set based on the most recent knowledge updated after observing
the consequence of choices made, as an optimal learning system.
Three learning policies, KGCB, KG, and MAB, are integrated as samples into the proposed framework and compared in numerical experiments with a grid network and an artificial NYC PUMA network. Across the two case studies with different truths and flow variation levels, the proposed algorithm with KGCB tends to show the best performance, as measured by mean covered demand. Since it is the only policy that considers correlations between flows, it may have a stronger ability to explore other options.
One of the biggest advantages of this methodology is its applicability to regions without good quality demand pattern data. It reduces the dependency on existing data sources and promotes direct information accumulation, which makes the system more responsive to gaps that could not be captured by previous data collection. Furthermore, due to its gradual expansion of the system, it can be more attractive when the project budget is divided over time and sequentially executed.
For simplicity, this approach excludes some details that exist in the real world, such as an unlimited number of transfers between routes and actual segment extension costs. Without transfers, a designed route system is equivalent to a set of individual routes, and passengers cannot benefit from route expansion. The main difference would be the reward of choices: while this study only includes the demand within the newly extended route, it is necessary to also consider additional OD flows connected to other routes accessible from the current route. Moreover, since the approach relies on optimal learning policies that assume the costs of choosing different options are identical, extending routes to different links in this methodology also requires the same cost. Such additional functions can be considered in future versions of this methodology.
ACKNOWLEDGEMENTS
This study is supported by C2SMART, a USDOT Tier 1 University Transportation Center, and
NSF CMMI-1652735. The work forms one chapter of the dissertation submitted to the Faculty of
the New York University Tandon School of Engineering in partial fulfillment of the requirements
for the degree Doctor of Philosophy (Transportation Planning & Engineering), January 2023. We
thank the dissertation committee members for their insightful comments: Kaan Ozbay, Zhibin
Chen, and external member Ilya Ryzhov.
AUTHOR CONTRIBUTIONS
The authors confirm contribution to the paper as follows: study conception and design: GY,
JC; data collection: GY; analysis and interpretation of results: GY, JC; draft manuscript
preparation: GY, JC. All authors reviewed the results and approved the final version of the
manuscript.
REFERENCES
1. Abdulhai, B., Pringle, R., & Karakoulas, G.J. (2003). Reinforcement learning for true adaptive
traffic signal control. Journal of Transportation Engineering, 129(3), pp.278-285.
2. Allahviranloo, M. and Chow, J.Y., 2019. A fractionally owned autonomous vehicle fleet sizing
problem with time slot demand substitution effects. Transportation Research Part C:
Emerging Technologies, 98, pp.37-53.
3. An, K., & Lo, H.K. (2015). Robust transit network design with stochastic demand considering
development density. Transportation Research Part B: Methodological, 81, 737-754.
4. An, K., & Lo, H.K. (2016). Two-phase stochastic program for transit network design under
demand uncertainty. Transportation Research Part B: Methodological, 84, 157-181.
5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2), 235-256.
6. Baaj, M.H. and Mahmassani, H.S., 1995. Hybrid route generation heuristic algorithm for the
design of transit networks. Transportation Research Part C: Emerging Technologies, 3(1),
pp.31-50.
7. Borndörfer, R., Grötschel, M., Pfetsch, M.E. (2007). A column-generation approach to line
planning in public transport. Transportation Science, 41(1), 123-132.
8. Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-
armed bandit problems. Foundations and Trends® in Machine Learning, 5(1), 1-122.
9. Bunch, D.S., Bradley, M., Golob, T.F., Kitamura, R. and Occhiuzzo, G.P., 1993. Demand for
clean-fuel vehicles in California: a discrete-choice stated preference pilot project.
Transportation Research Part A: Policy and Practice, 27(3), pp.237-253.
10. Canca, D., De-Los-Santos, A., Laporte, G., & Mesa, J.A. (2017). An adaptive neighborhood
search metaheuristic for the integrated railway rapid transit network design and line planning
problem. Computers & Operations Research, 78, 1-14.
11. Caros, N. S., & Chow, J. Y. (2021). Day-to-day market evaluation of modular autonomous
vehicle fleet operations with en-route transfers. Transportmetrica B: Transport Dynamics, 9(1),
109-133.
12. Castle Labs, 2022a. The Optimal Learning Calculator, URL: http://optimallearning.
princeton.edu/software/KnowledgeGradient_IndependentNormal.xlsx. Accessed on July 30,
2022.
13. Castle Labs, 2022b. Matlab implementation of the knowledge gradient for correlated beliefs
using a lookup table belief model, URL: http://optimallearning.princeton.edu/
software/KGCB.zip. Accessed on July 30, 2022.
14. Cats, O., & West, J. (2020). Learning and adaptation in dynamic transit assignment models for
congested networks. Transportation Research Record, 2674(1), pp.113-124.
15. Ceder, A. and Israeli, Y., 1998. User and operator perspectives in transit network design.
Transportation Research Record, 1623(1), pp.3-7.
16. Ceder, A., & Wilson, N. H. (1986). Bus network design. Transportation Research Part B:
Methodological, 20(4), 331-344.
17. Chakroborty, P. and Wivedi, T., 2002. Optimal route network design for transit systems using
genetic algorithms. Engineering Optimization, 34(1), pp.83-100.
18. Chien, S., Yang, Z., & Hou, E. (2001). Genetic algorithm approach for transit route planning
and design. Journal of transportation engineering, 127(3), 200-207.
19. Chow, J.Y. and Regan, A.C., 2011. Network-based real option models. Transportation
Research Part B: Methodological, 45(4), pp.682-695.
20. Chow, J. Y., & Sayarshad, H. R. (2016). Reference policies for non-myopic sequential network
design and timing problems. Networks and Spatial Economics, 16(4), 1183-1209.
21. Cipriani, E., Gori, S., & Petrelli, M. (2012a). Transit network design: A procedure and an application to a large urban area. Transportation Research Part C: Emerging Technologies, 20(1), 3-14.
22. Cipriani, E., Gori, S., & Petrelli, M. (2012b). A bus network design procedure with elastic demand for large urban areas. Public Transport, 4(1), 57-76.
23. Duff, M., & Barto, A. (1996). Local bandit approximation for optimal learning problems. Advances in Neural Information Processing Systems, 9.
24. Fan, W., & Machemehl, R. B. (2006). Optimal transit route network design problem with variable transit demand: genetic algorithm approach. Journal of Transportation Engineering, 132(1), 40-51.
25. Fan, W., & Machemehl, R. B. (2008). A tabu search based heuristic method for the transit route network design problem. In Computer-aided Systems in Public Transport (pp. 387-408). Springer, Berlin, Heidelberg.
26. Frazier, P. I., Powell, W. B., & Dayanik, S. (2008). A knowledge-gradient policy for sequential
information collection. SIAM Journal on Control and Optimization, 47(5), 2410-2439.
27. Gallo, M., Montella, B., & D’Acierno, L. (2011). The transit network design problem with elastic demand and internalisation of external costs: An application to rail frequency optimisation. Transportation Research Part C: Emerging Technologies, 19(6), 1276-1305.
28. Guo, Q. W., Chow, J. Y., & Schonfeld, P. (2017). Stochastic dynamic switching in fixed and flexible transit services as market entry-exit real options. Transportation Research Procedia, 23, 380-399.
29. Hardin, J., Garcia, S. R., & Golan, D. (2013). A method for generating realistic correlation
matrices. The Annals of Applied Statistics, 1733-1762.
30. Huang, Y., Zhao, L., Powell, W. B., Tong, Y., & Ryzhov, I. O. (2019). Optimal learning for urban delivery fleet allocation. Transportation Science, 53(3), 623-641.
31. Iliopoulou, C., & Kepaptsoglou, K. (2019). Integrated transit route network design and infrastructure planning for on-line electric vehicles. Transportation Research Part D: Transport and Environment, 77, 178-197.
32. Israeli, Y., & Ceder, A. (1995). Transit route design using scheduling and multiobjective programming techniques. In Computer-aided Transit Scheduling (pp. 56-75). Springer, Berlin, Heidelberg.
33. Karp, R. M. (1992, July). On-line algorithms versus off-line algorithms: How much is it worth to know the future? In Algorithms, Software, Architecture: Information Processing 92: Proceedings of the IFIP 12th World Computer Congress (Vol. 1, p. 416).
34. Lattimore, T. (2016, June). Regret analysis of the finite-horizon Gittins index strategy for
multi-armed bandits. In Conference on Learning Theory (pp. 1214-1245). PMLR.
35. Lee, Y. J., & Vuchic, V. R. (2005). Transit network design with variable demand. Journal of Transportation Engineering, 131(1), 1-10.
36. Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (pp. 661-670).
37. Li, Z. C., Guo, Q. W., Lam, W. H., & Wong, S. C. (2015). Transit technology investment and selection under urban population volatility: A real option perspective. Transportation Research Part B: Methodological, 78, 318-340.
38. Liang, J., Wu, J., Gao, Z., Sun, H., Yang, X., & Lo, H. K. (2019). Bus transit network design with uncertainties on the basis of a metro network: A two-step model framework. Transportation Research Part B: Methodological, 126, 115-138.
39. Liu, K., & Zhao, Q. (2010). Distributed learning in cognitive radio networks: Multi-armed bandit with distributed multiple players. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3010-3013). IEEE.
40. Ma, T. Y., Rasulkhani, S., Chow, J. Y., & Klein, S. (2019). A dynamic ridesharing dispatch
and idle vehicle repositioning strategy with integrated transit transfers. Transportation
Research Part E: Logistics and Transportation Review, 128, 417-442.
41. Mohammed, A., Shalaby, A., & Miller, E. J. (2006). Empirical analysis of transit network evolution: case study of Mississauga, Ontario, Canada, bus network. Transportation Research Record, 1971(1), 51-58.
42. Ngamchai, S., & Lovell, D. J. (2003). Optimal time transfer in bus transit route network design using a genetic algorithm. Journal of Transportation Engineering, 129(5), 510-521.
43. NYMTC (2010). Household travel survey: Executive summary. URL: https://www.nymtc.org/portals/0/pdf/RHTS/RHTS_FinalExecSummary10.6.2014.pdf. Accessed on Feb 26, 2023.
44. Owais, M., Osman, M. K., & Moussa, G. (2015). Multi-objective transit route network design as set covering problem. IEEE Transactions on Intelligent Transportation Systems, 17(3), 670-679.
45. Pattnaik, S. B., Mohan, S., & Tom, V. M. (1998). Urban bus transit route network design using genetic algorithm. Journal of Transportation Engineering, 124(4), 368-375.
46. Perkerson, E. (2020). Bayesian updating with normal but incomplete signals. Cross Validated answer (version: 2020-04-02). URL: https://stats.stackexchange.com/q/456041. Author profile: https://stats.stackexchange.com/users/256670/eric-perkerson.
47. Powell, W. B. (2021). From reinforcement learning to optimal control: A unified framework
for sequential decisions. In Handbook of Reinforcement Learning and Control (pp. 29-74).
Springer, Cham.
48. Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (Vol. 703). John Wiley & Sons.
49. Powell, W. B., & Ryzhov, I. O. (2012a). Optimal learning and approximate dynamic
programming. Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, 410-431.
50. Powell, W.B., & Ryzhov, I.O. (2012b). Optimal learning (Vol. 841). John Wiley & Sons.
51. Römer, C., Hiry, J., Kittl, C., Liebig, T., & Rehtanz, C. (2019). Charging control of electric vehicles using contextual bandits considering the electrical distribution grid. arXiv preprint arXiv:1905.01163.
52. Ryzhov, I. O., Powell, W. B., & Frazier, P. I. (2012). The knowledge gradient algorithm for a
general class of online learning problems. Operations Research, 60(1), 180-195.
53. Ryzhov, I. O., Mes, M. R., Powell, W. B., & van den Berg, G. (2019). Bayesian exploration
for approximate dynamic programming. Operations research, 67(1), 198-214.
54. Schmid, V. (2014). Hybrid large neighborhood search for the bus rapid transit route design problem. European Journal of Operational Research, 238(2), 427-437.
55. Sun, Y., Guo, Q., Schonfeld, P., & Li, Z. (2017). Evolution of public transit modes in a commuter corridor. Transportation Research Part C: Emerging Technologies, 75, 84-102.
56. Tom, V. M., & Mohan, S. (2003). Transit route network design using frequency coded genetic algorithm. Journal of Transportation Engineering, 129(2), 186-195.
57. Walteros, J. L., Medaglia, A. L., & Riaño, G. (2015). Hybrid algorithm for route design on bus rapid transit systems. Transportation Science, 49(1), 66-84.
58. Walraven, E., Spaan, M. T., & Bakker, B. (2016). Traffic flow optimization: A reinforcement learning approach. Engineering Applications of Artificial Intelligence, 52, 203-212.
59. Watson, V., Luchini, S., Regier, D., & Schulz, R. (2020). Monetary analysis of health outcomes. In Cost-Benefit Analysis of Environmental Health Interventions (pp. 73-93). Academic Press.
60. Wei, Y., Mao, M., Zhao, X., Zou, J., & An, P. (2020, August). City metro network expansion
with reinforcement learning. In Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining (pp. 2646-2656).
61. Yoo, G. S., Kim, D. K., & Chon, K. S. (2010). Frequency design in urban transit networks with variable demand: model and algorithm. KSCE Journal of Civil Engineering, 14(3), 403-411.
62. Yoon, G., & Chow, J. Y. (2020). Contextual bandit-based sequential transit route design under demand uncertainty. Transportation Research Record, 2674(5), 613-625.
63. Yoon, G. (2023). BUILTNYU/SSTNDP: SSTNDP v1.0.0. Dataset on Zenodo. doi:10.5281/zenodo.7939632.
64. Yu, W., Chen, J., & Yan, X. (2019). Space–time evolution analysis of the Nanjing metro network based on a complex network. Sustainability, 11(2), 523.
65. Zarrinmehr, A., Saffarzadeh, M., Seyedabrishami, S., & Nie, Y. M. (2016). A path-based greedy algorithm for multi-objective transit routes design with elastic demand. Public Transport, 8(2), 261-293.
66. Zhao, F., & Zeng, X. (2008). Optimization of transit route network, vehicle headways and timetables for large-scale transit networks. European Journal of Operational Research, 186(2), 841-855.
67. Zhou, J., Lai, X., & Chow, J.Y. (2019). Multi-armed bandit on-time arrival algorithms for
sequential reliable route selection under uncertainty. Transportation Research Record,
2673(10), 673-682.
68. Zhu, R., & Modiano, E. (2018). Learning to route efficiently with end-to-end feedback: The value of networked structure. arXiv preprint arXiv:1810.10637.
69. Zolfpour-Arokhlo, M., Selamat, A., Hashim, S. Z. M., & Afkhami, H. (2014). Modeling of route planning system based on Q value-based dynamic programming with multi-agent reinforcement learning algorithms. Engineering Applications of Artificial Intelligence, 29, 163-177.