Optimizing matching time intervals for ride-hailing services using
reinforcement learning
Guoyang Qin a, Qi Luo b, Yafeng Yin c,∗, Jian Sun a, Jieping Ye d
a Key Laboratory of Road and Traffic Engineering of the State Ministry of Education, Tongji University, Shanghai 201804, China
b Department of Industrial Engineering, Clemson University, Clemson, SC 29631, United States
c Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI 48109, United States
d DiDi AI Labs, Didi Chuxing, Beijing 100085, China
Abstract
Matching trip requests and available drivers efficiently is considered a central operational problem for ride-hailing platforms. A widely adopted matching strategy is to accumulate a batch of potential passenger-driver matches and solve bipartite matching problems repeatedly. The efficiency of matching can be improved substantially if the matching is delayed by adaptively adjusting the matching time interval. The optimal delayed matching is subject to the trade-off between the delay penalty and the reduced wait cost, and depends on the system's supply and demand states. Searching for the optimal delayed matching policy is challenging, as the current policy is compounded with past actions. To this end, we tailor a family of reinforcement-learning-based methods to overcome the curse of dimensionality and sparse reward issues. In addition, this work provides a solution to the spatial partitioning trade-off between the state representation error and the optimality gap of asynchronous matching. Lastly, we examine the proposed methods with real-world taxi trajectory data and garner managerial insights into general delayed matching policies. The focus of this work is single-ride service due to limited access to shared-ride data, while the general framework can be extended to the setting with a ride-pooling component.
Keywords: Ride-hailing service, online matching, reinforcement learning, policy gradient method
1. Introduction
Assigning trip requests to available vehicles is one of the most fundamental operational problems for ride-hailing platforms, which act as mediators that sequentially match supply (available drivers) and demand (pending trip requests) (Alonso-Mora et al., 2017; Wang and Yang, 2019). The common objective is to maximize the system revenue or minimize passengers' average wait time, both of which necessitate devising efficient matching strategies (Zha et al., 2016).
∗Corresponding author, email: yafeng@umich.edu
Preprint submitted to Transp. Res. Part C Emerg. Technol. (accepted version) June 19, 2021
Qin et al. You can read the published version here Transp. Res. Part C
A matching strategy widely adopted by ride-hailing platforms is to formulate and solve a bipartite matching problem between available vehicles and trip requests (Xu et al., 2018). As illustrated in Fig. 1, intentionally controlling the matching time interval to delay the matching of passengers and drivers can adjust the buffer size of potential matches and improve the operational efficiency, because drivers may find a better (i.e., closer, more profitable, or more compatible) match when demand accumulates, and vice versa. As a result, an obvious trade-off arises when extending this matching time interval. On the one hand, the outcome of matching improves with the increasing length of passengers' and drivers' queues; on the other hand, waiting induces penalty costs for each driver and passenger in the queues and, in worst-case scenarios, leads to abandonment. This type of rewarding delay effect is prevalent in online matching markets. Various online platforms, such as dating applications or organ donation systems, have adopted this kind of delayed matching strategy (Azar et al., 2017).
Figure 1: An appropriate matching delay can increase the passenger-driver buffer size and improve the operational efficiency. (a) Passengers and drivers may suffer rather long pickup wait times under an instantaneous matching scheme; (b) moderately delaying the matching can significantly reduce the pickup wait times.
Upon closer inspection of the dynamic matching process, one may see that the appropriate matching time interval depends on the ride-hailing system's state, i.e., the spatiotemporal distributions of supply and demand. One of this work's main findings is that, when either the supply or the demand is scarce, the platform should use instantaneous matching to maximize the system throughput. However, when supply and demand are relatively balanced, strategically controlling the matching time interval is superior. There has been significant research interest in the optimal control policy for the latter case (Azar et al., 2017; Ashlagi et al., 2018; Yang et al., 2020). Limitations of prior studies include requiring parametric models to obtain analytical solutions or mandating unrealistic assumptions on the system's stationarity. Nevertheless, these studies' numerical experiments show that the optimal matching time interval is distinctly sensitive to these assumptions. Hence, these model-based approaches are not implementable in complex ride-hailing applications.
Due to the state space's large dimensionality, including vehicle locations, demand forecasts, and passenger characteristics, obtaining a pragmatic and effective nonparametric control policy for the matching time interval remains an open question. Moreover, when the
system uses a fixed matching time interval, there is no prior information on the ride-hailing system dynamics under alternative time intervals. Reinforcement learning (RL) is therefore an obvious choice to handle this task. This work aims to develop an efficient, data-driven approach to find the optimal matching time intervals. The focus of this work is single-ride service due to limited access to shared-ride data, while the general framework can be extended to the setting with a ride-pooling component.
The main contributions of this work include:
a. Proposing a dedicated reward signal for passenger-driver matching with delay, which addresses the sparse reward issue related to a long matching time interval.
b. Providing a remedy to the spatial partitioning trade-off between the state representation error and the optimality gap of local matching. This trade-off arises when we partition the area and apply asynchronous policies over those grids.
c. Presenting managerial insights for an optimized matching delay based on experiments with real-world data. The RL algorithm identifies the optimized delayed matching policy under medium driver occupancy.
The remainder of this paper is organized as follows. We first review the related literature in Section 2, then provide a concrete formulation of the sequential decision problem in Section 3. We investigate three RL algorithms in Section 4, as each one is a building block for the more advanced one. Finally, we show the results of a real-world case study in Section 5 and draw the final conclusions in Section 6.
2. Literature review
Online matching strategies are generally characterized by two spatiotemporal variables: the matching time interval and the matching radius (Yang et al., 2020). Platforms dynamically adjust these variables to improve system performance measures such as the gross revenue or the cumulative system throughput. The effectiveness of matching strategies is critical for the platform's profitability and customers' perception of service quality (Wang and Yang, 2019). Moreover, due to the negative externalities induced by empty-car cruising (Luo et al., 2019), developing advanced matching strategies ensures the sustainability of ride-hailing services.
A number of studies in the literature obtained analytical results for optimal matching strategies assuming that the system dynamics are known. Most matching strategies developed in the past consider instantaneous matching, i.e., the platform assigns trip requests to drivers upon arrival. For example, considering instantaneous matching, Xu et al. (2020b) investigated the impact of the matching radius on the efficiency of a ride-hailing system and theoretically proved that adaptively adjusting the matching radius can avoid throughput loss due to "wild goose chases." Aouad and Saritaç (2020) modeled the dynamic matching problem as a Markov decision process with Poisson arrivals and developed a 3-approximation algorithm for cost minimization with uniform abandonment rates. They also showed that the batched dispatching policy's performance can be arbitrarily bad in the cost-minimization setting, even with optimally tuned batching intervals. Özkan and Ward (2020) studied the optimal dynamic matching policy in a ridesharing market modeled as a queueing network. Since solving the exact dynamic program for the joint pricing and matching problem with endogenous supply and demand is intractable, they proposed continuous linear-program-based forward-looking policies to obtain upper bounds on the original optimization. The value of delayed matching is obvious for two-sided online service systems (Azar et al., 2017). As the thickness of the market increases over time, passengers and drivers are provided with superior candidates for potential matching at a cost of delay. Ashlagi et al. (2018) studied near-optimal matching strategies with either random or adversarial arrivals of passengers and drivers. The approximation ratio for the random case is constant, which means the algorithm's efficiency is unchanged for large networks. However, challenges remain, as the deadlines for matching were given in their work. Yang et al. (2020) separated the optimization of the matching radius and the matching time interval, intending to maximize the net benefit per unit time. They obtained the optimal matching time interval only under the excess-supply scenario, ignoring passengers' abandonment behaviors. Yan et al. (2020) studied the joint matching and pricing problem in a batched matching setting. The trade-off between wait time and matching time was identified in their equilibrium analysis. In addition, high carryover demand or supply impeded their analysis. Most of these analytical results focused on the non-ride-pooling setting.
Practitioners are more concerned about developing purely data-driven approaches for passenger-driver matching in ride-hailing platforms. Since online matching is a sequential decision under uncertainty, reinforcement learning (RL) is a powerful technique to adaptively improve the matching policy (Li et al., 2019; Al-Abbasi et al., 2019; Lin et al., 2018; Ke et al., 2020; Mao et al., 2020). The main technical challenge is the curse of dimensionality due to the large state space: the system contains a large number of cruising drivers and waiting passengers. The benefit of matching a specific passenger-driver pair, i.e., the reward function, is determined by the geographical information and characteristics. Li et al. (2019) approximated the average behavior of drivers by mean-field theory. Al-Abbasi et al. (2019) used a distributed algorithm to accelerate the search for near-optimal vehicle dispatching policies. Mao et al. (2020) studied the vehicle dispatching policy using an Actor-Critic method. These studies require instantaneous matching to avoid many of the difficult reward design problems. Ke et al. (2020) adopted an RL-based method for the delayed passenger-driver matching problem. They employed a two-stage framework that incorporated a multi-agent RL method at the first stage and a bipartite matching at the second stage. Their work addressed several technical issues that were unresolved in previous studies, such as learning with delayed reward and value error reduction. By combining a queueing-type model with Actor-Critic with experience replay (ACER), we improve the practicability and stability of RL approaches for passenger-driver matching with delay.
3. Problem description and preliminaries
This section first formulates the delayed matching problem as a Markov decision process (MDP). We then show the existence of an optimal matching time interval due to the delayed matching trade-off. Lastly, we describe how to create a simulator for agents to learn optimal policies with streaming information from a ride-hailing system, which is of particular interest to the transportation field.
3.1. Delayed matching problem in ride-hailing
In the process of online passenger-driver matchmaking, a matchmaker makes recurring two-stage decisions through a sequence of discrete time steps t ∈ {0, 1, ..., T}, where T can be either finite or infinite. Given a partitioned area and time step t, the first-stage decision is whether to hold or match the batched passengers and drivers in each grid; this decision can be made asynchronously in different grids. The second-stage decision is to optimize the matching of the batched passengers and drivers in the combined grids where the matchmaker decides to match. The optimization objective of these sequential decisions is to maximize the total wait reward.¹ The notation used throughout this work is summarized in Table A.2 in Appendix A.
This sequential decision-making problem can be formulated as a Markov decision process (MDP). At each time step t, the matchmaker first observes the grid-based system state matrix S(t) by gathering information, including currently batched passenger trip details and idle driver locations, as well as predictions about the future supply and demand in each grid. The matchmaker then follows a policy function π(A(t) = a | S(t) = s) to make the first-stage decision. All passengers and drivers within the grids with action A(t) = 0 ("hold") will be held in the buffer for one more time step. Otherwise, all passengers and drivers within the grids with action A(t) = 1 ("match") will be globally matched in the second-stage decision. Assuming that each driver can take at most one passenger request per match (i.e., not considering ride-pooling), we make the second-stage decision by solving a min-cost bipartite matching problem (Williamson and Shmoys, 2011). At the end of these two-stage decisions, unless abandoning the queue, unmatched drivers or passengers will circulate back to the buffer and new arrivals will be admitted into the buffer. Further, the system state will transition to S(t+1) accordingly.
If we define the wait reward incurred as R(t) = r(S(t), A(t), S(t+1)) and a state transition probability p(S(t+1) | S(t), A(t)), the expected total wait reward starting from t can be expressed as V_π(S(t)) = E_π[ Σ_{k=t+1}^{∞} γ^{k−t−1} R(k) ], where γ is a discount factor. Then the matchmaker's objective is to find an optimal policy π* that maximizes the expected total reward V_π(S(0)):

    max_π V_π(S(0)) = E_π[ Σ_{t=1}^{∞} γ^{t−1} R(t) ]                                    (1)
                    = Σ_a π(a | S(0)) Σ_{s′} [ r(S(0), a, s′) + V_π(s′) ] p(s′ | S(0), a),

where the second summation term can be denoted as an action value function q_π(s, a),

    q_π(s, a) := Σ_{s′} [ r(s, a, s′) + V_π(s′) ] p(s′ | s, a).                          (2)
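To make the definitions above concrete, the following toy sketch evaluates q_π(s, a) for a hypothetical two-state, two-action MDP with known p and r via iterative policy evaluation; all the numeric values are illustrative, and the sketch writes the discount γ on V(s′) explicitly.

```python
# Toy evaluation of q_pi(s, a) for a known two-state MDP.
# States, actions, rewards, and transition probabilities are hypothetical.
GAMMA = 0.9  # discount factor

# p[s][a] -> {s_next: prob}; r[s][a][s_next] -> reward
p = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}}
r = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}},
     1: {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.0}}}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # pi(a|s), uniform

def q(s, a, v):
    """Action value: sum over s' of p(s'|s,a) * (r(s,a,s') + gamma * V(s'))."""
    return sum(p[s][a][s2] * (r[s][a][s2] + GAMMA * v[s2]) for s2 in (0, 1))

def evaluate_v(n_iter=500):
    """Iterative policy evaluation: V(s) = sum_a pi(a|s) q(s, a)."""
    v = {0: 0.0, 1: 0.0}
    for _ in range(n_iter):
        v = {s: sum(policy[s][a] * q(s, a, v) for a in (0, 1)) for s in (0, 1)}
    return v

v = evaluate_v()
q00 = q(0, 0, v)  # action value of taking a=0 in s=0
```

With p and r known, the optimal policy follows from such a q table; the point of the paper is precisely that in the ride-hailing setting neither p nor r is known in advance, which motivates the RL approach.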
1 To align with the reinforcement learning terminology, we use the term "wait reward" (the negated wait cost) and its derivatives, such as reward signal, reward design, and the sparse reward issue, when dealing with the RL modelling; in other cases, we use the more intuitive term "wait cost". Maximizing the wait reward and minimizing the wait cost are thus equivalent objectives of the optimization.
If all the aforementioned functions and their parameters were known, π* could be solved exactly, regardless of the large state space of S(t). However, the formulation above proves intractable in practice. The reasons are twofold:
a. As we have no prior knowledge of either the transition probability p(s′ | s, a) or the reward function r(s, a, s′), exact solutions are ruled out for this problem.
b. Aggregating the original state space matrix S(t) into some smaller space produces a partially observed state, which may violate the Markov assumption in the MDP.²
These two challenges suggest that finding an optimal on/off decision policy for an online passenger-driver matchmaking system is by no means trivial. The next sections elaborate on how these challenges are approached.
3.2. Preliminaries on the matching delay trade-offs
This subsection explains how the matching time interval affects the matching wait time and the pickup wait time in a ride-hailing system. We introduce a deque model to approximate the matching process (Xu et al., 2020b). Due to the deque model's conciseness in theory and descriptiveness in practice, we can derive its performance metrics accordingly.
Without loss of generality, we assume passengers and idle drivers arrive by two independent stationary Poisson processes (with arrival rates λ_P and λ_D, respectively). Their locations are uniformly distributed within a square area (of size d × d). Note that the choice of demand and supply models does not affect the proposed RL algorithm in the following section. Newly arrived passengers and drivers will balk or renege if the expected wait time exceeds a threshold t_max. Upon their arrival, both passengers and drivers are first batched in a buffer pending matching. Pickup wait times are estimated by the Manhattan distance between passengers and drivers divided by the average pickup speed v. Unmatched passengers or drivers will circulate back to the buffer and be carried over to the next matching time window. A fixed matching time interval strategy matches passengers and drivers in a bipartite graph in the second stage at each of a sequence of discrete time steps t ∈ {Δt, 2Δt, 3Δt, ...}. This method is suitable if the deque arrivals are stationary, in which case the delayed matching problem is equivalent to finding an optimal fixed Δt. If the arrivals are nonstationary, optimizing adaptive Δt's becomes substantially more challenging.
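The deque model described above can be sketched in a few lines. The following is a minimal illustration, not the paper's actual simulator: it assumes Poisson arrivals on a unit square, no abandonment, a greedy nearest-neighbour pairing in place of the min-cost bipartite matching, and illustrative parameter values; it reports average matching and pickup wait times for a given fixed interval Δt.

```python
import random

def simulate(dt, lam_p=0.1, lam_d=0.1, horizon=2000.0, d=1.0, v=0.1, seed=0):
    """Deque-model sketch: Poisson passenger/driver arrivals on a d x d square,
    batched matching every dt seconds, greedy oldest-passenger/nearest-driver
    pairing (a stand-in for min-cost bipartite matching), Manhattan pickups."""
    rng = random.Random(seed)
    def poisson_arrivals(lam):
        t, out = 0.0, []
        while True:
            t += rng.expovariate(lam)          # exponential inter-arrival times
            if t > horizon:
                return out
            out.append((t, rng.uniform(0, d), rng.uniform(0, d)))
    passengers, drivers = poisson_arrivals(lam_p), poisson_arrivals(lam_d)
    match_waits, pickup_waits = [], []
    t, pbuf, dbuf = dt, [], []
    while t <= horizon:
        pbuf += [p for p in passengers if t - dt < p[0] <= t]
        dbuf += [q for q in drivers if t - dt < q[0] <= t]
        while pbuf and dbuf:
            p = pbuf.pop(0)                    # oldest batched passenger first
            q = min(dbuf, key=lambda q: abs(p[1]-q[1]) + abs(p[2]-q[2]))
            dbuf.remove(q)
            match_waits.append(t - p[0])
            pickup_waits.append((abs(p[1]-q[1]) + abs(p[2]-q[2])) / v)
        t += dt                                # unmatched agents carry over
    n = max(len(match_waits), 1)
    return sum(match_waits) / n, sum(pickup_waits) / n

mw_short, pw_short = simulate(dt=1.0)   # near-instantaneous matching
mw_long, pw_long = simulate(dt=30.0)    # delayed matching
```

Running the two calls reproduces the qualitative trade-off: the longer interval inflates the matching wait, while the thicker batch tends to shorten pickup distances.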
Fig. 2 illustrates the trade-off curves derived via simulation from the deque model under scenarios with different supply (λ_D) and demand (λ_P). As the matching time interval Δt extends, the matching wait time increases in an approximately linear fashion, as shown in Fig. 2(a), while the pickup wait time curves in Fig. 2(b) descend more steeply. In particular, the scenarios with balanced supply and demand have the steepest U-shaped trade-off curves in Fig. 2(c). These trade-off curves empirically manifest the considerable potential of implementing delayed matching in a ride-hailing platform.
Contour plots in Fig. 3 further depict the minimum expected total wait time and its corresponding optimal Δt, which can be regarded as a value function and a policy function under stationary arrivals. Fig. 3(a) shows that the minimum expected total wait time
2 The transformation of the state space may retain the Markovian properties under certain conditions; interested readers are referred to Proposition 3 in Kim and Smith (1995).
[Figure 2 panels: (a) matching wait time, (b) pickup wait time, (c) total wait time, each plotted against the fixed matching time interval (0–150 s) for several λ_P, λ_D scenarios.]
Figure 2: Change of passenger wait time as the matching time interval is extended in a deque model.
[Figure 3 panels: (a) minimum expected total wait time (s), (b) optimal fixed matching time interval (s), each plotted over λ_P and λ_D ranging from 4.0 to 6.0.]
Figure 3: Minimum expected total wait time and its corresponding optimal ﬁxed matching time interval.
increases as the gap between λ_P and λ_D broadens, while the minimum wait times are similar if the difference between λ_P and λ_D is held constant. Fig. 3(b) shows that the optimal matching intervals must be substantially prolonged if λ_P and λ_D are balanced; otherwise, a shorter delay should be used. The contour patterns clearly corroborate the necessity of delayed matching, especially when λ_P ≈ λ_D.
In the next subsection, we generalize the deque model to a ride-hailing simulator that accommodates nonstationary and spatially dependent arrivals. The rest of the current work is underpinned by this simulator.
3.3. A queueing-based ride-hailing simulator
This section adopts a reinforcement learning approach that approximates the transition dynamics in the MDP to resolve the first challenge in the original formulation. A ride-hailing system simulator enabling the RL approach to sample sequences of states, actions, and matching rewards is needed. To be specific, a ride-hailing simulator provides a controllable interface that connects RL agents to an approximate real-world ride-hailing system. On the real-world side, the simulator is required to simulate trip request collection, delayed matching, and delivery of passengers. On the learning agents' side, the simulator is required to generate observations O(t) that characterize the ride-hailing system's state S(t) and provide informative reward signals. In this regard, we decompose the simulator creation task into two parts: (a) developing an online ride-hailing service simulator and (b) designing observations and reward signals that assist in agent learning.
3.3.1. Creating an interactive simulator for RL
Our first attempt is to reduce the environment complexity by modeling the arrival of passenger requests and idle drivers as a multi-server queueing process, as shown in Fig. 4.
[Figure 4 components: arriving passenger requests enter a buffer; an action-taking module decides MATCH or HOLD per grid; released requests go to a bipartite matching module paired with the vehicle servers (IDLE/HIRED); held requests are retained in the buffer.]
Figure 4: Workﬂows of the online ridehailing service simulator.
Arriving passenger requests are first admitted into a buffer. According to the action matrix A(t), actions are taken separately in two sub-areas (illustrated in Fig. 5), where the sub-area containing all the grids with A(t) = 0 is denoted as S0, and the sub-area containing all the grids with A(t) = 1 as S1. The buffer retains the holding requests in sub-area S0 while releasing the batched requests in sub-area S1 to the bipartite matching module. Note that partitioning the area and expanding the single-valued action to a mixed-action matrix allows the agents to accommodate spatially dependent arrivals. Once the bipartite matching module receives requests, it pulls all idle drivers and executes a global bipartite matching between the two sides in S1. For simplicity, we estimate the edge cost by the Manhattan distance between passengers and drivers divided by the average pickup speed v and solve for the minimum pickup wait time. The matched requests are forwarded to the corresponding vehicle servers module, while the unmatched requests return to the buffer.
To be specific, the second-stage bipartite matching is solved as a linear sum assignment
[Figure 5: the area is partitioned into grids with binary actions; grids with action 1 form sub-area S1 and are pooled for bipartite matching, while grids with action 0 form sub-area S0 and hold their requests in the buffer.]
Figure 5: Gridbased actions and subareabased matching in the ridehailing simulator.
problem (LSAP). The LSAP is modelled as

    Minimize   Σ_{i=1}^{n} Σ_{j=1}^{n} T_ij x_ij                          (3)
    s.t.       Σ_{j=1}^{n} x_ij = 1,   i = 1, 2, ..., n,
               Σ_{i=1}^{n} x_ij = 1,   j = 1, 2, ..., n,
               x_ij ∈ {0, 1},          i, j = 1, 2, ..., n,

where T_ij is the pickup time between driver i ∈ {1, 2, ..., N_D} and passenger j ∈ {1, 2, ..., N_P}, estimated by their Manhattan distance divided by the average pickup speed v. In case N_D ≠ N_P, T_ij is filled into an n × n matrix with a very large value M, where n = max(N_D, N_P); namely, T_ij = M for i ∈ {N_D + 1, ..., n} or j ∈ {N_P + 1, ..., n}. In the ride-pooling setting, the second-stage problem is to solve a general assignment problem (Alonso-Mora et al., 2017), and the queueing model needs to be modified into a matching queue (Cao et al., 2020). Note that these extensions do not affect the general learning framework.
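The padding trick above can be sketched as follows. For clarity, this sketch solves the padded LSAP by brute force over permutations, which is feasible only for tiny n; a production system would use the Hungarian algorithm (e.g., SciPy's `linear_sum_assignment`). The pickup-time matrix is hypothetical.

```python
from itertools import permutations

BIG_M = 10**6  # the large value M used to pad dummy rows/columns

def pad_square(T, n_d, n_p):
    """Pad an n_d x n_p pickup-time matrix to n x n with M, n = max(n_d, n_p)."""
    n = max(n_d, n_p)
    return [[T[i][j] if i < n_d and j < n_p else BIG_M
             for j in range(n)] for i in range(n)]

def solve_lsap(C):
    """Brute-force LSAP: minimise sum_i C[i][perm[i]] over all permutations."""
    n = len(C)
    best_cost, best = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(C[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best, best_cost

# 2 drivers, 3 passengers: Manhattan pickup times in seconds (hypothetical)
T = [[120, 45, 80],
     [60, 150, 30]]
C = pad_square(T, n_d=2, n_p=3)
perm, cost = solve_lsap(C)
# keep only real driver-passenger pairs (drop assignments to dummy entries)
matches = [(i, perm[i]) for i in range(2) if perm[i] < 3]
```

Here driver 0 takes passenger 1 (45 s) and driver 1 takes passenger 2 (30 s); the unmatched passenger 0 is absorbed by the padded dummy driver and circulates back to the buffer.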
Simulating idle drivers' repositioning decisions between these grids in a multi-server queue is critical for the accuracy of the simulator. This work assumes that drivers stay in the same grid, since each grid's size is relatively large compared to the idle time. In addition, the simulator uses a fixed patience time for passengers throughout this study and assumes that drivers always stay active in the system. These features can be easily extended and embodied in the simulator with new data access. The key motivation for developing such a ride-hailing simulator is to provide an interface for agents to evaluate and improve their policy π(a|s), which can be seen as a special case of model-based RL (Sutton and Barto, 2018).
4. Methodology
This section first describes the design of states and reward signals of the ride-hailing environment, which provide essential information for RL agents to improve their policies. It then introduces a family of state-of-the-art RL algorithms called policy gradient methods. These methods take advantage of the special structure of the delayed matching problem and navigate the sampled trajectories to approximate the value and policy functions. To further improve the stability of the algorithm in a dynamic ride-hailing system, we introduce variants of the Actor-Critic method to solve for the near-optimal delayed matching policy with streaming real-world data.
4.1. Designing observable states and reward signals of the ride-hailing environment
4.1.1. Observable states
A state can be treated as a signal conveying information about the environment at a particular time (Sutton and Barto, 2018). In a ride-hailing environment, a complete sense of "how the environment is" should include full information consisting of, but not limited to, the timestamps and origin-destinations (OD) of all the batched requests, the current locations of drivers, and predictions about the future supply and demand. However, the complete state space of the original ride-hailing system is enormous, and sampling from it efficiently becomes impossible. Inspired by the analytical models built in the previous literature (Yang et al., 2020), we abstract aggregated variables from the ride-hailing simulator as partial observations of the environment. The observations are concatenated into a tuple as

    O(t) = {(N_P(i, t), N_D(i, t), λ_P(i, t), λ_D(i, t)) | i ∈ S},        (4)

where N_P(i, t) (resp., N_D(i, t)) is the number of batched passenger requests (resp., idle drivers) in grid i at time step t, and λ_P(i, t) (resp., λ_D(i, t)) is the estimated arrival rate of passenger requests (resp., idle drivers) in grid i from t onward.
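One way to assemble the per-grid observation tuple above is sketched below; the moving-average arrival-rate estimator and all the grid data are our illustrative choices, not prescribed by the paper (any predictor could be substituted).

```python
from collections import namedtuple

# One per-grid observation tuple (N_P, N_D, lam_P, lam_D), as in the text.
GridObs = namedtuple("GridObs", ["n_p", "n_d", "lam_p", "lam_d"])

def estimate_rate(recent_counts, dt=1.0):
    """Estimate an arrival rate from recent per-step arrival counts
    (a simple moving average; a more sophisticated predictor can plug in)."""
    return sum(recent_counts) / (len(recent_counts) * dt) if recent_counts else 0.0

def build_observation(batched_p, idle_d, recent_p, recent_d):
    """Concatenate per-grid tuples into the observation O(t).
    All arguments are dicts keyed by grid id; recent_* hold arrival counts."""
    grids = set(batched_p) | set(idle_d)
    return {i: GridObs(len(batched_p.get(i, [])),
                       len(idle_d.get(i, [])),
                       estimate_rate(recent_p.get(i, [])),
                       estimate_rate(recent_d.get(i, [])))
            for i in sorted(grids)}

# hypothetical two-grid snapshot
obs = build_observation(
    batched_p={"g1": ["req1", "req2"], "g2": ["req3"]},
    idle_d={"g1": ["veh1"], "g2": ["veh2", "veh3"]},
    recent_p={"g1": [2, 1, 3], "g2": [0, 1, 0]},
    recent_d={"g1": [1, 1, 1], "g2": [2, 0, 1]})
```

The resulting dict of tuples is what a learning agent would consume in place of the full state S(t).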
Aggregating the complete state space S(t) into the observation O(t) generalizes the previously formulated MDP to a partially observable Markov decision process (POMDP), which echoes the second challenge mentioned above. We can assume an unknown distribution p(O(t) | S(t)) linking the complete state S(t) and the partial state O(t) observed from the deque model. The matchmaker can re-establish the value function and optimization objective in this setting by replacing S(t) in the MDP setting with O(t). In addition, stochastic policies are enforced, as they can obtain a higher average reward than a deterministic policy (Jaakkola et al., 1995).
Of all the variables in the observation, λ_P(i, t) is the only one that is not directly collectable from the simulator. The learner uses either the average recent arrival rate as a proxy or a more sophisticated predictor to estimate future demand. In contrast, λ_D(i, t) is easy to obtain, as the simulator continues to track vehicle locations in real time. The lost information on the exact locations of vehicles and passengers only affects the first-stage decision A(t). At the second stage, this information has been considered in the edge cost of the bipartite matching and included in the reward function. In summary, we reduce the state space of S(t) significantly, transferring the information not captured in O(t) into noise of the rewarding process.
With observable queueing-based states, the state transition dynamics, i.e., how O(t) should be updated after taking an action, is straightforward. For i ∈ S0 (hold):

    N_P(i, t) = N_P(i, t−1) + n_P(i, t−1:t)
    N_D(i, t) = N_D(i, t−1) + n_D(i, t−1:t)                               (5)

For i ∈ S1 (match):

    N_P(i, t⁻) = N_P(i, t−1) + n_P(i, t−1:t)
    N_D(i, t⁻) = N_D(i, t−1) + n_D(i, t−1:t)
    N_M(i, t)  = min(N_P(i, t⁻), N_D(i, t⁻))
    N_P(i, t)  = N_P(i, t⁻) − N_M(i, t)
    N_D(i, t)  = N_D(i, t⁻) − N_M(i, t)                                   (6)

where n_P(i, t−1:t) and n_D(i, t−1:t) are the numbers of trip requests and idle drivers arriving in grid i between t−1 and t, respectively. N_P(i, t⁻) and N_D(i, t⁻) are the numbers of requests and drivers updated just before matching at t, as they may change during the time step. Meanwhile, λ_P(i, t) and λ_D(i, t) are continuously re-estimated from historical passenger arrivals and real-time vehicle locations.
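The hold/match count updates above can be sketched per grid as follows (the example counts and arrivals are illustrative):

```python
def update_counts(n_p, n_d, arr_p, arr_d, action):
    """Apply the hold update (action=0) or match update (action=1) to one grid.
    n_p, n_d: batched passengers / idle drivers at t-1;
    arr_p, arr_d: new arrivals in (t-1, t].
    Returns (N_P(i, t), N_D(i, t), N_M(i, t))."""
    n_p_minus = n_p + arr_p            # N_P(i, t-): counts just before matching
    n_d_minus = n_d + arr_d            # N_D(i, t-)
    if action == 0:                    # hold: just accumulate, no matches
        return n_p_minus, n_d_minus, 0
    n_m = min(n_p_minus, n_d_minus)    # matched pairs N_M(i, t)
    return n_p_minus - n_m, n_d_minus - n_m, n_m

held = update_counts(3, 1, 2, 1, action=0)     # -> (5, 2, 0)
matched = update_counts(3, 1, 2, 1, action=1)  # -> (3, 0, 2)
```

In a held grid the buffers simply grow; in a matched grid the shorter side is cleared and the surplus carries over to the next step.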
4.1.2. Reward signals
A reward signal is the environment's valuation of an implemented action A(t). In the context of supervised learning, such a signal would be required to instruct which action the agents should take; in RL, however, it is not. To this end, a major advantage distinguishing RL from its supervised counterpart is that instruction on the optimal actions is not mandated.
Nevertheless, the reward signals must still be carefully designed, especially when the environment faces sparse rewards, as in this setting (i.e., the reward is 0 most of the time in delayed matching). A naïve reward signal would remain unknown until a match is executed and the total wait cost for all served passengers is determined. In this regard, a buffer that continues to hold upcoming requests receives zero rewards. This sparse reward issue makes it extremely difficult for RL to relate a long sequence of actions to a distant future reward (Hare, 2019). In this ride-hailing environment, agents fed with this naïve reward get stuck taking a = 0, since the zero reward of taking a = 0 is greater than any negative reward incurred by taking a = 1.
To tackle this challenge, we decompose this one-shot reward happening at a certain time into each time step within the passengers' queueing lifetime. Since the total matching wait time is in itself a summation of the wait times at each time step, it is straightforward to delineate its increment as a matching wait reward:

    R_m(t) = − Σ_{i∈S} [N_P(i, t−1) + τ_m(n_P(i, t−1:t))],               (7)

where τ_m(·) computes the matching wait time of the newly arrived requests in grid i from t−1 to t. In particular, if the arrival of passengers is assumed to be a Poisson process, τ_m(n_P(i, t−1:t)) ≈ n_P(i, t−1:t)/2.
However, the pickup wait time is not a summation per se; its decomposition is less straightforward. To resolve this issue, we devise an artifice so that the increments add up to the final pickup wait time in the matching process. As shown in Fig. 6, the intuition is to decompose the one-shot pickup wait time τ_p(t) into a summation of step-wise incremental pickup wait times, i.e., τ_p(1), ..., τ_p(t−1) − τ_p(t−2), τ_p(t) − τ_p(t−1).
[Figure 6 illustrates the telescoping sum: Σ Δ pickup wait = [τ_p(t) − τ_p(t−1)] + [τ_p(t−1) − τ_p(t−2)] + ... + τ_p(1) = τ_p(t), i.e., the step-wise increments accumulated while passengers are pooled recover the one-shot pickup wait time realized when they are matched.]
Figure 6: Decomposition of the one-shot pickup wait time τ_p(t) into a summation of step-wise incremental pickup wait times.
In light of this decomposition, we define the step reward mathematically as:

R_p(S, t) = { −[τ_p(N_P(S_0, t), N_D(S_0, t)) − τ_p(N_P(S_0, t−1), N_D(S_0, t−1))],  S = S_0;
              −τ_p(N_P(S_1, t), N_D(S_1, t)),  S = S_1,  (8)

where τ_p(N_P(S_0, 0), N_D(S_0, 0)) := 0, and

N_P(S_0, t) = Σ_{i∈S_0} N_P(i, t),   N_P(S_1, t) = Σ_{i∈S_1} N_P(i, t),
N_D(S_0, t) = Σ_{i∈S_0} N_D(i, t),   N_D(S_1, t) = Σ_{i∈S_1} N_D(i, t).
Note that the difference in Eq. 8 is the net increase of the pickup wait reward from holding (A(S_0, t) = 0) the requests for one more time step. Comparing this net increase with the net loss of the matching wait reward within the same time step indicates whether holding the passenger requests is the better option. Thus, this decomposition not only addresses the sparse reward issue but also introduces a short-term signal that facilitates the learning process.

Therefore, the total pickup wait reward at t is

R_p(t) = R_p(S_0, t) + R_p(S_1, t).  (9)

Summing up Eq. 7 and Eq. 9, we obtain the total wait reward at time t:

R_w(t) = c_m R_m(t) + c_p R_p(t),  (10)

where c_m and c_p are the perceptual costs per unit wait time. These weights can be set to accommodate the asymmetric perception of time values in different stages of the waiting process (Guo et al., 2017). In general, c_m > c_p, as passengers are less patient in the face of an uncertain matching outcome (Xu et al., 2020a).
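As a sanity check, the stepwise rewards defined above must telescope back to the one-shot quantities they replace. A minimal sketch with hypothetical wait values (the τ_p sequence and the matching reward R_m below are illustrative, not from the experiments; c_m = 4 and c_p = 1 follow the setup in Section 5.1):

```python
# Sanity check of the stepwise reward decomposition (Eqs. 7-10).
# tau_p[t] is a hypothetical one-shot pickup wait if the batch were
# matched at step t; holding lets drivers pool and the wait shrink.
tau_p = [0.0, 90.0, 70.0, 55.0, 48.0]   # tau_p(0) := 0 by convention

# Stepwise pickup reward while holding (S = S_0 branch of Eq. 8):
# R_p(t) = -[tau_p(t) - tau_p(t-1)]
step_rewards = [-(tau_p[t] - tau_p[t - 1]) for t in range(1, len(tau_p))]

# Telescoping: the stepwise rewards sum to the negative one-shot wait.
assert abs(sum(step_rewards) - (-tau_p[-1])) < 1e-9

# Combine with a matching wait reward using the weights of Section 5.1.
c_m, c_p = 4.0, 1.0           # perceptual costs per unit wait, c_m > c_p
R_m = -3.0                    # hypothetical matching wait reward (Eq. 7)
R_w = c_m * R_m + c_p * step_rewards[0]   # Eq. 10 at the first step
```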
4.2. Learning the optimal delayed matching policy through Actor-Critic methods
The intricacy of reinforcement learning in a realistic ride-hailing environment calls for appropriately adapting general-purpose RL methods for improved robustness. Accordingly, we present two policy-gradient-based RL methods along with three aspects of careful adaptation.
First, outputting multi-binary actions. The actions taken in conventional RL settings are mostly discrete or continuous scalars, occasionally continuous vectors, whereas the action space in the ride-hailing environment, defined over a spatial network, is a multi-binary matrix. To resolve this representation difficulty, we propose a set of action schemes 𝒜, as presented in Fig. 7, that maps the one-dimensional action scheme ID (ranging from 1 to 2^S) chosen by the agent to the action matrix (ranging from a matrix of all zeros to a matrix of all ones) fed to the environment. Accordingly, we obtain the probability of choosing to match in each grid by summing the learned policy probabilities π(A|s) of choosing each action scheme:

p(i) = Σ_{A∈𝒜} π(A|s) · 1_{A(i)=1},  (11)

where 1_{A(i)=1} is an indicator function that equals 1 if the action in grid i, i.e., A(i), is 1 (match), and equals 0 otherwise.
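The scheme-ID-to-action mapping and Eq. (11) can be sketched as follows; the uniform policy below is a placeholder for the learned π(A|s), and we index schemes from 0 to 2^S − 1 for convenience:

```python
S = 6  # number of grids (the 3 x 2 partitioning)

# Enumerate the 2^S action schemes: scheme k maps to the binary
# vector of its base-2 representation (1 = match in that grid).
def scheme_to_action(k, S):
    return [(k >> i) & 1 for i in range(S)]

schemes = [scheme_to_action(k, S) for k in range(2 ** S)]

# A placeholder policy distribution over scheme IDs (uniform here;
# in the paper this comes from the softmax output of Fig. 7).
pi = [1.0 / len(schemes)] * len(schemes)

# Eq. 11: the per-grid matching probability is the total probability
# of all schemes whose action in grid i equals 1.
def match_prob(i):
    return sum(p for p, A in zip(pi, schemes) if A[i] == 1)
```

Under the uniform placeholder policy, exactly half of the schemes set any given grid to 1, so `match_prob(i)` returns 0.5 for every grid.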
Figure 7: Structure of the policy neural network (grid-based observations are flattened into the input layer, passed through hidden layers to a softmax output over action schemes, from which an action scheme is sampled).
Second, increasing sample efficiency and reducing sample correlation. In the standard RL setting, the agent's experiences (observations, actions, and rewards) are usually utilized only once in value and policy updates. In light of the large dimensionality of the observation O(t), this one-time utilization can amount to a substantial waste of rare experiences and hinders the rate of learning. We therefore adopt a technique called experience replay (Foerster et al., 2017) to enable the reuse of sampled experiences. An additional benefit of experience replay is the reduction of sample correlation. More specifically, the sampled observations are stepwise correlated because each observation is updated upon the previous step's carryover of supply and demand. Experience replay blends the action distribution over the agent's previous observations, which can mitigate divergence in the parameters (Mnih et al., 2013).
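A minimal fixed-capacity replay buffer of (O, A, R, O′) transitions in the spirit described above (a sketch, not the ACER implementation used in the experiments):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (O, A, R, O') transitions. Reusing
    stored transitions raises sample efficiency, and uniform sampling
    breaks the stepwise correlation of consecutive observations."""

    def __init__(self, capacity=10_000):
        # deque with maxlen silently evicts the oldest transition.
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```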
Third, training agents with specific time limits. In the standard RL literature, the interaction between an agent and the ride-hailing environment would have no terminal condition. However, the ride-hailing system's matching requires setting time limits at the end of each sampling period to smooth out the training process; this coincides with the notion of episodes in experience sampling. It is worth noting that if the time limits were treated as terminal conditions, the RL algorithm would render an incorrect value function update. To distinguish time limits from terminal conditions, we perform partial-episode bootstrapping, i.e., we continue bootstrapping at the time limit. With this treatment, agents can learn the delayed matching policy beyond the time limit (Pardo et al., 2018).
In the next subsections, we start by briefly introducing the general concept of policy gradient methods. Then, we describe two policy-gradient-based methods, Actor-Critic and ACER, of which ACER is our solution to the issues mentioned above.
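The distinction between a time limit and a true terminal condition amounts to a single branch in the TD target; a sketch of partial-episode bootstrapping:

```python
def td_target(reward, next_value, terminated, time_limit, gamma=0.99):
    """TD target distinguishing true terminal states from time limits
    (partial-episode bootstrapping, Pardo et al., 2018).

    At a genuine terminal state the future return is zero; at a mere
    time limit we keep bootstrapping from the last observation."""
    if terminated and not time_limit:
        return reward                        # episode truly ended
    return reward + gamma * next_value       # continue bootstrapping
```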
4.2.1. Overview of policy gradient methods
Policy gradient methods are a family of RL methods that skip consulting the action values and learn a parameterized policy function π_θ(a|s) directly. In general, they have many advantages over value-based methods, including stronger convergence guarantees and the capacity to learn a stochastic policy (Sutton and Barto, 2018). As discussed in the section on designing the observable state, a stochastic policy can also obtain higher average rewards than a deterministic one in a POMDP.
Policy gradient methods define a performance measure J(θ), where θ denotes the parameterization of the policy in terms of the observable state, and improve it along the direction of gradient ascent. The performance measure of a given policy is formally defined as

J(θ) = V_{π_θ}(O(0)),

where the right-hand-side value function denotes the expected return starting from state O(0) = s by following sequences of actions taken from the policy π_θ := π_θ(a|s).
The state value function V_{π_θ}(O(0)) depends on both the policy and its effect on the ride-hailing environment's state distribution. However, the latter effect is unknown, which makes it challenging to estimate ∇J(θ). Fortunately, the policy gradient theorem circumvents the derivative of the state distribution and provides an analytic expression for ∇J(θ):

∇J(θ) ∝ Σ_s η(s) Σ_a q_π(s, a) ∇π_θ(a|s) = E_π[ Σ_a q_π(O(t), a) ∇π_θ(a|O(t)) ],  (12)

where q_π is the true action-value function for policy π, and η(s) is the on-policy distribution of s under policy π.
Following the gradient ascent method, the policy parameters are updated by

θ ← θ + α Σ_a q̂(O(t), a) ∇π_θ(a|O(t)),

where q̂ is the learned approximation to q_π, and α is the learning rate.
Replacing a in Eq. (12) with A(t), we derive a vanilla policy gradient algorithm called REINFORCE, whose ∇J(θ) can be computed by

∇J(θ) ∝ E_π[G(t) ∇ln π_θ(A(t)|O(t))],

where G(t) is the total matching wait reward from t onward.
4.2.2. Actor-Critic and ACER
Unlike REINFORCE and its variant methods, Actor-Critic estimates the total return G(t) by bootstrapping, i.e., updating the value estimate of the current state from the estimated value of its subsequent state:

G(t) := V_w(O(t)) ← R(t) + γ V_w(O(t+1)).  (13)

Bootstrapping the state value function V_w(·) (termed the "Critic") introduces bias and an asymptotic dependence on the performance of the policy function (termed the "Actor"). In return, it reduces the variance of the estimated value function and accelerates learning. To meet the needs of training in the time-limit ride-hailing environment, we adapt the Actor-Critic method, as summarized in Algorithm 1, to bootstrap at the last observation in each episode by replacing the update on Line 6 with R + γ V_w(O′).
Algorithm 1: The Actor-Critic method
Input: π_θ(a|s), V_w(s), step sizes α_θ > 0, α_w > 0, discount factor γ ∈ [0, 1]
1  Initialize the policy parameters θ ∈ R^d and the state-value weights w ∈ R^{d′} (e.g., to 0)
2  for each episode do
3      Initialize O (initial state of the episode), I ← 1
4      while O is not terminal do
5          A ∼ π_θ(·|O); take action A; observe O′, R
6          δ ← [R + γ V_w(O′)] − V_w(O)
7          w ← w + α_w δ ∇V_w(O)
8          θ ← θ + α_θ I δ ∇ln π_θ(A|O)
9          I ← γ I, O ← O′
10     end
11 end
ACER, standing for Actor-Critic with experience replay, is an upgrade of Actor-Critic that enables the reuse of sampled experience (Wang et al., 2016). It integrates several recent advances in RL to help stabilize the training process in the intricate ride-hailing environment. More specifically, ACER modifies Actor-Critic by adding a buffer that stores past experiences step by step for later sampling, which is used to batch-update the value and policy neural networks. Other techniques, such as truncated importance sampling with bias correction, stochastic dueling networks, and an efficient trust region policy optimization method, are also added to enhance the learning performance. The proposed ACER method is likewise adapted to the time-limit ride-hailing simulator by enabling bootstrapping at the last observations. Since the full algorithm is rather intricate and beyond the scope of this study, we present only a general framework of ACER in Algorithm 2 to give a broad picture of how it works.
Algorithm 2: Actor-Critic with experience replay (ACER)
Input: Dense NN Actor(a|s), Critic(s), and TargetCritic(s); discount factor γ ∈ [0, 1]; experience replay batch size N; delayed update rate τ
1  for each episode do
2      Initialize O (initial state of the episode) and the buffer
3      while O is not terminal do
4          A ∼ Actor(·|O); take action A; observe O′, R from the ride-hailing simulator
5          buffer.push(O, A, R, O′)  // Collect experience
6          // On-policy update
7          Follow the Actor-Critic update
8          O ← O′
9      end
10     // Off-policy update
11     if buffer.size > N then
12         O, A, R, O′ = buffer.sample(N)  // Sample experience to replay
13         δ ← [R + γ TargetCritic(O′)] − Critic(O)
14         CriticLoss ← δ², ActorLoss ← δ ∇ln Actor(A|O)
15         CriticLoss.minimize(), ActorLoss.minimize()
16         TargetCritic.w′ ← (1 − τ) TargetCritic.w′ + τ Critic.w  // Delayed update
17     end
18 end
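Line 16 of Algorithm 2, the delayed (Polyak-style) target update, can be sketched as:

```python
def soft_update(target_weights, critic_weights, tau=0.005):
    """Delayed update of the target critic, as in Line 16 of
    Algorithm 2: w' <- (1 - tau) * w' + tau * w. A small tau keeps
    the bootstrap target slow-moving, stabilizing the Critic loss."""
    return [(1.0 - tau) * wt + tau * wc
            for wt, wc in zip(target_weights, critic_weights)]
```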
4.2.3. Baseline: fixed matching time interval strategy
Industry practice currently implements a fixed matching time interval throughout an extended period; more specifically, the matchmaker matches batched passenger requests every n time steps. A batch-learning version of this fixed strategy is to run the experiments repeatedly to generate sufficient samples for each matching time interval candidate and choose the one that yields the minimum mean wait cost. The chosen fixed matching time interval strategy is a useful baseline, especially when the arrival of passenger requests is spatiotemporally homogeneous.
It is worth mentioning that this comparison is unfair to the RL policy, as we assume no prior knowledge is available at the beginning of the learning process, whereas the baseline has found the optimal fixed matching time interval using the historical data. A fixed matching time interval can thus offer a reasonable upper bound on the wait cost for reference in a real-world scenario with nonhomogeneous arrivals and complicated interdependence.
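The batch-learning baseline described above reduces to a grid search over candidate intervals; a sketch (`run_experiment` is a stand-in for one simulated episode returning its total wait cost):

```python
def best_fixed_interval(candidates, run_experiment, repeats=10):
    """Batch-learning baseline: evaluate each fixed matching time
    interval by repeated simulation and keep the one with the minimum
    mean wait cost."""
    def mean_cost(n):
        return sum(run_experiment(n) for _ in range(repeats)) / repeats
    return min(candidates, key=mean_cost)
```

With a stylized U-shaped cost whose minimum sits at 40 s (mirroring the optimum reported in Section 5.2.2), the search returns 40.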
5. Result
In this section, we first delineate the experimental setup for testing the above-mentioned policy gradient methods; we then present and compare the algorithmic performance of these methods and the baseline strategy. Lastly, we analyze the learned policy and interpret how it outperforms the others.

5.1. Data description and experimental setup
We conduct a numerical experiment using a large-scale dataset from the Shanghai Qiangsheng Taxi Co. The one-week taxi dataset was collected in March 2011. It contains trip data, sequential taxi trajectories, and operational status for nearly 9,000 taxis, with fields including taxi ID, date, time, longitude, latitude, speed, bearing, and a status identifier (1 for occupied, 0 for idle).
Figure 8: Spatial and temporal distributions of taxi supply and demand in downtown Shanghai, China (grids A1–A3 and B1–B3 within the Outer Ring Road, spanning roughly 30 km; insets show the number of taxi and passenger requests per second over the hour of day).
Empirical evidence has shown that over 80% of taxi trips are concentrated in downtown Shanghai, outlined by the Outer Ring Road (Qin et al., 2017), which is selected as the area of interest for our experiments. To balance the computational overhead incurred by the dimension of the actions, we partition the experiment area into six grids (3 × 2), as shown in Fig. 8. Experiments with finer grids are observed to increase the training time significantly with only marginal improvement in overall performance. Three different partitioning granularities of the action are hereafter implemented and compared (i.e., 1 × 1, 1 × 2, and 3 × 2). Under this partitioning, the desirable spatial variation of arrival patterns is well captured, with grid A2 producing high demand, grids A1, B1, and B3 producing low demand, and the remaining grids producing medium demand. The simulation starts at 8:00 AM, the beginning of the morning peak hours.

The ride-hailing environment is initialized with trip requests and taxi supply sampled from the historical dataset, as shown in Fig. 9. The environment then warms up for the first 10 minutes, during which all passengers and drivers are matched instantaneously. Once the warm-up is complete, the learning agents begin interacting with the environment through the simulator. The agents must match requests arriving within the next 10 minutes, which constitutes an episode. After every episode, the environment is reset and re-warmed up before the agents resume the learning process.

In terms of the algorithm implementation, we use the Stable Baselines library (Hill et al., 2018) to train the models. Additionally, we set the unit wait costs c_m = 4 and c_p = 1 to reflect a reasonably higher perceived value of the matching wait time over the pickup wait time, and set the average pickup speed v = 20 km/h.
Figure 9: Experimental setup of the ride-hailing environment and model training (taxi trip origins/destinations between 7:50 and 8:10 AM and taxi locations at 7:50 AM are resampled per episode; the environment warms up with instantaneous matching from 7:50 to 8:00, and model training runs from 8:00 to 8:10 to produce the learned policy).
5.2. RL algorithm performance
5.2.1. Revisiting trade-offs in delayed matching
As a complement to the previous explanation of the matching delay trade-off under stationary arrivals, we now examine how this trade-off arises in real-world data. The wait time curves under different driver occupancies are illustrated in Fig. 10(a), where the driver occupancy is a parameter set to mimic a particular supply–demand ratio (100% corresponds to the driver supply observed in the historical data). The result shows that the total wait costs are monotonically increasing under either a high (≥ 80%) or a low driver occupancy³ (≤ 50%), so instantaneous matching is preferred in those cases. Under a moderate driver occupancy (70%), the total wait cost exhibits a U-shaped curve.

Since the vehicle trajectory data come from a traditional street-hailing taxi market with significant empty-car mileage, feeding 100% of the driver supply into the simulator corresponds to a high-supply scenario. As a result, the search frictions are lower, and the required number of drivers in the context of a ride-hailing platform should also be smaller. In light of this, the finding here is consistent with the previous results under stationary arrivals: it is necessary and beneficial to perform delayed matching in the balanced supply–demand scenario.

³ Under a low driver occupancy (≤ 50%), some passengers experience a wait beyond their tolerance and renege from the queue. Reneging destabilizes the matching wait time in two opposite ways: first, the reneged passengers have experienced a longer matching wait than if they had been matched in time; second, the matching of the remaining passengers is expedited to some extent. Since the extent of these impacts varies with the matching time interval, both the wait cost (Fig. 10(a)) and the matching wait time (Fig. 10(b)) turn out to be zigzag.

This trade-off is more comprehensible after further separating the total wait costs into the
Figure 10: The relationship between the total wait costs and the matching time interval (0–120 s) under the fixed matching time interval strategy for different driver occupancies (50%–100%): (a) total wait costs and (b) relative changes in the matching and pickup wait times with respect to a one-second matching time interval. Lines show the mean wait times from the experiments; shaded regions represent 95% confidence intervals.
pickup and matching wait times, as seen in Fig. 10(b). Under different driver occupancies, the relative increases of the matching wait times (dotted curves) occur at an almost fixed rate, whereas the corresponding decreases of the pickup wait times (solid curves) vary drastically. This distinction suggests that it is the slope of the pickup wait times that shapes the U-curve of the matching delay trade-off: the U-shape emerges where the pickup wait times decrease much faster than the fixed rate at which the matching wait times increase.

A key implication is that the desirable U-shaped matching delay trade-off appears only within a specific range of supply–demand ratios. The existence condition of the U-shaped trade-off curve is worth further study from a theoretical perspective; for example, future work could model when an agent should apply instantaneous matching and when it should instead follow the learned delayed matching policy.
5.2.2. Algorithmic performance
Following the discussion above, we find that sampling 70% of the existing taxi drivers is a desirable value to prompt the matching delay trade-off. Using this driver–passenger ratio as the instance for each RL algorithm, we repeat the model training for 1 million steps. The training curves and their comparison with that of the fixed matching time interval strategy are presented in Fig. 11.

In the baseline model, matching every 40 seconds is found to be optimal, and the served passengers' total wait cost is around 371.20 hours, as seen in Fig. 11(a). Note that the trade-off curve is by no means smooth with real-world data, and the wait cost is sensitive to perturbations of the matching time interval; learning a near-optimal adaptive delayed matching policy is thus quite challenging. The proposed ACER algorithm under the 3 × 2 partitioning converges to a final wait cost on a par with the baseline, while ACER under the 1 × 2 and 1 × 1 partitionings fails to converge (Fig. 11(b)). These results suggest that partitioning the area assists the agents in learning better policies. In comparison, the original Actor-Critic algorithm converges only under the 1 × 2 partitioning, to a wait cost of around 380 hours; however, its cost variance is greater than that of the ACER algorithm.

Figure 11: Performance comparison between (a) the fixed matching time interval strategy and (b) policy gradient algorithms: Actor-Critic with experience replay (ACER, main) and Actor-Critic (inset).
Table 1: Comparison of model performances

Algorithm      Mean matching wait time    Mean pickup wait time    Mean total wait time
Instant.       3.01                       675.71                   678.72
Fixed int.     21.20                      525.69                   546.89
ACER (3 × 2)   26.95                      513.22                   540.17

Note: "Instant." stands for instantaneous matching, "Fixed int." for fixed matching time interval matching, and ACER for Actor-Critic with delayed update and experience replay. Wait time unit: s/trip request.
Table 1 presents the wait time measures of the baseline strategies and the proposed ACER algorithm (the AC algorithm is excluded due to its mediocre performance). Compared with the instantaneous matching strategy, the measures show that the learned ACER policy substantially reduces the mean pickup wait time. This policy reduces the total wait time by 20.41%, which is on a par with previous results in Ke et al. (2020) (292.60 → 221.88, a 24.17% reduction). It is worth mentioning that such a comparison with prior work's RL algorithms is not strictly fair, as the environment settings and data sources are different. However, it is plausible to state that, compared with the multi-agent model in Ke et al. (2020), our single-agent model has a significantly smaller state space but achieves similar performance.
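The reported 20.41% reduction follows directly from the Table 1 totals:

```python
# Mean total wait times from Table 1 (s/trip request).
instantaneous = 678.72
acer = 540.17

# Relative reduction of the learned ACER policy over instantaneous
# matching: 138.55 / 678.72, i.e. the 20.41% reported in the text.
reduction = (instantaneous - acer) / instantaneous
```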
Figure 12: Testing the learned policy at 08:00, 14:00, and 19:00 (panels: mean matching wait, mean pickup wait, and mean total wait cost) for the baseline and ACER. Baseline: fixed matching time interval strategy. Bar heights indicate mean values; error bars show 95% confidence intervals.
In addition, we test the performance of the learned policy for other hours and days in Shanghai and show that the learned policy is robust within and across days. Specifically, we evaluate the performance of the ACER policy, which is learned with real-world data at 08:00 on a single weekday, in other weekday and weekend hours. The performance is compared with the baseline strategy using a fixed matching time interval. As Fig. 12 shows, the total wait costs yielded by the ACER policy are close to the baseline's at 08:00 on all weekdays and at 14:00 on weekends, indicating that the learned policy generalizes well within and across days for those periods. In the remaining periods, the ACER policy does not generalize well. However, this can be remedied by switching to instantaneous matching, because these periods have sufficient supply and hence instantaneous matching is more desirable, as can be seen from the near-zero matching wait time in the baseline. A natural extension of our study is thus a meta-model that switches the platform's matching policy between instantaneous and delayed matching, driven mainly by the driver occupancy.

To summarize, the proposed two-stage ACER model has demonstrated its capacity to reduce passenger wait costs and improve the quality of ride-hailing services.
5.3. Delayed matching policy interpretation
As the policy neural network is built on a high-dimensional state space, it is impossible to fully visualize the structure of the optimal policy in a low-dimensional space. Instead, we interpret the learned policy by analyzing how the matching process unfolds under it. In other words, we replay episodic samples by following the learned policy so as to reveal the decision-making step by step.
To be specific, we track the matching probability in each grid along the episode to reveal how the learned policy responds to a given environment state. Fig. 13 shows the
Figure 13: Episodic replay of a matching process under the learned policy for each grid (A1–A3 and B1–B3). Each panel pairs the observation (N_P, N_D, and the estimated arrival rates) with the matching probability along steps 0–600 s, with the expected matching time interval E[Δt] binned as 1–2 s, 2–5 s, 5–10 s, and >10 s, ranging from instantaneous to delayed matching.
grid-based matching probabilities and their corresponding environment state transitions. It can be seen that the learned delayed matching policy exhibits different time-varying matching probabilities (green curves) in different grids.
As for the three grids on the left (A1, A2, and A3), their matching probability curves look alike for the most part and can be summarized in three stages. (a) Initial stage (0 s to 150 s): they match passengers and drivers every 2 s to 5 s on average; since passengers in this period arrive with a briefly increasing estimated rate λ_P (blue dotted curves), a mild delay that accumulates arriving passengers is expected to shorten the pickup distance. (b) Stable stage (150 s to 450 s): they shorten the matching time interval and match nearly instantaneously, every 1 s to 2 s. Since the arrival rates λ_P and λ_D have stabilized at this stage, while the number of available drivers N_D (magenta curves) is markedly greater than the number of waiting passengers N_P (blue curves), delayed matching becomes unnecessary and instantaneous matching is called for. (c) Final stage (450 s to 600 s): A1 first transitions to 2 s to 5 s and then returns to 1 s to 2 s; since its N_D ≫ N_P considering the lower levels of both λ_P and λ_D, instantaneous matching remains desirable. By contrast, both A2 and A3 extend their matching time intervals beyond 10 s; since their N_D ≈ N_P considering the higher levels of both λ_P and λ_D, delayed matching is expected under such conditions.
As for the three grids on the right (B1, B2, and B3), their matching probability curves can be summarized in two stages. (a) Initial stage (0 s to 300 s): both B1 and B2 begin with a matching time interval beyond 10 s, intending to accumulate available drivers to shorten the pickup distance, while B3 matches instantaneously because very few passengers and drivers are expected to arrive in this grid. (b) Final stage (300 s to 600 s): all of the grids choose to match every 1 s to 2 s, since at this stage their N_P ≪ N_D considering the lower levels of both λ_P and λ_D. It is worth noting that, compared with B2 and B3, B1 takes an even shorter matching interval (almost 1 s) due to its very stable observation, which lacks the benefit of delayed matching.
In general, the intuition behind the learned policy is consistent with the previous analysis in Section 3.2. This implies that incorporating a queueing-based model can intrinsically capture the delayed matching trade-off in real-world ride-hailing platforms and adaptively delay the matching to reduce passengers' average wait costs. Integrating a structured model into a model-free RL framework is a promising way to improve purely data-driven approaches.
6. Conclusion
This paper presents a family of policy gradient RL algorithms for the delayed matching problem. Leveraging the sequential information collected by adjusting the matching time intervals when assigning trip requests to idle drivers, the proposed RL algorithms balance the wait time penalties against the improved matching efficiency. With prior information on the stationary demand and supply distributions, we first characterize the matching delay trade-off in a double-ended queue model. A multi-server queue model is then used to build a ride-hailing simulator as a model-training environment, weighing informativeness against dimensionality. Searching for the optimal delayed matching policy in this environment is formulated as a POMDP. We also propose a dedicated stepwise reward decomposition to address the sparse reward issue. Finally, we unfold the learned policy into episodic trajectories to garner insights.
We compare the proposed RL algorithms' performances with a baseline strategy using a fixed matching time interval and with prior work, based on real-world taxi data. Our numerical results are twofold. First, we identify the range of supply-to-demand ratios in which the learned delayed matching policies outperform the instantaneous matching strategy. Second, the learned delayed matching policy is both efficient and interpretable. Our results show that the ACER algorithm can reduce the mean total wait time by an amount on a par with the baseline. More importantly, it can lower the wait time by 20.41% compared with instantaneous matching. The learned delayed matching has physical implications: a ride-hailing platform can incorporate the structure of the learned strategy as a lookup table to adaptively decide when to use a delayed matching policy and how long these matching delays should be.

This paper leaves several extensions for future research. First, it is worth developing an analytical model to quantify the U-shaped trade-off curve at a given supply-and-demand ratio. To obtain this threshold, we must understand how the first-stage matching time interval decision connects to the second-stage bipartite matching rewards. Second, we can extend to a meta-model that switches the agent's matching policy between instantaneous and delayed matching and includes additional factors, such as passengers' utility, in the objective function. Finally, online RL is needed to implement the algorithm on a ride-hailing platform, which is challenging due to the platform's stochasticity. A model must be developed to reflect the platform's real-time dynamics so that the agents can learn effectively by interacting with it. However, the platform's stochasticity may complicate the agent's learning target, resulting in an unstable RL algorithm with no convergence guarantees.
Acknowledgments
The work described in this paper was partly supported by research grants from the National Science Foundation (CMMI-1854684; CMMI-1904575) and DiDi Chuxing. The first author (G. Qin) is grateful to the China Scholarship Council (CSC) for financially supporting his visiting program at the University of Michigan (No. 201806260144). We also thank the Shanghai Qiangsheng Taxi Company for providing the taxi GPS dataset.
References
AlAbbasi, A.O., Ghosh, A., Aggarwal, V., 2019. Deeppool: Distributed modelfree algo
rithm for ridesharing using deep reinforcement learning. IEEE Transactions on Intelligent
Transportation Systems 20, 4714–4727.
AlonsoMora, J., Samaranayake, S., Wallar, A., Frazzoli, E., Rus, D., 2017. Ondemand high
capacity ridesharing via dynamic tripvehicle assignment. Proceedings of the National
Academy of Sciences 114, 462–467.
Aouad, A., Sarita¸c, ¨
O., 2020. Dynamic stochastic matching under limited time, in: Proceed
ings of the 21st ACM Conference on Economics and Computation, pp. 789–790.
Ashlagi, I., Burq, M., Dutta, C., Jaillet, P., Saberi, A., Sholley, C., 2018. Maximum weight
online matching with deadlines. arXiv preprint arXiv:1808.03526 .
Qin et al. You can read the published version here Transp. Res. Part C
Azar, Y., Ganesh, A., Ge, R., Panigrahi, D., 2017. Online service with delay, in: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 551–563.
Cao, P., He, S., Huang, J., Liu, Y., 2020. To pool or not to pool: Queueing design for large-scale service systems. Operations Research.
Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H., Kohli, P., Whiteson, S., 2017. Stabilising experience replay for deep multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR. pp. 1146–1155.
Guo, S., Liu, Y., Xu, K., Chiu, D.M., 2017. Understanding passenger reaction to dynamic prices in ride-on-demand service, in: 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), IEEE. pp. 42–45.
Hare, J., 2019. Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281.
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., 2018. Stable Baselines. https://github.com/hill-a/stable-baselines.
Jaakkola, T., Singh, S.P., Jordan, M.I., 1995. Reinforcement learning algorithm for partially observable Markov decision problems, in: Advances in Neural Information Processing Systems, pp. 345–352.
Ke, J., Yang, H., Ye, J., 2020. Learning to delay in ridesourcing systems: a multi-agent deep reinforcement learning framework. IEEE Transactions on Knowledge and Data Engineering.
Kim, D.S., Smith, R.L., 1995. An exact aggregation/disaggregation algorithm for large scale Markov chains. Naval Research Logistics (NRL) 42, 1115–1128.
Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G., Ye, J., 2019. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning, in: The World Wide Web Conference, pp. 983–994.
Lin, K., Zhao, R., Xu, Z., Zhou, J., 2018. Efficient large-scale fleet management via multi-agent deep reinforcement learning, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1774–1783.
Luo, Q., Huang, Z., Lam, H., 2019. Dynamic congestion pricing for ridesourcing traffic: a simulation optimization approach, in: 2019 Winter Simulation Conference (WSC), IEEE. pp. 2868–2869.
Mao, C., Liu, Y., Shen, Z.J.M., 2020. Dispatch of autonomous vehicles for taxi services: A deep reinforcement learning approach. Transportation Research Part C: Emerging Technologies 115, 102626.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Özkan, E., Ward, A.R., 2020. Dynamic matching for real-time ride sharing. Stochastic Systems 10, 29–70.
Pardo, F., Tavakoli, A., Levdik, V., Kormushev, P., 2018. Time limits in reinforcement
learning, in: International Conference on Machine Learning, pp. 4045–4054.
Qin, G., Li, T., Yu, B., Wang, Y., Huang, Z., Sun, J., 2017. Mining factors affecting taxi drivers' incomes using GPS trajectories. Transportation Research Part C: Emerging Technologies 79, 103–118.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT Press.
Wang, H., Yang, H., 2019. Ridesourcing systems: A framework and review. Transportation
Research Part B: Methodological 129, 122–155.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N., 2016. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
Williamson, D.P., Shmoys, D.B., 2011. The Design of Approximation Algorithms. Cambridge University Press.
Xu, Z., Li, Z., Guan, Q., Zhang, D., Li, Q., Nan, J., Liu, C., Bian, W., Ye, J., 2018. Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 905–913.
Xu, Z., Yin, Y., Chao, X., Zhu, H., Ye, J., 2020a. A generalized fluid model of ride-hailing systems. Working Paper, University of Michigan.
Xu, Z., Yin, Y., Ye, J., 2020b. On the supply curve of ride-hailing systems. Transportation Research Part B: Methodological 132, 29–43.
Yan, C., Zhu, H., Korolko, N., Woodard, D., 2020. Dynamic pricing and matching in ride-hailing platforms. Naval Research Logistics (NRL) 67, 705–724.
Yang, H., Qin, X., Ke, J., Ye, J., 2020. Optimizing matching time interval and matching radius in on-demand ride-sourcing markets. Transportation Research Part B: Methodological 131, 84–105.
Zha, L., Yin, Y., Yang, H., 2016. Economic analysis of ride-sourcing markets. Transportation Research Part C: Emerging Technologies 71, 249–266.
Appendix A. Summary of notations
Table A.2: Notation list of variables and parameters
Notation                      Description
t                             A discrete time step in {0, 1, ..., T}
N_P(t)                        Number of unmatched passenger requests at t
N_D(t)                        Number of idle drivers at t
S(t)                          State of the ride-hailing environment at t
A(t)                          Matching decision; conducting a bipartite matching if A(t) = 1 and holding if A(t) = 0
p(s'|s, a)                    Transition probability from s to s' after taking a
R(t)                          Reward received from the ride-hailing environment at t
r(s, a, s')                   Wait reward at s' transitioned from s after taking a
π(a|s)                        Policy function, the probability of taking action a at s
V_π(s)                        State value function, the expected return starting from s by following π
γ                             Discount factor, γ ∈ [0, 1]
q_π(s, a)                     Action value function, the expected total reward starting from s by taking a and then following π
O(t)                          Partial observation at t
λ_P(t)                        Estimated arrival rate of passenger requests at t
λ_D(t)                        Estimated arrival rate of idle drivers at t
n_P(t1:t2)                    Actual number of passenger requests arriving between t1 and t2
n_D(t1:t2)                    Actual number of idle drivers arriving between t1 and t2
N_M(t)                        Number of matched passenger-driver pairs at t
V(t)                          Expected total reward received from the ride-hailing environment at t
R_m(t)                        Reward regarding the matching wait cost at t
R_p(t)                        Reward regarding the pickup wait cost at t
R_w(t)                        Reward returned at t, a combination of R_m(t) and R_p(t)
τ_p(·, ·)                     Total pickup wait cost function
c_m                           Unit matching wait cost
c_p                           Unit pickup wait cost
θ                             Parameters of the policy function or weights of the actor neural network
w                             Parameters of the state value function or weights of the critic neural network
π_θ(a|s)                      Parameterized policy function or actor neural network
V_w(s)                        Parameterized state value function or critic neural network
J(θ)                          Policy performance measure
α_θ                           Learning rate for the actor neural network
α_w                           Learning rate for the critic neural network
G(t)                          Sampled total reward at t
η(s)                          Distribution of s
S_0, S_1, S                   Subarea 0, subarea 1, and the entire area
(x, y)                        Grid (x, y) in a partitioned area
o(x, y, t)                    Observation of grid (x, y) at t
N_P(x, y, t), N_D(x, y, t)    Number of unmatched passenger requests/idle drivers of grid (x, y) at t
λ_P(x, y, t), λ_D(x, y, t)    Estimated arrival rate of passenger requests/idle drivers of grid (x, y) at t
R_P(S, t)                     Reward regarding the pickup wait cost of subarea S at t, S ∈ {S_0, S_1}
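As a reading aid for the actor-critic symbols listed above (θ, w, π_θ, V_w, α_θ, α_w), the one-step update they parameterize can be sketched with linear function approximation over a binary action set. The features, softmax parameterization, and learning rates here are illustrative stand-ins, not the paper's neural-network architecture.

```python
import math


def softmax_policy(theta, features):
    """pi_theta(a|s) over actions {0, 1} with linear action preferences."""
    prefs = [sum(t * f for t, f in zip(theta[a], features)) for a in (0, 1)]
    m = max(prefs)                       # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, w, features, action, reward, next_features,
                      gamma=0.99, alpha_theta=0.01, alpha_w=0.05, done=False):
    """One-step TD actor-critic update with linear V_w and softmax pi_theta."""
    v = sum(wi * f for wi, f in zip(w, features))
    v_next = 0.0 if done else sum(wi * f for wi, f in zip(w, next_features))
    delta = reward + gamma * v_next - v  # TD error
    # Critic update: w <- w + alpha_w * delta * grad V_w(s)
    for i, f in enumerate(features):
        w[i] += alpha_w * delta * f
    # Actor update: theta <- theta + alpha_theta * delta * grad log pi_theta(a|s)
    probs = softmax_policy(theta, features)
    for a in (0, 1):
        grad = (1.0 if a == action else 0.0) - probs[a]
        for i, f in enumerate(features):
            theta[a][i] += alpha_theta * delta * grad * f
    return delta
```

Calling `actor_critic_step` once per environment transition reproduces the standard online actor-critic loop: the critic's TD error δ both corrects V_w and scales the policy-gradient step on θ.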