TransferLight: Zero-Shot Traffic Signal Control on any Road-Network

Johann Schmidt*1, Frank Dreyer*, Sayed Abid Hashimi, Sebastian Stober
Artificial Intelligence Lab
Otto-von-Guericke University
Magdeburg, Germany
1johann.schmidt@ovgu.de
Abstract
Traffic signal control plays a crucial role in urban mobility.
However, existing methods often struggle to generalize be-
yond their training environments to unseen scenarios with
varying traffic dynamics. We present TransferLight, a novel
framework designed for robust generalization across road-
networks, diverse traffic conditions and intersection geome-
tries. At its core, we propose a log-distance reward function,
offering spatially-aware signal prioritization while remain-
ing adaptable to varied lane configurations—overcoming the
limitations of traditional pressure-based rewards. Our hierar-
chical, heterogeneous, and directed graph neural network ar-
chitecture effectively captures granular traffic dynamics, en-
abling transferability to arbitrary intersection layouts. Using
a decentralized multi-agent approach, global rewards, and
novel state transition priors, we develop a single, weight-tied
policy that scales zero-shot to any road network without re-
training. Through domain randomization during training, we
additionally enhance generalization capabilities. Experimen-
tal results validate TransferLight’s superior performance in
unseen scenarios, advancing practical, generalizable intelli-
gent transportation systems to meet evolving urban traffic de-
mands.
1 Introduction
Coordinating traffic at intersections is a major challenge for
urban planning. Due to the high and ever-increasing volume
of traffic in city centres, intersections can quickly become a
bottleneck if traffic is not properly coordinated, which can
lead to severe traffic congestion. To avoid congested roads,
signalized intersections are used to safely and efficiently coordinate
traffic flows. Traffic Signal Control (TSC) aims to
optimise the traffic flow and related measures (Wang, Ab-
dulhai, and Sanner 2023).
A common solution for TSC is to view it as an optimization problem by
designing a mathematical model of the traffic environment using
conventional traffic engineering theories and finding a closed-form
solution based on that model. Provided that the assumptions inherent to
the underlying traffic models are satisfied, such solutions produce good
results in theory. However, assumptions such as uniform traffic (Webster
1958; Little, Kelson, and Gartner 1981; Roess, Prassas, and McShane 2004)
or unlimited vehicle storage capacity of lanes (Varaiya 2013) are
difficult or even impossible to observe in reality, which is why such
solutions are not optimal in practice, especially when traffic demand is
high and fluctuates significantly. Hence, the field pivoted towards
adaptive signal control policies, which are learned from data through
deep reinforcement learning (RL) (Wei et al. 2021). Yet, most existing
works still struggle to effectively transfer their learned policies to
changing traffic conditions.
*These authors contributed equally.
Copyright © 2025, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
[Figure 1 diagram: for any intersection geometry, a weight-tied State
Encoder and Policy Head are applied per intersection; trained on
randomised road-networks and dynamics at training-time, applicable to any
network at test-time.]
Figure 1: Our proposed traffic signal controller learns a general policy
for flexible phase prediction during training. Due to the weight-tied
models, we can apply the learned model to any road-network during
inference.
Rigid State and Action Spaces The majority of RL-based
approaches employ overly rigid data structures to encode the
mapping from states to actions. Numerous studies simply
encode states and actions as fixed-size vectors or spatial ma-
trices (Zheng et al. 2019; Wang et al. 2024). This approach
inherently constrains the learned policy to a specific inter-
section geometry, which is defined by the structural arrange-
ment of lanes, movements, and phases. Consequently, the
reusability of such models is limited to networks of homo-
geneous intersections with identical geometries (Wei et al.
2018, 2019a). In an attempt to increase flexibility, states and
actions are (zero)-padded (Zheng et al. 2019; Chen et al.
2020), introducing upper bounds to the system’s diversity.
However, due to the combinatorial explosion of possible in-
tersection layouts, the required number of paddings grows
exponentially (Chu et al. 2019), potentially compromising
training efficiency and generalization ability of the model.
arXiv:2412.09719v1 [cs.AI] 12 Dec 2024
Rigid Traffic Environments Another significant issue in
current RL-based approaches is that the variability of traffic
dynamics is not adequately accounted for during the training
process. The majority of methods employ identical spatio-
temporal traffic patterns across all training episodes (Wei
et al. 2018, 2019a). While these models may exhibit im-
pressive performance within this constrained setting, they
typically suffer from substantial performance degradation
when confronted with real-world variability (Zheng et al.
2019; Yoon et al. 2021). This performance decline can be
attributed to overfitting and the drastically constrained ex-
ploration space during training (Jiang, Kolter, and Raileanu
2024). The limited exposure to diverse traffic scenarios dur-
ing the learning phase results in models that lack robust-
ness and adaptability to the complex and dynamic nature of
real-world traffic conditions (Korecki, Dailisan, and Helbing
2023). Consequently, these models struggle to generalize ef-
fectively to the multifaceted and often unpredictable traffic
patterns encountered in practical applications, highlighting a
critical gap between laboratory performance and real-world
efficacy.
Degenerated Reward Functions Reinforcement Learn-
ing is driven by the choice of reward to be optimised. As
long-term objectives, like travel time, depend on a sequence
of actions, credit assignment is difficult and might impact the
training efficiency drastically. Hence, short-term objectives
are used instead, like waiting time or queue length (Zheng
et al. 2019; Devailly, Larocque, and Charlin 2022, 2024),
or weighted combinations of them (Wei et al. 2018; Yoon
et al. 2021; Wu, Kim, and Ma 2023). Unfortunately, these
rewards do not correlate, leading to different optima (Wei
et al. 2019a). As a solution, Wei et al. (2019a) showed that a
max-pressure control policy stabilises the traffic system over
time, which lets queue length and travel time settle in local
optima. Based on these guarantees, pressure-based rewards
are frequently used in recent works (Oroojlooy et al. 2020).
However, as pressure is computed as a mean, it is invari-
ant to various transformations of the input signal. Different
spatial locations of heavy traffic loads along the lane do not
influence the indicator, leading to misjudgments of states.
Towards General Control Policies The limitations of
existing traffic signal control (TSC) approaches, particu-
larly their inability to generalize across intersection con-
figurations and traffic conditions, necessitate a more robust
and flexible solution.1 We present TransferLight, a novel model that
addresses these challenges by leveraging graph-structured representations
and advanced training techniques. Our contributions include:
• We introduce a novel log-distance reward function that provides a
  continuous, spatially-aware signal prioritizing near-intersection
  vehicles while remaining bounded and adaptable to diverse lane
  configurations, addressing key limitations of traditional
  pressure-based rewards (Wei et al. 2019a).
• Building upon prior research (Yoon et al. 2021; Devailly, Larocque, and
  Charlin 2022, 2024), we propose a heterogeneous graph neural network
  (Kipf and Welling 2017) architecture for state encoding. This approach
  captures fine-grained traffic dynamics and enhances generalization,
  enabling universal applicability of the learned policy to varied
  intersection and road network geometries.
• We utilize domain randomization to vary both static and dynamic
  features of the traffic environment during the training process,
  similar to Devailly, Larocque, and Charlin (2022, 2024). This approach
  enhances the model's generalization capabilities to novel scenarios.
• We use a decentralised multi-agent approach with a global reward and
  novel state transition priors to foster proactive decisions. This
  allows us to learn a single shared (weight-tied) general policy that
  can be zero-shot scaled to any road-network during test-time without
  re-training.
1 The ideal solution would be a model capable of maintaining consistent,
high-quality performance across a wide spectrum of road-networks and
traffic dynamics. Once the general control policy is obtained, it can be
applied to any (urban) environment.
By combining these elements, TransferLight overcomes the
limitations of previous approaches, offering a unified frame-
work for learning robust and adaptive traffic signal control
policies. Our experimental results demonstrate that TransferLight
achieves good performance on novel (unseen) scenarios, making a
significant step towards practical, generalizable and intelligent
transportation systems.
2 Preliminaries
Traffic Signal Control We define a road network as a graph
$G = (\mathcal{V}, \mathcal{I} \cup \mathcal{O})$, where
$\mathcal{V} = \{v_k \mid k \in [1, 2, \dots, V]\}$ is the set of $V$
signalised junctions. This geometric structure defines the environment
for an agent to act on. For notational convenience, we differentiate
between incoming lanes $\mathcal{I}_v$ and outgoing lanes $\mathcal{O}_v$
for each intersection $v \in \mathcal{V}$.2 For situations where we do
not need to differentiate between incoming and outgoing lanes, we use
$\ell \in \mathcal{I} \cup \mathcal{O}$ to denote an arbitrary lane. Each
lane defines a finite one-dimensional coordinate space
$\ell \subset \mathbb{R}^+ \setminus \{0\}$ with its origin at the
intersection's centre.
As in (Wei et al. 2019c; Urbanik et al. 2015), we define $m_v = (i, o)$
to be a movement from $i \in \mathcal{I}_v$ to $o \in \mathcal{O}_v$ with
$m_v \in \mathcal{M}_v \subseteq \mathcal{I}_v \times \mathcal{O}_v$. A
movement can be either permitted, prohibited, or protected. A movement is
protected if the associated road users have priority and do not have to
give way to other movements. A movement is prohibited if the signal is
red, and it is permitted if the associated road users must yield the
right-of-way to the colliding traffic before they are allowed to cross
the intersection. A phase $\phi$ describes a timing procedure associated
with the simultaneous operation of one or more traffic movements (Urbanik
et al. 2015) with a green interval, a yellow change interval and an
optional red clearance interval. Let $\phi_v \in \Phi_v$ be a phase at
intersection $v$, and $\mathcal{M}_{\phi_v} \subseteq \mathcal{M}$ be the
associated right-of-way movements. The phase set $\Phi_v$ defines the
discrete action space for an agent acting on $v$.
This defines the static part of the environment. The dynamics are given
by a set of moving vehicles
$\mathcal{C} = \{c_k \mid k \in [1, 2, \dots, C]\}$. These are modelled
as points on the one-dimensional coordinate space $\ell$. We define a
state $\mathcal{C}^t$ by the vehicle positions at a time point $t$.
Hence, each vehicle's motion is captured by $c(t)$, which is evaluated at
an a priori defined sampling frequency of the sensor (or the simulation).
2 Such that $\mathcal{I} := \bigcup_{v \in \mathcal{V}} \mathcal{I}_v$
and $\mathcal{O} := \bigcup_{v \in \mathcal{V}} \mathcal{O}_v$.
Cooperative Markov Games In a multi-intersection road network, agent
coordination is crucial for efficient traffic flow. This scenario extends
the Markov Decision Process to a Markov Game (Littman 1994). At each time
step $t$, every agent $v \in \mathcal{V}$ observes the environment state
$\mathcal{C}^t \in \mathcal{C}$ and selects an action
$\phi_v^t \in \Phi_v$ using its policy
$\pi_v(\phi_v^t \mid \mathcal{C}^t) : \Phi_v \times \mathcal{C} \mapsto \mathbb{R}^+$.
The environment then transitions to $\mathcal{C}^{t+1}$ according to
$T(\mathcal{C}^{t+1} \mid \phi^t, \mathcal{C}^t) : \mathcal{C} \times \Phi \times \mathcal{C} \mapsto \mathbb{R}^+$,
where $\Phi = \bigcup_{v \in \mathcal{V}} \Phi_v$ is the joint action
space. Each agent receives a reward $r_v^{t+1}$ based on
$R_v(\mathcal{C}^t, \phi^t, \mathcal{C}^{t+1}) : \mathcal{C} \times \Phi \times \mathcal{C} \mapsto \mathbb{R}$,
denoted as $R_v^t$ for brevity.
In fully cooperative Markov Games, the global reward is equivalent to
individual rewards ($R^t = R_v^t, \forall v \in \mathcal{V}$) or a team
average ($R^t = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} R_v^t$).
While the former entails aligned goals for individual agents, the latter
allows agents to pursue distinct objectives that contribute to the
overall team benefit. Since individual and global rewards are functions
of joint actions, value functions also depend on the joint policy
$\pi = \{\pi_v \mid v \in \mathcal{V}\}$ of all agents. Based on the
reward definition, we can define an individual state-value function
$\mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R_v^{t+k} \mid \mathcal{C}^t]$
or a global state-value function
$\mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R^{t+k} \mid \mathcal{C}^t]$.
The objective is to find an optimal joint policy $\pi^*$ that maximizes
the expected discounted sum of global rewards:
$$\pi^* = \operatorname*{argmax}_{\pi} \; \mathbb{E}_{\mathcal{C}^t \sim \mu} \, \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R^{t+k} \mid \mathcal{C}^t \right], \tag{1}$$
where $\mu(\mathcal{C}^t \mid \pi)$ is the stationary distribution of the
Markov chain under joint policy $\pi$.3
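As a small illustration of the quantities inside Eq. (1), the following sketch computes a team-average global reward from per-agent rewards and the discounted sum of a finite reward sequence. It is not taken from the paper's code; the function names, the dictionary layout and the finite horizon are assumptions made for illustration only.

```python
def team_average(agent_rewards):
    """Global reward as the mean over per-agent rewards R_v^t.

    `agent_rewards` maps a (hypothetical) agent id to its reward.
    """
    return sum(agent_rewards.values()) / len(agent_rewards)


def discounted_return(global_rewards, gamma=0.99):
    """Discounted sum of global rewards: sum_k gamma^k * R^{t+k}.

    The infinite sum of Eq. (1) is truncated to the length of the
    given list, which is an assumption of this sketch.
    """
    total = 0.0
    # Accumulate from the last reward backwards: G_k = R_k + gamma * G_{k+1}.
    for reward in reversed(global_rewards):
        total = reward + gamma * total
    return total


# Two intersections with negative (cost-like) rewards at one time step.
step_reward = team_average({"v1": -1.0, "v2": -3.0})
# Three time steps with a constant global reward of 1.0.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The expectation over states and the policy in Eq. (1) would be estimated by averaging such returns over sampled trajectories.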
3 Lifting Pressure-based Rewards
Under mild assumptions,4 a max-pressure control policy stabilises the
traffic system over time (Wei et al. 2019a). This means that measures
like queue length, throughput, and travel time settle in local optima. We
build upon these theoretical results by eliminating a remaining
shortcoming of pressure-based systems.
Degeneracies of Pressure We prove that the pressure of a movement suffers
from several degeneracies introducing plateaus to the reward surface,
which prohibit convergence to superior extrema. The pressure $\rho(m)$ of
a movement $m \in \mathcal{M}$ (Wei et al. 2019a) is defined by the
difference between the incoming and outgoing vehicle densities, such that
$$\rho(m) = \frac{C_i}{|i|} - \frac{C_o}{|o|} \quad \text{with } m = (i, o), \tag{2}$$
where $|i|, |o| \in \mathbb{R}^+ \setminus \{0\}$ are the lengths of the
lanes. Densities are computed by the arithmetic mean over vehicles,5
which comes with the following fundamental properties arising from the
linearity of the operation:
• permutation invariance, $\mathrm{mean}(x) = \mathrm{mean}(\pi x)$,
• translation equivariance, $\mathrm{mean}(x + b) = \mathrm{mean}(x) + b$,
• scale equivariance, $\mathrm{mean}(bx) = b \, \mathrm{mean}(x)$,
for any sequence $x \in \mathbb{R}^n$, permutation matrix
$\pi \in \{0, 1\}^{n \times n}$, and $b \in \mathbb{R}$. These symmetries
also apply locally, such that
$\mathrm{mean}(\{x_1 + b, x_2 - b, \dots, x_n\}) = \mathrm{mean}(\{x_1, x_2, \dots, x_n\})$
for instance. By these relations, equivalence classes are formed, i.e.,
subsets with constant outputs under these transformed inputs. Hence, the
pressure stays constant when permuting the positions of vehicles,
shifting vehicles along the lane or scaling the distribution of vehicles.
The latter two are of specific interest, as the first one would not
change the state $\mathcal{C}^t$.
3 Note that Eq. (1) is permutation invariant with respect to
$\mathcal{V}$. The geometric structure of the road network needs to be
induced in the state representation.
4 That is, no physical queue expansion for non-arterial environments and
admissible average demand.
[Figure 2 panels: two example lane states, each scored by "pressure" and
by "ours".]
Figure 2: Pressure (see Eq. (2)) is symmetric to vehicle position
translations within the lane's coordinate space. Our more expressive
measure breaks this symmetry.
Modelling the reward function by (pure) pressure maps
these equivalence classes on the reward surface and with
that on the loss surface. As gradients on these plateaus are
exactly zero, gradient-based optimisation will fail to leave
these regions. This might be mitigated by a drastically in-
creased momentum term (Kingma and Ba 2014), allowing
the model to jump over these regions. However, the model
can extract valuable information from these regions iff these
degeneracies are lifted.
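The plateau argument can be made concrete with a minimal sketch: two lane states with the same number of vehicles yield the same density (and hence the same pressure contribution), while a summed log-distance separates them. The helper names and the normalisation of positions to (0, 1] are assumptions of this sketch, not part of the paper.

```python
import math


def density(positions, lane_length):
    """Lane density as used by pressure: vehicle count over lane length."""
    return len(positions) / lane_length


def log_energy(positions, eps=1e-3):
    """Sum of log-distances log(c + eps) over all vehicles on a lane.

    Positions are assumed to be normalised distances to the
    intersection's centre in (0, 1].
    """
    return sum(math.log(c + eps) for c in positions)


near = [0.05, 0.10, 0.15]  # three vehicles queued close to the intersection
far = [0.80, 0.90, 0.95]   # the same three vehicles far upstream

# The density cannot tell the two states apart ...
same_density = density(near, 1.0) == density(far, 1.0)
# ... whereas the log-distance energy is clearly different.
energy_gap = log_energy(far) - log_energy(near)
```

Because the density is identical for both states, any reward built purely on it is flat between them, which is the degeneracy discussed above.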
Lifting the Degeneracies We argue that the degeneracies of Eq. (2) can be
lifted by inducing spatial information. As stated in Section 2, every
vehicle $c$ can be interpreted as a point on the lane's one-dimensional
coordinate space $\ell$. Using the Euclidean distances does not lift the
degeneracies.6 Instead, we use the log-distance, defined by
$\log(c + \epsilon) \in [\log \epsilon \to -\infty, \log(1 + \epsilon) \to 0]$,
where $\epsilon \to 0$. The farther away a vehicle $c$ is from $v$, the
larger the log-distance (closing in on 0). This can be computed for an
entire lane $\ell \in \mathcal{I} \cup \mathcal{O}$ by
$\{\log(c + \epsilon) \mid c \in \mathcal{C}_\ell\} = \log(\mathcal{C}_\ell + \epsilon)$.
5 We can interpret the traffic density as
$\frac{1}{|\ell|} \sum_{p \in \ell} 1_p$, where $1_p \in \{0, 1\}$ is an
indicator returning 1 if there is a vehicle at the spatial location $p$
on $\ell$.
6 For example, a configuration with a single vehicle at a large distance
from the intersection's centre would yield the same metric [continued]
We interpret the cumulated log-distances as the negative energy of the
system, analogously to a simplified potential energy of a system of
particles, where the energy increases with distance between particles, as
leveraged in (Schmidt, Köhler, and Borstell 2024). The goal is to
minimise this energy, i.e., push the densities away from the
intersection's centre. We interpret the total log-distance as the energy
$E_\ell \in \mathbb{R}^+$ of the lane,
$$\hat{E}_\ell = \sum_{c \in \mathcal{C}_\ell} \log(c + \epsilon). \tag{3}$$
This breaks both the translation and scale equivariance.7 Therefore,
$\hat{E}_\ell$ lifts the degeneracies of $E_\ell$ using the
symmetry-breaking log-distance formulation. We define the cumulated
vehicle positions on a lane to be its energy,
$E_\ell := \sum_{c \in \mathcal{C}_\ell} c \in \mathbb{R}^+$. With this,
we can formulate the average log-pressure by the cumulated and normalised
log-distances,
$$\sum_{(i, o) \in \mathcal{M}_v} \left( \frac{1}{|i|} \hat{E}_i - \frac{1}{|o|} \hat{E}_o \right). \tag{4}$$
With this we define the reward
$$r_v = \sum_{(i, o) \in \mathcal{M}_v} \left( \frac{1}{|i|} \hat{E}_i - \frac{1}{|o|} \hat{E}_o \right). \tag{5}$$
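A minimal sketch of how the log-pressure reward of Eq. (5) could be evaluated for one intersection is given below. It assumes that the reward is the plain sum over movements of the length-normalised incoming energy minus the length-normalised outgoing energy, that vehicle positions are normalised to (0, 1], and that lane lengths are given separately; the lane ids, data layout and function names are hypothetical.

```python
import math


def lane_energy(positions, eps=1e-3):
    """Log-distance energy of a lane: sum of log(c + eps), cf. Eq. (3)."""
    return sum(math.log(c + eps) for c in positions)


def log_pressure_reward(movements, vehicles, lengths, eps=1e-3):
    """Sum over movements (i, o) of E_i / |i| - E_o / |o|.

    movements: list of (incoming lane id, outgoing lane id) pairs
    vehicles:  lane id -> list of normalised vehicle positions
    lengths:   lane id -> lane length (assumed here to be in metres)
    """
    total = 0.0
    for i, o in movements:
        total += lane_energy(vehicles[i], eps) / lengths[i]
        total -= lane_energy(vehicles[o], eps) / lengths[o]
    return total


movements = [("in_N", "out_S")]
lengths = {"in_N": 100.0, "out_S": 100.0}
queued_near = {"in_N": [0.02, 0.05], "out_S": []}
queued_far = {"in_N": [0.80, 0.90], "out_S": []}

r_near = log_pressure_reward(movements, queued_near, lengths)
r_far = log_pressure_reward(movements, queued_far, lengths)
```

Under these assumptions, a queue standing right at the stop line yields a lower reward than the same number of vehicles far upstream, which is the spatial prioritisation that a density-based pressure cannot express.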
We focus on cooperative Markov Games (Littman 1994),
where agents have an incentive to work together to achieve
a team goal, which can be expressed by a global reward
function R(t). In such Multi-Agent settings, sharing infor-
mation among agents is key, as the other agents induce
otherwise unpredictable dynamics (non-stationary environ-
ments), which limits cooperation (Zhang, Yang, and Zha
2020). This can be done by joint state and action spaces,
which, however, require supportive mechanisms to cope
with the exponentially growing joint spaces (Choudhury
et al. 2021). Hence, action and state spaces are often dis-
joint and agents are trained by a global reward function to
encourage cooperation (Wei et al. 2019a; Chen et al. 2020;
van der Pol et al. 2022). In the following, we propose our
state encoding to cope with these challenges.
4 Graph-Structured State Encoding
Following Devailly, Larocque, and Charlin (2022, 2024), we utilize a
graph neural network on a heterogeneous graph to encode both static and
dynamic state characteristics of individual intersections. This allows us
to encode any intersection geometry regardless of the length of lanes and
the number of lanes, approaches, movements and phases. By sharing the
parameters across all intersections in the network, the model is
encouraged to converge to a policy that generalises across various
intersection configurations and traffic conditions. We contextualise
encodings by state transition priors to allow for proactive decisions
(which enable green waves). We provide an illustration of our state
encoding in Fig. 3. We will discuss its core elements in the following.
[footnote 6, continued] value as a configuration with multiple vehicles
positioned closer to the centre, provided the sum of their distances is
equal to that of the single distant vehicle.
7 This follows from $\log(c + \epsilon + b) \neq b + \log(c + \epsilon)$
for $b \neq 0$ and $\log(bc + \epsilon) \neq b \log(c + \epsilon)$.
[Figure 3 diagram: a four-approach intersection (N, S, E, W) whose
encoded state feeds the policy head.]
Figure 3: Our hierarchical state space encoding uses a position-encoded
segment-density set on the lowest level. This information is embedded and
aggregated to form movement representations, which then undergo another
pass to the phase level. On the phase level, we have intra-level updates;
otherwise, information is passed bottom-up along the directed
heterogeneous graph structure.
Lane Partitioning As stated in Section 3, the density estimate over
$\ell$ suffers from degeneracies. A state encoder using these estimates
as inputs would inherit the degeneracies, which would smooth out update
nuances. Instead, we bound the degeneracies to only act in limited
sub-spaces. We define a metric $ds$ to partition $\ell$ into
$\frac{|\ell|}{ds}$ equally-sized segments, which defines a
hyperparameter to control the resolution of the measure applied on top of
lanes. In each segment, we estimate the density by
$\rho(s) = \frac{1}{ds} C_s$. This factors out the number of vehicles for
the input to the encoder (similar to the density estimate over $\ell$).
Such a representation ensures that the dynamical traffic system is fully
described (for a proof refer to Wei et al. (2019a)).
While often overlooked in previous studies (Zheng et al.
2019; Oroojlooy et al. 2020; Zang et al. 2020; Yoon et al.
2021), the length of lanes or segments plays a crucial role
in traffic dynamics. Our approach employs a uniform and
constant segment length ds, thereby streamlining the input
feature set compared to related works (Devailly, Larocque,
and Charlin 2022, 2024). This design choice allows the pol-
icy to implicitly learn length-related characteristics, includ-
ing segment capacity, enhancing the model’s adaptability to
diverse road networks compared to prior work (Wei et al.
2019a). However, $ds$ has to be small enough8 to minimise the impact of
local degeneracies, as discussed in Section 3.
8 Note that $\frac{|\ell|}{ds}$ can have a remainder, which we drop. The
impact is minor for any reasonable choice of $ds$, as the cut-off is done
at the end of the lane (maximally distant to the intersection). We choose
$ds$ to be 10 meters for our experiments.
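The partitioning described above can be sketched as follows. The function assumes that vehicle positions are given in metres from the intersection's centre; the function name and signature are illustrative and not taken from the paper's code.

```python
def segment_densities(positions, lane_length, ds=10.0):
    """Per-segment densities rho(s) = C_s / ds for one lane.

    The lane is cut into floor(lane_length / ds) segments starting at
    the intersection; the remainder at the far end is dropped.
    """
    n_segments = int(lane_length // ds)
    counts = [0] * n_segments
    for p in positions:
        idx = int(p // ds)
        # Vehicles in the dropped remainder (or at invalid positions)
        # are ignored.
        if 0 <= idx < n_segments:
            counts[idx] += 1
    return [c / ds for c in counts]


# A 45 m lane with ds = 10 m gives four segments; the last 5 m are cut off.
rho = segment_densities([1.0, 4.0, 12.0, 47.0], lane_length=45.0)
```

The resulting list has one entry per segment, ordered from the segment closest to the intersection outwards, so its length varies with the lane length.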
Transition Prior Modelling only the dynamics within the boundaries of the
intersection would result in reactivity rather than proactivity,
especially when $\ell$ is small. To fix this, we interpret the
road-network as a coordination graph, which allows us to induce
additional context to each agent $v \in \mathcal{V}$. We define the
connectivity of the coordination graph by movements,
$$\mathcal{M}_\ell^{\text{out}} := \{(\ell, o) \mid i = \ell;\ (i, o) \in \mathcal{M}\}, \qquad \mathcal{M}_\ell^{\text{in}} := \{(i, \ell) \mid \ell = o;\ (i, o) \in \mathcal{M}\}. \tag{6}$$
This gives a single-hop receptive field for every $v \in \mathcal{V}$.
This locally interdependent structure (Yi et al. 2024) can be interpreted
as modelling communication channels between $v$ and its adjacent
neighbour intersections.9 We use this to define a state transition prior
$$\bar{\rho}_\ell = \sum_{(i, o) \in \mathcal{M}_\ell^{\text{in}}} \rho(i_0) - \sum_{(i, o) \in \mathcal{M}_\ell^{\text{out}}} \rho(o_0), \tag{7}$$
where $i_0$ and $o_0$ are the closest segments to the intersection's
centre. If $\bar{\rho}_\ell < 0$, more vehicles are going to leave $\ell$.
Lane Coordinate Frames To break the permutation in-
variance of the segment set, we define the centre of the in-
tersection as a reference point and induce a positional en-
coding on the segments relative to that point. We use a
one-dimensional sinusoidal positional encoding (Vaswani
et al. 2017) along segments on each lane and over lanes.
Instead of additive fusion (Vaswani et al. 2017), we concate-
nate the positional information with the density of the seg-
ment. This preserves both identities, which improves expres-
siveness without the need of separate processing (Yu et al.
2023). Thus, we define the feature vector of a segment $s$ by
$h_s = [\rho(s) \,\|\, \mathrm{pe}(s) \,\|\, \bar{\rho}_\ell] \in \mathbb{R}^d$,
where $\|$ denotes concatenation.
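The concatenation of density, positional encoding and transition prior can be sketched as below. The sketch uses the standard Transformer sinusoidal encoding over the segment index only; encoding over lanes as well, the encoding dimension and the function names are left out or assumed for illustration.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    enc = []
    for k in range(dim):
        freq = 1.0 / (10000 ** ((2 * (k // 2)) / dim))
        angle = position * freq
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


def segment_features(density, segment_index, prior, pe_dim=4):
    """Concatenate [rho(s) || pe(s) || transition prior] into one vector."""
    return [density] + sinusoidal_encoding(segment_index, pe_dim) + [prior]


# Segment closest to the intersection (index 0) with density 0.2 and a
# transition prior of -0.1 for its lane.
h_s = segment_features(0.2, 0, -0.1, pe_dim=4)
```

Concatenating rather than adding keeps the density value and the position readable as separate coordinates of the feature vector.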
Segment-to-Movement Encoding We apply a graph attention network
(Veličković et al. 2017) to learn the mapping
$\mathbb{R}^{\frac{|\ell|}{ds} \times d} \mapsto \mathbb{R}^d$. To
improve expressiveness, we use dynamic scoring (Brody, Alon, and Yahav
2022) to compute attention weights
$$\alpha_s = \frac{\exp u(h_s)}{\sum_{s' \in N_\ell} \exp u(h_{s'})} \quad \text{with} \quad u(h_s) = a_s^\top \sigma(W_s h_s), \tag{8}$$
where $a_s \in \mathbb{R}^d$ and $W_s \in \mathbb{R}^{d \times d}$ are
learnable weights. $N_\ell$ defines the segment set for lane $\ell$ and
$\sigma$ is a monotonic non-linearity, like Leaky-ReLU. We then compute a
representation for each movement $h_m \in \mathbb{R}^d$ by a weighted
average of its segments, such that
$$h_m = \sigma \Big( b_s + \underbrace{\hat{W}_s h_m}_{\text{residual}} + \sum_{s \in N_i} \alpha_s W_s h_s + \sum_{s \in N_o} \alpha_s W_s h_s \Big), \tag{9}$$
where $\hat{W}_s \in \mathbb{R}^{d \times d}$ enables learnable residual
connections and $b_s \in \mathbb{R}^d$ is the bias term. Movement nodes
do not hold information initially, hence the update is independent of the
original target node features $h_m$.10 To ensure that the neighbourhood
aggregation runs in a numerically stable manner while allowing for a high
degree of representational strength, the individual aggregation functions
are implemented as weighted sums with multi-head attention. We compute
attention over incoming and outgoing segments separately, but aggregate
and update the movement node features in parallel. After this propagation
step, the latent movement node features are used as the source for the
next level's update, as we will discuss in the following.
9 Agents are incentivized to cooperate rather than act solely in their
self-interest. This can lead to more stable equilibria where multiple
agents coordinate their strategies effectively.
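The attention-weighted aggregation of Eq. (8) and the sums in Eq. (9) can be sketched for a single lane and a single head as below. The bias, the residual term, the outer non-linearity and multi-head attention are omitted, and the weights are fixed toy values rather than learned parameters; all names are assumptions of this sketch.

```python
import math


def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return [v if v > 0 else slope * v for v in x]


def attention_pool(H, W, a):
    """Dynamic attention scores u(h) = a^T sigma(W h), softmax-normalised
    over all segments, followed by the weighted sum of projected features."""
    projected = [matvec(W, h) for h in H]
    scores = [sum(ai * zi for ai, zi in zip(a, leaky_relu(z))) for z in projected]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    pooled = [
        sum(alpha * z[j] for alpha, z in zip(alphas, projected))
        for j in range(len(a))
    ]
    return alphas, pooled


# Three segment feature vectors of one lane (toy values, d = 2).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.2], [0.1, 0.3]]
a = [1.0, -1.0]
alphas, pooled = attention_pool(H, W, a)
alphas_rev, pooled_rev = attention_pool(H[::-1], W, a)
```

The pooled vector has a fixed size regardless of how many segments a lane has, and it does not depend on the order of the segments; positional information therefore has to be carried inside the segment features themselves.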
Movement-to-Phase Encoding The obtained movement node features
$\{h_m \mid m \in \mathcal{M}_v\}$ form another heterogeneous directed
acyclic sub-graph with the phase nodes. Instead of a sparsified graph, we
use a fully-connected bipartite structure with additional edge features.
Each connection between a movement $m \in \mathcal{M}_v$ and a phase
$\phi \in \Phi_v$ holds a scalar $\gamma_{m\phi} \in \{-1, 0, 1\}$ as an
edge feature indicating whether a movement is prohibited, protected or
permitted during a phase. In the literature, often only permitted or
protected movements are considered (Zheng et al. 2019; Zang et al. 2020).
We argue that the information about prohibited movements is also
essential to determine the energy of a phase. Furthermore, phase nodes
are initialised by a binary flag $h_\phi \in \{0, 1\}$ indicating whether
the phase is currently active or not. This changes Eq. (8) and Eq. (9) to
$$u(h_{m\phi}) = a_m^\top \sigma(W_m h_m + W_\phi h_\phi + W_\gamma \gamma_{m\phi}) \tag{10}$$
and
$$h_\phi = \sigma \Big( b_m + \tilde{W}_m h_m + \sum_{m \in \mathcal{M}_v} \alpha_{m\phi} W_m h_m \Big),$$
where $a_m, b_m \in \mathbb{R}^d$ and
$W_m, \tilde{W}_m \in \mathbb{R}^{d \times d}$ are learnable weights.
Attention weights are computed as in Eq. (8) but normalised over
$\mathcal{M}_v$ instead. Contrary to Veličković et al. (2017), we embed
node and edge features separately, which reduces the model complexity
while still preserving expressiveness. Furthermore, we use the edge
features and the initial phase flag only to compute the attention scores.
Hence, the model can use $\gamma$ to weight movement features during
aggregation, but they do not otherwise interfere with the movement
information. As we use a directed acyclic graph, we do not face the
identity issue discussed in general edge-based graph attention (Wang,
Chen, and Chen 2021). This form of aggregation also preserves permutation
invariance. In contrast to the level before, this is an important
property for the encoding of phases, as they should be orientation
independent (Zheng et al. 2019). The obtained phase node representations
are further leveraged in an intra-level propagation phase, as discussed
next.
Intra-Level Phase Propagation We model the connection between phases as a
fully-connected homogeneous graph with Jaccard coefficients
$J_{\phi\phi'} \in \mathbb{R}^+$ between each phase pair
$\phi, \phi' \in \Phi_v$. The Jaccard coefficient encodes the
intersection over the union of the green signals between the two phases.
This structures the phase space by quantifying the relative differences
between phases w.r.t. their “green” portions. This results in the
following intra-level update formulation
$$u(h_{\phi\phi'}) = a_\phi^\top \sigma(W_\phi h_\phi + W_\phi h_{\phi'} + W_J J_{\phi\phi'}) \tag{11}$$
and
$$h_\phi \leftarrow \sigma \Big( b_\phi + \tilde{W}_\phi h_\phi + \sum_{\phi' \in \Phi_v} \alpha_{\phi\phi'} W_\phi h_{\phi'} \Big),$$
where $a_\phi, b_\phi \in \mathbb{R}^d$ and
$W_\phi, \tilde{W}_\phi \in \mathbb{R}^{d \times d}$ are learnable
weights. Again, attention weights are computed as in Eq. (8) but
normalised over $\Phi_v$ instead. After propagation, each node holds
weighted information about all other phases, which renders a single layer
sufficient.
10 $h_m$ is initialised with zeros, neutralising its impact in Eq. (9).
[Figure 4 panels: cologne3, ingolstadt7, synthetic arterial scenario.]
Figure 4: Test performances (moving averages) on Cologne8 over 3600
simulated time steps.
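The Jaccard coefficient between two phases can be sketched as the intersection over the union of their green movements. Representing a phase by the set of movement ids that receive green, and the ids themselves, are assumptions of this sketch.

```python
def jaccard(green_a, green_b):
    """Intersection over union of the green movements of two phases."""
    a, b = set(green_a), set(green_b)
    union = a | b
    # Two phases without any green movement are treated as unrelated.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical phases described by the movements that are green.
phase_ns_through = {"N->S", "S->N"}
phase_n_all = {"N->S", "N->E"}
phase_ew_through = {"E->W", "W->E"}

j_overlap = jaccard(phase_ns_through, phase_n_all)
j_disjoint = jaccard(phase_ns_through, phase_ew_through)
j_same = jaccard(phase_ns_through, phase_ns_through)
```

The coefficient lies in [0, 1] and is symmetric in its arguments, so a single value per unordered phase pair is enough to fill the edge features of the fully-connected phase graph.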
Weight-Sharing Our universal state encoding function al-
lows using the model for each intersection. In this way, our
model can be applied to any road-network size. Moreover,
by sharing parameters among agents, the algorithms are es-
sentially encouraged to converge to a region in parameter
space that works well for arbitrary intersections and traffic
conditions, thereby promoting generalization.
5 Domain-Randomised Training
Domain Randomization (DR) is a powerful technique for
bridging the sim-to-real gap (Tobin et al. 2017). By intro-
ducing sufficient variability in the simulated source domain
during training, DR enables the agent to generalize its pol-
icy to the target domain, treating it as another variant within
its learned distribution. The core principle of DR involves configuring
the environment based on a randomly sampled configuration
$\xi \in \Xi$, where $\Xi$ represents the space of possible domain
parameters. $\Xi$ contains all traffic-networks under some degree of
freedom, as well as different forms of traffic dynamics. The agent's
objective is to find an optimal policy $\pi^*$ that maximizes the
expected return across all possible environmental configurations, i.e.,
extending Eq. (1)
$$\pi^* = \operatorname*{argmax}_{\pi} \; \mathbb{E}_{\xi} \, \mathbb{E}_{\mathcal{C}^t} \, \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R^{t+k} \mid \mathcal{C}^t, \xi \right], \tag{12}$$
where $\mathcal{C}^t \sim \mu(\mathcal{C}^t \mid \pi; \xi)$ denotes the
stationary distribution of the Markov chain under configuration $\xi$ and
policy $\pi$. We sample the static environmental characteristics (like
the number of intersections and lane lengths) from a uniform distribution
a priori. For the dynamics, we use traffic flow modelling to define each
flow $f \in \mathcal{F}$ by its route, vehicle count, and departure
times. To enhance realism and variability, we model departure times using
a beta distribution with flow-specific parameters:
$$T_f = \{ t_{\max} \cdot b_k \mid b_k \sim \mathrm{Beta}(\alpha_f, \beta_f),\ 1 \leq k \leq C_f \}, \tag{13}$$
where $\alpha_f, \beta_f$ are sampled from a uniform distribution.11 In
the literature, a Poisson process with a constant rate of
$\frac{C_f}{t_{\max}}$ vehicles per second with $t \in [0, t_{\max}]$ is
often used instead. However, constant departure rates are often not
realistic in practice (e.g. during rush hours). The beta-distributed
departure times allow for diverse departure patterns, including peaks and
fluctuations, while still encompassing the possibility of constant
departure rates.
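Eq. (13) can be sketched with the standard library alone. The sampling ranges for the Beta parameters and the function name are assumptions of this sketch; the paper only states that the parameters are drawn from a uniform distribution.

```python
import random


def sample_departures(n_vehicles, t_max, alpha, beta, rng=None):
    """Departure times t_max * b_k with b_k ~ Beta(alpha, beta), sorted."""
    rng = rng or random.Random()
    return sorted(t_max * rng.betavariate(alpha, beta) for _ in range(n_vehicles))


rng = random.Random(0)
# Flow-specific shape parameters; the range [1, 5] is a hypothetical choice.
alpha_f = rng.uniform(1.0, 5.0)
beta_f = rng.uniform(1.0, 5.0)
departures = sample_departures(50, t_max=3600.0, alpha=alpha_f, beta=beta_f, rng=rng)
```

With alpha = beta = 1 the Beta distribution reduces to the uniform distribution on [0, 1], which corresponds to departures spread evenly over the episode, while unequal parameters shift the peak towards the start or the end.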
6 Experiments
The primary objective of our experiments is to show the abil-
ity of TransferLight to transfer its control policy to novel
scenarios without requiring any kind of re-training or fine-
tuning. In all experiments, TransferLight is trained on ran-
domly generated road-networks with random traffic dynam-
ics and tested on a yet unseen benchmark. This allows us
to quantify the generalisability of our method explicitly.
In Section 6.1, we analysed various performance measures
on multiple benchmarks (test scenarios) with several well-known
baselines. As arterial scenarios are of specific interest
for the community (Wei et al. 2019a), we conducted a detailed
investigation of our model's generalisability on such
scenario types (see Section 6.2). The software specifications
of our implementations can be found in our open-sourced
code.
Exchangeable Policy Heads The learnable hierarchical state encoding
(Section 4) maps states to action (phase) energies. The policy control
function maps from this action energy space to action probabilities. This
results in maximum flexibility when it comes to the policy function. In
this work, we chose a Double DQN (Hasselt, Guez, and Silver 2016) and an
A2C (Peng et al. 2018) as policy heads, but any other can be used
instead.
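The final mapping from per-phase energies to a decision can be sketched as follows. This only illustrates two generic read-outs (a softmax distribution, as a stochastic actor would use, and a greedy choice, as a value-based head would use); it is not the paper's Double DQN or A2C implementation, and all names are assumptions.

```python
import math


def softmax_policy(phase_energies):
    """Turn one energy per phase into action probabilities."""
    top = max(phase_energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - top) for e in phase_energies]
    norm = sum(exps)
    return [e / norm for e in exps]


def greedy_phase(phase_energies):
    """Index of the phase with the highest energy."""
    return max(range(len(phase_energies)), key=lambda k: phase_energies[k])


# One energy per phase; the number of phases may differ per intersection.
energies = [0.2, 1.5, -0.3]
probs = softmax_policy(energies)
best = greedy_phase(energies)
```

Both read-outs work for any number of phases, which is what allows the same head to be reused across intersections with different phase sets.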
Table 1: Average number of standing vehicles (↓, lower is better) over
3600 simulated time steps (TL = TransferLight).
Method        Cologne1         Ingolstadt1
Random        40.90 ± 21.77    8.41 ± 6.34
FixedTime     14.58 ± 8.37     7.04 ± 6.95
MaxPressure    8.00 ± 5.22     1.88 ± 1.38
TL-DQN         6.70 ± 5.72     1.93 ± 2.08
TL-A2C         7.21 ± 7.04     2.30 ± 3.30
6.1 Generalising across Different Scales
A general traffic signal control policy should be able to
generalise from single intersections to more complex road
networks. We demonstrate this ability by conducting experiments
on either end. Table 1 compares our models to different
baselines on two single-intersection benchmarks. We analysed
the number of standing vehicles, as this measure seems the most
reasonable for a single intersection. We found that both
TransferLight variants outperform all baselines on Cologne1 and
perform roughly on par with MaxPressure, the best baseline, on
Ingolstadt1.
To analyse how our policy scales to more complex road networks,
we conducted an experiment on Cologne8, comprising 8 signalised
intersections. We measured multiple popular traffic performance
indicators during testing. We found that both TransferLight
variants outperformed all heuristic and trained baselines. Note
that CoLight (Wei et al. 2019b) and SOTL (Reztsov 2014) are
explicitly trained on Cologne8, whereas TransferLight
generalises from random road-networks. Both trained baselines
failed to control a subset of intersections, leading to early
congestion and hence worse performance. The results in Fig. 4
underline the ability of TransferLight to generalise to more
complex scenarios as well. In the appendix, we raise the
problem complexity even further to identify TransferLight's
generalisation limits.
6.2 Arterial Signal Progression
A special type of coordination is signal progression, which
attempts to coordinate the onset of green times of successive
intersections along an arterial street in order to move road
users through the major roadway as efficiently as possible
(Wang, Abdulhai, and Sanner 2023). Intuitively, the hope
here is to create a green wave in which green times are cas-
caded so that a large group of vehicles (also called a platoon)
can pass through the arterial street without stopping.
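The green-wave idea can be made concrete with the classic fixed-time offset calculation, where each intersection's green onset is delayed by the travel time from the first intersection. The sketch below is only meant to illustrate the concept; it is not how TransferLight coordinates signals (the policy is learned), and the spacing, speed, and cycle length are hypothetical values.

```python
def green_wave_offsets(positions_m, speed_mps, cycle_s=None):
    """Green-onset offsets (seconds) so a platoon at speed_mps never stops.

    positions_m: distances of the intersections along the arterial (metres).
    cycle_s: optional common cycle length; offsets are wrapped into one cycle.
    """
    origin = positions_m[0]
    offsets = [(p - origin) / speed_mps for p in positions_m]
    if cycle_s is not None:
        offsets = [o % cycle_s for o in offsets]
    return offsets


# Hypothetical arterial: 5 intersections spaced 300 m apart, platoon at 15 m/s.
positions = [0.0, 300.0, 600.0, 900.0, 1200.0]

offsets = green_wave_offsets(positions, speed_mps=15.0)
# -> [0.0, 20.0, 40.0, 60.0, 80.0]

wrapped = green_wave_offsets(positions, speed_mps=15.0, cycle_s=60.0)
# -> [0.0, 20.0, 40.0, 0.0, 20.0]
```

In a spatio-temporal plot, such cascaded offsets let a vehicle trajectory stay a straight line across all intersections.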
PressLight (Wei et al. 2019a) and MaxPressure were shown to
maximise throughput and minimise travel time in arterial
environments. We compare TransferLight to these baselines,
although, unlike PressLight, it was not trained on arterial
scenarios. Here, the state transition prior is essential, as it
provides the geometric information needed to make proactive
decisions. Figure 6 shows the spatio-temporal signal
progression plots, where each gray line represents the
trajectory of a single vehicle. In the optimal case, vehicle
trajectories form straight lines (i.e., they keep a constant
velocity). We found that the zero-shot performance of our model
can keep up with the performance of the baselines. In Fig. 5,
we extended the experiment to a real-world scenario. We found
that TransferLight achieved the lowest travel time among the
competitors, including MPLight (Chen et al. 2020) and
PressLight. Our model learns a more robust and general policy
from training with domain randomization (DR), enhancing its
effectiveness in real-world environments characterized by
greater variability.
[Figure panels: cologne3, ingolstadt7, synthetic arterial scenario]
Figure 5: Average travel time on Cologne8 over 3600 simulated
time steps.
Figure 6: Signal progression comparison on a synthetic
5-intersection arterial scenario. PressLight (Wei et al. 2019a)
was explicitly designed for and trained on this specific
scenario, whereas TransferLight generalises from random
non-arterial road-networks.
Footnote 11: We sample a destination and target line segment
and use the Dijkstra algorithm (Dijkstra 1959) to estimate the
shortest path. We use $\alpha_f, \beta_f \sim \mathrm{Unif}(1, 10)$
in our experiments. The number of vehicles following the flow
is sampled from a pool of $C$ available vehicles in the
simulation.
7 Conclusion
We presented a novel framework designed for robust
generalization across road-networks, diverse traffic conditions
and intersection geometries. Our method can scale to any
road-network through a decentralized multi-agent approach with
global rewards and state transition priors to ensure proactive
decisions. We used a hierarchical, heterogeneous, and directed
graph neural network to encode any intersection geometry, which
we train using a novel log-distance reward function.
Generalization is further fostered by domain randomization
during training. This is particularly valuable for real-world
applications where traffic conditions can vary significantly
due to events, road closures, or long-term changes in urban
mobility patterns.
Limitations and Future Work Our method already shows striking
generalisation capabilities, which, however, need further
improvement to cope with even larger road networks. In future
work, we aim to extend the concept of symmetry breaking to the
intersections' geometries. Mapping intersections to canonical
forms, as in Jiang et al. (2024), collapses the state space to
an exponentially smaller subspace. These canonical forms can be
obtained from equivariant encodings (van der Pol et al. 2022)
using canonicalisation priors (Kaba et al. 2023; Mondal et al.
2023) or by search (Schmidt and Stober 2024). We expect this to
drastically improve the sample efficiency of our model and to
make domain randomisation unnecessary.
8 Acknowledgments
We would like to thank the Thorsis Innovation GmbH and the
Galileo Test-Track team for their valuable support throughout
this work. Furthermore, the authors acknowledge the financial
support by the Federal Ministry of Education and Research of
Germany (BMBF) within the funding framework of the project
PASCAL.
References
Brody, S.; Alon, U.; and Yahav, E. 2022. How Attentive are
Graph Attention Networks? In International Conference on
Learning Representations (ICLR).
Chen, C.; Wei, H.; Xu, N.; Zheng, G.; Yang, M.; Xiong, Y.;
Xu, K.; and Li, Z. 2020. Toward A Thousand Lights:
Decentralized Deep Reinforcement Learning for Large-Scale
Traffic Signal Control. In AAAI Conference on Artificial
Intelligence.
Choudhury, S.; Gupta, J. K.; Morales, P.; and Kochender-
fer, M. J. 2021. Scalable Anytime Planning for Multi-Agent
MDPs. In International Conference on Autonomous Agents
and Multi-Agent Systems (AAMAS).
Chu, T.; Wang, J.; Codecà, L.; and Li, Z. 2019. Multi-Agent
Deep Reinforcement Learning for Large-Scale Traffic Signal
Control. IEEE Transactions on Intelligent Transportation
Systems.
Devailly, F.-X.; Larocque, D.; and Charlin, L. 2022. IG-RL:
Inductive Graph Reinforcement Learning for Massive-Scale
Traffic Signal Control. Trans. Intell. Transport. Sys.
Devailly, F.-X.; Larocque, D.; and Charlin, L. 2024. Model-
Based Graph Reinforcement Learning for Inductive Traffic
Signal Control. IEEE Open Journal of Intelligent Trans-
portation Systems.
Dijkstra, E. W. 1959. A note on two problems in connexion
with graphs. Numerische mathematik.
Hasselt, H. v.; Guez, A.; and Silver, D. 2016. Deep rein-
forcement learning with double Q-Learning. In AAAI Con-
ference on Artificial Intelligence.
Jiang, H.; Li, Z.; Li, Z.; Bai, L.; Mao, H.; Ketter, W.; and
Zhao, R. 2024. A General Scenario-Agnostic Reinforcement
Learning for Traffic Signal Control. Trans. Intell. Transport.
Sys.
Jiang, Y.; Kolter, J. Z.; and Raileanu, R. 2024. On the im-
portance of exploration for generalization in reinforcement
learning. In International Conference on Neural Informa-
tion Processing Systems (NeurIPS).
Kaba, S.-O.; Mondal, A. K.; Zhang, Y.; Bengio, Y.; and Ra-
vanbakhsh, S. 2023. Equivariance with Learned Canonical-
ization Functions. In International Conference on Machine
Learning (ICML).
Kingma, D. P.; and Ba, J. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Clas-
sification with Graph Convolutional Networks. In Interna-
tional Conference on Learning Representations (ICLR).
Korecki, M.; Dailisan, D.; and Helbing, D. 2023. How Well
Do Reinforcement Learning Approaches Cope With Disrup-
tions? The Case of Traffic Signal Control. IEEE Access.
Little, J. D. C.; Kelson, M. D.; and Gartner, N. H. 1981.
MAXBAND : a versatile program for setting signals on ar-
teries and triangular networks. Working papers.
Littman, M. L. 1994. Markov games as a framework for
multi-agent reinforcement learning. In Cohen, W. W.; and
Hirsh, H., eds., Machine Learning Proceedings.
Lopez, P. A.; Behrisch, M.; Bieker-Walz, L.; Erdmann, J.;
Flötteröd, Y.-P.; Hilbrich, R.; Lücken, L.; Rummel, J.;
Wagner, P.; and Wießner, E. 2018. Microscopic Traffic
Simulation using SUMO. In IEEE International Conference on
Intelligent Transportation Systems.
Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight De-
cay Regularization. In International Conference on Learn-
ing Representations (ICLR).
Mei, H.; Lei, X.; Da, L.; Shi, B.; and Wei, H. 2023. Lib-
signal: an open library for traffic signal control. Machine
Learning.
Mondal, A. K.; Panigrahi, S. S.; Kaba, S.-O.; Rajeswar, S.;
and Ravanbakhsh, S. 2023. Equivariant Adaptation of Large
Pretrained Models. In Conference on Neural Information
Processing Systems (NeurIPS).
Oroojlooy, A.; Nazari, M.; Hajinezhad, D.; and Silva, J.
2020. AttendLight: universal attention-based reinforce-
ment learning model for traffic signal control. In Interna-
tional Conference on Neural Information Processing Sys-
tems (NeurIPS).
Peng, B.; Li, X.; Gao, J.; Liu, J.; Chen, Y.-N.; and Wong,
K.-F. 2018. Adversarial Advantage Actor-Critic Model for
Task-Completion Dialogue Policy Learning. In 2018 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP).
Reztsov, A. 2014. Self-Organising Traffic Lights (SOTL) as
an Upper Bound Estimate. SSRN Electronic Journal, 24.
Roess, R.; Prassas, E.; and McShane, W. 2004. Traffic engi-
neering. Prentice Hall.
Schmidt, J.; Köhler, B.; and Borstell, H. 2024. Reviving
Simulated Annealing: Lifting its Degeneracies for Real-Time
Job Scheduling. In Hawaii International Conference on System
Sciences (HICSS).
Schmidt, J.; and Stober, S. 2024. Tilt your Head: Activat-
ing the Hidden Spatial-Invariance of Classifiers. In Interna-
tional Conference on Machine Learning (ICML).
Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.;
and Abbeel, P. 2017. Domain randomization for transfer-
ring deep neural networks from simulation to the real world.
In IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS).
Urbanik, T.; Tanaka, A.; Lozner, B.; Lindstrom, E.; Lee,
K.; Quayle, S.; Beaird, S.; Tsoi, S.; Ryus, P.; Gettman, D.;
Sunkari, S.; Balke, K.; and Bullock, D. 2015. NCHRP Re-
port 812: A Guide for Applying Context-Sensitive Solutions
for Signalized Intersections.
van der Pol, E.; van Hoof, H.; Oliehoek, F. A.; and Welling,
M. 2022. Multi-Agent MDP Homomorphic Networks.
In International Conference on Learning Representations
(ICLR).
Varaiya, P. 2013. Max pressure control of a network of
signalized intersections. Transportation Research Part C:
Emerging Technologies.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is all you need. In International Conference on Neu-
ral Information Processing Systems (NeurIPS).
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò,
P.; and Bengio, Y. 2017. Graph Attention Networks. In
International Conference on Learning Representations (ICLR).
Wang, M.; Xiong, X.; Kan, Y.; Xu, C.; and Pun, M.-O. 2024.
UniTSA: A Universal Reinforcement Learning Framework
for V2X Traffic Signal Control. IEEE Transactions on Ve-
hicular Technology.
Wang, X.; Abdulhai, B.; and Sanner, S. 2023. A Critical Re-
view of Traffic Signal Control and a Novel Unified View of
Reinforcement Learning and Model Predictive Control Ap-
proaches for Adaptive Traffic Signal Control. In Handbook
on Artificial Intelligence and Transport.
Wang, Z.; Chen, J.; and Chen, H. 2021. EGAT: Edge-
Featured Graph Attention Network. In Artificial Neural Net-
works and Machine Learning (ICANN).
Webster, F. 1958. Traffic Signal Settings. Road research
technical paper.
Wei, H.; Chen, C.; Zheng, G.; Wu, K.; Gayah, V.; Xu,
K.; and Li, Z. 2019a. PressLight: Learning Max Pressure
Control to Coordinate Traffic Signals in Arterial Network.
In ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining.
Wei, H.; Xu, N.; Zhang, H.; Zheng, G.; Zang, X.; Chen, C.;
Zhang, W.; Zhu, Y.; Xu, K.; and Li, Z. 2019b. CoLight:
Learning Network-level Cooperation for Traffic Signal Con-
trol. In ACM International Conference on Information and
Knowledge Management (CIKM).
Wei, H.; Zheng, G.; Gayah, V.; and Li, Z. 2021. Recent Ad-
vances in Reinforcement Learning for Traffic Signal Con-
trol: A Survey of Models and Evaluation. SIGKDD Explor.
Newsl.
Wei, H.; Zheng, G.; Gayah, V. V.; and Li, Z. J. 2019c.
A Survey on Traffic Signal Control Methods. ArXiv,
abs/1904.08117.
Wei, H.; Zheng, G.; Yao, H.; and Li, Z. 2018. IntelliLight:
A Reinforcement Learning Approach for Intelligent Traffic
Light Control. In ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining.
Wu, C.; Kim, I.; and Ma, Z. 2023. Deep Reinforce-
ment Learning Based Traffic Signal Control: A Comparative
Analysis. Procedia Computer Science. International Con-
ference on Ambient Systems, Networks and Technologies
Networks (ANT) and International Conference on Emerg-
ing Data and Industry 4.0 (EDI40).
Yi, Y.; Li, G.; Wang, Y.; and Lu, Z. 2024. Learning to
share in multi-agent reinforcement learning. In Interna-
tional Conference on Neural Information Processing Sys-
tems (NeurIPS).
Yoon, J.; Ahn, K.; Park, J.; and Yeo, H. 2021. Transferable
traffic signal control: Reinforcement learning with graph
centric state representation. Transportation Research Part
C: Emerging Technologies.
Yu, R.; Wang, Z.; Wang, Y.; Li, K.; Liu, C.; Duan, H.; Ji, X.;
and Chen, J. 2023. LaPE: Layer-adaptive Position Embed-
ding for Vision Transformers with Independent Layer Nor-
malization. In IEEE/CVF International Conference on Com-
puter Vision (ICCV).
Zang, X.; Yao, H.; Zheng, G.; Xu, N.; Xu, K.; and Li, Z.
2020. MetaLight: Value-based Meta-reinforcement Learn-
ing for Traffic Signal Control. In AAAI Conference on Arti-
ficial Intelligence.
Zhang, Z.; Yang, J.; and Zha, H. 2020. Integrating Indepen-
dent and Centralized Multi-agent Reinforcement Learning
for Traffic Signal Network Optimization. In International
Conference on Autonomous Agents and Multi-Agent Systems
(AAMAS).
Zheng, G.; Xiong, Y.; Zang, X.; Feng, J.; Wei, H.; Zhang, H.;
Li, Y.; Xu, K.; and Li, Z. 2019. Learning Phase Competition
for Traffic Signal Control. In ACM International Conference
on Information and Knowledge Management (CIKM).
Figure 7: Three random road-network samples (static envi-
ronments) used during training.
[Figure 8 plots: two panels over training steps (x-axis, 0 to
3000), showing average reward and average queue length (y-axes)
for A2C and DQN]
Figure 8: Average log-pressure reward (higher is better) and
average queue length (lower is better) over 3000 training steps.
A Supplementary Material
A.1 Implementation Details
A replay buffer is introduced to decorrelate the experience
tuples used for updating the parameters of the online DQN. For
A2C, tuples are instead used immediately to perform updates.
This immediacy is crucial, as estimating the policy gradient
requires experience tuples generated by the current policy. All
learnable functions are MLPs with additional intermediate
layers for layer normalization and dropout. This design aims to
enhance training stability and convergence. We used a
64-dimensional latent space, which is significantly smaller
than that of all our baselines, saving computational resources
and allowing for better scalability and faster inference. We
used 8 attention heads for all attention-based graph layers
(see Section 4). All experiments were performed on an Nvidia
A40 GPU (48 GB) node with 1 TB RAM, 2× 24-core AMD EPYC 74F3
CPUs @ 3.20 GHz, and a local SSD (NVMe). As inference is
generally extremely cheap, these resources are only required to
accelerate training. More details can be found in our
open-sourced code base.
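The role of the replay buffer for the DQN head can be sketched as follows. This is a generic, minimal buffer and not the authors' implementation: the class name, the capacity of 100, and the integer stand-ins for transitions are assumptions for illustration. The batch size of 64 matches the mini-batch size reported in the training details.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; uniform sampling decorrelates consecutive steps."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest tuples are evicted first
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(capacity=100, seed=0)  # hypothetical capacity
for step in range(200):
    # Integers stand in for real states, actions, and rewards.
    buffer.push(state=step, action=step % 4, reward=-1.0, next_state=step + 1)

batch = buffer.sample(batch_size=64)
```

An on-policy method such as A2C would skip the buffer: the tuples collected under the current policy are used for one update and then discarded, because older tuples no longer reflect the policy being optimised.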
Simulation Details We used SUMO (Simulation of Urban MObility)
(Lopez et al. 2018) in all our experiments. As in Wei et al.
(2019a), each action persists for a duration of 10 seconds
before the next action can be chosen. To ensure safety, every
transition from one phase to another involves a 3-second
yellow-change interval followed by a 2-second all-red interval
to clear the intersection.
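The timing rules above can be sketched as a small function that expands a sequence of chosen phases into a signal timeline. The durations are taken from the text; everything else is an assumption for illustration. In particular, the text does not state whether the 5 seconds of yellow and all-red are part of the 10-second action window or come in addition to it, and this sketch assumes the latter. The function and label names are hypothetical and unrelated to the SUMO API.

```python
ACTION_S = 10   # each chosen phase is held for 10 seconds
YELLOW_S = 3    # yellow-change interval on a phase switch
ALL_RED_S = 2   # all-red clearance interval after yellow


def expand_timeline(phases):
    """Turn a list of chosen phase indices into (signal, phase, seconds) steps.

    Assumption: the transition intervals are inserted before the new green
    and are not deducted from its 10-second duration.
    """
    timeline = []
    current = None
    for phase in phases:
        if current is not None and phase != current:
            timeline.append(("yellow", current, YELLOW_S))
            timeline.append(("all_red", None, ALL_RED_S))
        timeline.append(("green", phase, ACTION_S))
        current = phase
    return timeline


chosen = [0, 0, 1, 1, 2]  # hypothetical sequence of agent actions
timeline = expand_timeline(chosen)
total_seconds = sum(duration for _, _, duration in timeline)
# Five greens of 10 s plus two phase switches of 5 s each: 60 s in total.
```

Repeating the same phase simply extends the green, so only actual phase changes pay the 5-second safety cost.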
Training Details For optimisation, we use AdamW (Loshchilov and
Hutter 2017) with a learning rate of 1e-3 and otherwise default
settings. Furthermore, we used mini-batches of 64 SAR (state,
action, reward) samples. We operate within a finite horizon of
$0 \le t \le T$. We also include a convergence illustration in
Fig. 8. We found that both model versions converge within 3000
steps (performance remains constant within reasonable error
bounds afterwards). We skipped the first 100 steps to let the
traffic spawn in the simulation and develop a natural flow.
[Figure panels: cologne3, ingolstadt7]
Figure 9: Waiting time, queue length, and emission reduction
using our log-distance pressure reward (Eq. (5)) compared to
the commonly used pressure reward (Eq. (2)).
Baseline Details All heuristics (incl. Random, FixedTime,
and MaxPressure) are custom implementations. All train-
able baselines and related performance results are obtained
using LibSignal (Mei et al. 2023). Nonetheless, we used the
same routines to compute the high-level performance indi-
cators presented in our performance plots.
A.2 Further Experiments
Reward Comparison Fig. 9 compares the performance
gains through our symmetry-breaking log-distance reward.
We found that our log-distance reward improves all three
target performance indicators over the simulated test span.
These empirical results underpin our theoretical claims in
Section 3.
Limits of Generalisability The ability to generalise naturally
reaches its limits beyond a certain problem complexity. We
performed an additional experiment on ingolstadt21, comprising
21 intersections in a narrow urban environment. Figure 10
compares our method to various baselines under different
performance measures on this benchmark scenario. After around
1200 time steps, TransferLight with either head starts
diverging into a suboptimal sequence of phases. In the long
run, this leads to congestion, which in turn degrades
performance across all measures. We dedicate our future work to
preventing such situations (under reasonable traffic demands).
Figure 10: Test performance (moving averages) on Ingolstadt21 over 3600 simulated time steps. This is an example of the
limits of TransferLight's generalisation capabilities. Around step 2400, congestion builds up around a few intersections
at which some phase energies were miscalculated. Afterwards, TransferLight was not able to resolve the knot, and the
congestion spread across the network.