From Link Prediction to Forecasting: Information Loss
in Batch-based Temporal Graph Learning
Moritz Lampert
Chair of Machine Learning for Complex Networks
Center for Artificial Intelligence and Data Science (CAIDAS)
Julius-Maximilians-Universität Würzburg, DE
moritz.lampert@uni-wuerzburg.de
Christopher Blöcker
Data Analytics Group
Department of Informatics
University of Zurich, CH
Ingo Scholtes
Chair of Machine Learning for Complex Networks
Center for Artificial Intelligence and Data Science (CAIDAS)
Julius-Maximilians-Universität Würzburg, DE
Abstract
Dynamic link prediction is an important problem considered by many recent works
proposing various approaches for learning temporal edge patterns. To assess their
efficacy, models are evaluated on publicly available benchmark datasets involving
continuous-time and discrete-time temporal graphs. However, as we show in
this work, the suitability of common batch-oriented evaluation depends on the
datasets’ characteristics, which can cause two issues: First, for continuous-time
temporal graphs, fixed-size batches create time windows with different durations,
resulting in an inconsistent dynamic link prediction task. Second, for discrete-time
temporal graphs, the sequence of batches can additionally introduce temporal
dependencies that are not present in the data. In this work, we empirically show
that this common evaluation approach leads to skewed model performance and
hinders the fair comparison of methods. We mitigate this problem by reformulating
dynamic link prediction as a link forecasting task that better accounts for temporal
information present in the data. We provide implementations of our new evaluation
method for commonly used graph learning frameworks.
1 Introduction
Many scientific fields study data that can be modeled as graphs, where nodes represent entities
that are connected by edges. Examples include social [19], financial [1], biological [9] as well as
molecular networks [8]. Apart from the mere topology of interactions, i.e., who is connected to
whom, such data increasingly include information on when these interactions occur. Depending on
the temporal resolution, the resulting temporal graphs are often categorized as continuous-time or
discrete-time [21]: State-of-the-art data collection technology provides high-resolution continuous-
time temporal graphs, which capture the exact (and possibly unique) occurrence time of each
interaction. Examples include time-stamped online interactions [18] or social networks captured via
high-resolution proximity sensing technologies [37]. In contrast, discrete-time temporal graphs give
rise to a temporally ordered sequence of static snapshots, where each snapshot contains interactions
recorded within a (typically coarse-grained) time interval. Examples include scholarly collaboration
or citation graphs, which frequently include monthly or yearly snapshots.
Building on the growing importance of temporal data and the success of graph neural networks
(GNNs) for static graphs [3, 6], deep graph learning has recently been extended to temporal (or
dynamic) graphs [10]. To this end, several temporal graph neural network (TGNN) architectures
[Figure 1: two toy temporal graphs on nodes a–e with timestamps 1–6, each shown under three different chunkings. Panel labels: (a) b = 10, NMI = 0.66; b = 12, NMI = 0.64; h = 6, NMI = 0.71. (b) b = 9, NMI = 0.78; b = 10, NMI = 0.66; h = 1, NMI = 1.]
Figure 1: Illustration of the issues with a batch-based evaluation of TGNNs: (a) A continuous-time temporal graph, split into batches with sizes $b = 10$ (top) and $b = 12$ (middle), and into time windows with duration $h = 6$ (bottom). (b) A discrete-time temporal graph, split into batches with sizes $b = 9$ (left) and $b = 10$ (middle), and into time windows with duration $h = 1$ (right). Splitting temporal graphs with inhomogeneous temporal activities into batches with fixed size $b$ assigns edges in time windows of varying lengths to the same batch and edges with identical timestamps to different batches. We use normalized mutual information (NMI) between the edges' timestamps and their associated batch number (shown by colors) to quantify how much temporal information can be recovered from the sequence of batch numbers alone. In our work, we propose a time-window-oriented approach to evaluate dynamic link prediction that mitigates the information loss of current batch-based evaluation.
have been proposed that are able to simultaneously learn temporal and topological patterns. These architectures are often evaluated on dynamic link prediction, where the task is to predict the existence of edges in a future time window of length $\Delta t$, e.g., to provide recommendations to users [18].
For dynamic link prediction, TGNNs commonly utilize temporal batches to speed up training [34]. To construct these temporal batches, the sequence of temporally ordered edges is divided into a sequence of equally large chunks that contain the same number of edges. Within each batch, edges are typically treated as if they occurred simultaneously, thus discarding temporal information within a batch. For continuous-time temporal graphs, such fixed-size batches are also likely to be associated with time windows of varying lengths $\Delta t$. Changing the batch size affects the resulting window lengths and could, e.g., change the task from predicting at the minute level to the hour level, thus altering its difficulty (Figure 1a). In discrete-time temporal graphs, snapshots are typically so large that they comprise multiple batches (Figure 1b). This leads to the issue that the ordered sequence of batches does not necessarily correspond to a temporally ordered sequence. A batch-wise training of TGNNs thus effectively mixes information from the past and the future. This violates the arrow of time and questions the applicability of TGNNs in real-world prediction settings, where models do not have access to future information.
Addressing these important problems in the evaluation of temporal graph learning techniques, our
work makes the following contributions:
• We quantify the information loss due to the aggregation of edges into batches on 14 discrete- and continuous-time temporal graphs, thus showing how the dynamic link prediction task depends on the batch size.
• To mitigate this issue, we formulate the task as link forecasting, using a time-window-oriented evaluation that adequately considers the available temporal information.
• We perform an experimental evaluation of state-of-the-art TGNNs for link forecasting. Our results highlight substantial differences in model performance compared to a batch-oriented evaluation of link prediction, thus demonstrating the real-world impact of our work.
While batch-oriented processing is a technical necessity for efficient model training, our work shows
that tuning the batch size essentially tunes the link prediction task, thus fitting the task to the model
and undermining a fair comparison of temporal graph learning techniques. Proposing a time window-
oriented evaluation of dynamic link forecasting, our work provides a simple yet effective solution to
an important open issue in the evaluation of temporal graph neural networks.
2 Preliminaries and related work
Temporal graphs. A temporal (or dynamic) graph $G = (V, E)$ is a tuple where $V$ is the set of $n = |V|$ nodes and $E$ is a chronologically ordered sequence of $m = |E|$ time-stamped edges defined as $E = ((u_0, v_0, t_0), \ldots, (u_{m-1}, v_{m-1}, t_{m-1}))$ with $1 \le t_0 \le \cdots \le t_{m-1} \le t_{max}$ [27, 40, 43]. Each node $v_i$ can have static node features $h_i \in H_V$ and each edge $(u_i, v_j, t)$ can have edge features $e_{ij,t} \in H_E$ that change over time. We assume that interactions occur instantaneously with discrete timestamps $t \in \mathbb{N}$. Although timestamps $t \in \mathbb{N}$ are discrete, such temporal graphs are often referred to as continuous-time [17, 33]. In contrast, discrete-time temporal graphs coarse-grain time-stamped edges into a sequence of static snapshot graphs $\{G_{t_i:t_j}\}$, where $G_{t_i:t_j} = (V, E_{t_i:t_j})$ with $E_{t_i:t_j} = \{(u, v) \mid \exists (u, v, t) \in E : t_i \le t \le t_j\}$ [42].
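To make the coarse-graining concrete, the following minimal sketch (function name and edge representation are our own, for illustration) extracts the snapshot edge set $E_{t_i:t_j}$ from a list of time-stamped edges:

```python
# Minimal sketch: coarse-grain time-stamped edges (u, v, t) into the static
# snapshot edge set E_{t_i:t_j} defined above. Names are illustrative only.
def snapshot(edges, t_i, t_j):
    """Return the set of node pairs with at least one interaction in [t_i, t_j]."""
    return {(u, v) for (u, v, t) in edges if t_i <= t <= t_j}

edges = [(0, 1, 1), (1, 2, 2), (0, 2, 5)]
print(snapshot(edges, 1, 2))  # {(0, 1), (1, 2)}
```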
Dynamic link prediction. Given time-stamped edges up to time $t$, the goal of dynamic link prediction is to predict whether an edge $(v, u, t+1)$ exists at future time $t+1$ [43, 27, 17, 40]. In practice, it is often computationally infeasible to train and evaluate models on all possible edges one edge at a time. Thus, the chronologically ordered sequence of edges $E$ is usually divided into temporal batches $B_i^+$, where each batch has a fixed size of $b$ edges. Edges within the same batch are typically processed in parallel [34, 28], thereby losing the temporal information inside each batch. In addition to the existing (positive) edges $(u, v) \in B_i^+$, non-existing (negative) edges $(u, v) \in B_i^-$ are sampled and used for training and evaluation. This is done since real-world graphs are typically sparse and using all possible edges between all node pairs would lead to a large class imbalance and longer runtimes. With these assumptions, we can formally define the task as follows:

Definition 2.1 (Dynamic link prediction). Let $G = (V, E)$ be a temporal graph with node features $H_V$ and edge features $H_E$. Let $b$ be the batch size and $B_i^+ := \{(u, v) \mid \exists (u, v, t) \in E \text{ with } t \in \{t_{i \cdot b}, \ldots, t_{(i+1) \cdot b - 1}\}\}$ the set of $b$ edges in the $i$-th batch. We further use $B_i^-$ to denote a set of negative edges drawn using negative sampling as described in Appendix A. For a given batch $i$ we use $\hat{E}_i = \{(u, v, t) \mid (u, v, t) \in E : t < t_{i \cdot b}\}$ to denote the past edges. The goal of dynamic link prediction is to find a model $f_\theta(u, v \mid \hat{E}_i, H_V, H_{\hat{E}_i})$ with parameters $\theta$ that, for each batch $i$, predicts whether $(u, v) \in B_i^+$ or $(u, v) \in B_i^-$.
State-of-the-art TGNNs. Current state-of-the-art dynamic link prediction methods, such as
JODIE [18], DyRep [36], TGN [28], and TCL [39], utilize recurrent neural networks, graph attention,
transformers, or a combination thereof to capture the nodes’ time-evolving properties. Temporal
Graph Attention (TGAT) extends graph attention to the temporal domain and replaces positional
encodings in GAT with a vector representation of time [41]. CAWN learns temporal motifs based
on causal anonymous walks (CAW) [40]. GraphMixer takes an attention-free and transformer-free
approach, using an MLP-based link encoder, a mean-pooling-based node encoder, and an MLP-
based link classifier for predictions [5]. DyGFormer combines nodes’ historical co-occurrences as
interaction targets of the same source node with a temporal patching approach to capture long-term
histories [43]. Several further approaches for discrete-time dynamic link prediction exist, including
DyGEM [35], DySAT [29], and EvolveGCN [25]. For a recent survey of deep-learning-based
dynamic link prediction, we refer to Feng et al. [10].
Temporal graph training and evaluation. Recent works [34, 45, 44] identified issues in the
training setup for memory-based TGNNs with large batch sizes: Processing edges that belong to
the same batch in parallel ignores their temporal dependencies, resulting in varying performance
depending on the chosen batch size. This issue has been termed temporal discontinuity. Su et al. [34]
propose PRES which accounts for intra-batch temporal dependencies through a prediction-correction
scheme. Zhou et al. [44] propose a distributed framework using smaller batch sizes on multiple
trainers. However, these works focus on training, not considering temporal discontinuity in evaluation.
Recent progress in terms of TGNN evaluation includes the temporal graph benchmark (TGB) [16]
similar to the static open graph benchmark (OGB) [14]. Poursafaei et al. [27] identify problems with
random negative sampling for dynamic link prediction and propose new negative sampling techniques
dependent on time to improve the evaluation of TGNNs. Gastinger et al. [13] identify issues in the
evaluation of temporal knowledge graph forecasting. Although none of the models used for this task
overlap with regular TGNNs for dynamic link prediction, some of the problems can be related.
3 From link prediction to link forecasting
Learning temporal patterns in a batch-oriented fashion leads to issues in continuous-time and discrete-
time graphs. Below, we show that batching leads to inconsistent tasks because the time window for
prediction varies for temporal batches across different link densities in time. Temporal batches further
cause information loss by either inducing a non-existing temporal order between links or ignoring the
existing order. We demonstrate these issues in eight continuous-time and six discrete-time temporal
graphs, whose characteristics are summarized in Table 1 and Appendix B. To mitigate these issues,
we then formulate the Link Forecasting task based on fixed-length time windows.
Table 1: Characteristics of continuous- and discrete-time temporal graphs [27, 43]. For each dataset, we include the type, the number of nodes $n$, the number of edges $m$, the resolution of timestamps, the total duration $T$ of the observation, the average number of edges $|E_t|$ with the same timestamp $t$, and the temporal density $T/m$.

Dataset  Type  n  m  Resolution  T  |E_t|  T/m
Enron Contin. 184 125 235 1 second 3.6 years 5.5 ± 16.6 908.2 s
UCI Contin. 1899 59 835 1 second 193.7 days 1.0 ± 0.3 279.7 s
MOOC Contin. 7144 411 749 1 second 29.8 days 1.2 ± 0.5 6.2 s
Wiki. Contin. 9227 157 474 1 second 31.0 days 1.0 ± 0.2 17.0 s
LastFM Contin. 1980 1 293 103 1 second 4.3 years 1.0 ± 0.1 106.0 s
Myket Contin. 17 988 694 121 1 second 197.0 days 1.0 ± 0.0 24.5 s
Social Contin. 74 2 099 519 1 second 242.3 days 3.7 ± 2.5 10.0 s
Reddit Contin. 10 984 672 447 1 second 31.0 days 1.0 ± 0.1 4.0 s
UN V. Discrete 201 1 035 742 1 year 71.0 years 14 385.3 ± 7142.1 36.1 min
US L. Discrete 225 60 396 1 congress 11.0 congr. 5033.0 ± 92.4 1.8 ·10^-4 congr.
UN Tr. Discrete 255 507 497 1 year 31.0 years 15 859.3 ± 3830.8 32.1 min
Can. P. Discrete 734 74 478 1 year 13.0 years 5319.9 ± 1740.5 91.8 min
Flights Discrete 13 169 1 927 145 1 day 121.0 days 15 796.3 ± 4278.5 5.4 s
Cont. Discrete 692 2 426 279 5 minutes 28.0 days 300.9 ± 342.4 1.0 s
Different time window durations in fixed-size batches. One issue of batch-oriented temporal graph learning and dynamic link prediction is that activities in real-world temporal graphs are inhomogeneously distributed across time. In Figure 2 we show the temporal activity in terms of the number of time-stamped edges within a given time interval, both for continuous-time and discrete-time temporal graphs. For continuous-time data, we used binning in six-hour intervals. The results show that most real-world temporal graphs have highly inhomogeneous activities across time. For batch-oriented evaluation, this introduces the issue that each fixed-size batch $B_i^+$ determines a time window with duration $t_j - t_i$ that is shorter (longer) during periods with higher (lower) activity.

In Figure 3 we evaluate the dependency between batch size and window duration for empirical temporal graphs. We observe that, both in continuous- and discrete-time temporal graphs, a single batch size can create time windows with varying durations even within the same dataset. For continuous-time temporal graphs, we typically have much bigger batches than edges per timestamp, such that the time window of a batch is long (cf. Table 1). The number of edges per snapshot in discrete-time temporal graphs is generally larger than the batch size $b$ in any period, regardless of the density (Table 1). This means that edges in a batch often belong to the same snapshot, leading to small window durations.
As an example, consider the Myket dataset [20], which contains users $v$ and Android applications $u$, connected at time $t$ when user $v$ installs application $u$. The timestamps are provided in seconds and edges occur roughly every half a minute on average (cf. Table 1), making the expected time range for a batch with size $b = 2$ approximately 0.5 minutes. With $b = 2$, the task is to predict which users install what applications during this time window. Choosing $b = 120$ or $b = 2880$ turns the task into a prediction problem for approximately the next hour or day, respectively. As we can see, batching not only leads to incomparable prediction tasks between models and datasets due to the varying window duration but also acts as a kind of coarse-graining, discarding all temporal information inside each batch. We discuss this information loss and the resulting issues in the next section.
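As a back-of-the-envelope check of this example (our own arithmetic, using the temporal density $T/m \approx 24.5$ s from Table 1):

```python
# Expected time span of a fixed-size batch on Myket: with one edge every
# T/m ~ 24.5 s on average (Table 1), b edges span roughly (b - 1) * T/m seconds.
T_over_m = 24.5  # average seconds between consecutive edges (Table 1)
for b in (2, 120, 2880):
    span_min = (b - 1) * T_over_m / 60
    print(f"b = {b:4d}: ~{span_min:8.1f} minutes")
# b = 2: ~0.4 min; b = 120: ~49 min (about an hour); b = 2880: ~1176 min (about a day)
```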
[Figure 2a: histograms of edge counts per 6-hour bin over time for Enron, UCI, MOOC, Wiki., LastFM, Myket, Social, and Reddit.]
(a) Continuous-time temporal graphs, resolved at the second level and binned into 6-hour time periods.
[Figure 2b: histograms of edge counts per snapshot over time for UN V., US L., UN Tr., Can. P., Flights, and Cont.]
(b) Discrete-time temporal graphs with different time resolutions (see Table 1).
Figure 2: Real-world datasets exhibit diverse edge occurrence patterns that are visualised using the
edge density across time, i.e., histograms counting the number of edges per timestamp. Dashed lines
divide the datasets into train, validation, and test sets as used in Section 4.
Information loss in batch-based temporal graph learning. In Figure 4a we use normalized mutual information (NMI) [7] to measure the information loss caused by splitting the temporal edges $E$ into batches. NMI quantifies how much information observing one random variable conveys about another random variable. It takes values between $0$, meaning "no information", and $1$, meaning "full information". By treating the index $i$ of the batch $B_i$ assigned to each edge $(u, v) \in B_i$ as one random variable and the associated edge's timestamp $t$ as the other, we can measure the temporal information that is retained after dividing edges into batches. In this case, a value of $1$ means that we can reconstruct the timestamps of edges correctly from their batch number, and a value of $0$ means that batch numbers do not carry any information about timestamps. Consequently, small NMI values indicate a large loss of temporal information due to batching.
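A minimal sketch of this measurement, assuming scikit-learn (the exact implementation used for Figure 4 is not specified here):

```python
# Sketch: NMI between edge timestamps and batch indices, assuming scikit-learn.
# Both sequences are treated as categorical labels.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def batching_nmi(timestamps, b):
    """NMI between each edge's timestamp and its batch index for batch size b."""
    timestamps = np.asarray(timestamps)          # edges sorted chronologically
    batch_ids = np.arange(len(timestamps)) // b  # position -> batch index
    return normalized_mutual_info_score(timestamps, batch_ids)

ts = [1, 1, 2, 3, 5, 5, 6, 8]                    # toy timestamps
print(batching_nmi(ts, b=1))  # bijective mapping up to ties -> high NMI
print(batching_nmi(ts, b=4))  # coarse batches merge timestamps -> lower NMI
```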
In Figure 4a we see that in continuous-time temporal graphs, where timestamps have a high resolution, larger batches result in a larger information loss because assigning edges that occur at different times to the same batch discards their temporal ordering; the larger the batch size, the more information is lost. A batch size of $b = 1$ preserves all temporal information, i.e., maximum NMI, because we obtain a bijective mapping between timestamps and batch numbers. An exception is, e.g., the Enron dataset, where emails sent to multiple recipients are recorded as simultaneously occurring edges. This leads to a maximum NMI roughly at a batch size that equals the average number of edges per timestamp.
Figure 4b shows the batch-size-dependent NMI for discrete-time temporal graphs. Similar to the Enron dataset, the "optimal" batch size that retains the most temporal information depends on the average number of links per snapshot and, thus, on the characteristics of the data. Too small batch sizes impose an ordering on the edges within the snapshots that is not present in the data, while too large
batches stretch across snapshots and discard the temporal ordering of edges from different batches.
These results show that changing the batch size influences both the prediction time window and the temporal information available to TGNNs. Effectively, the batch size is a hidden hyperparameter that directly impacts the characteristics (and difficulty) of the prediction task. In real-world applications, however, the prediction time window is inherently connected to the problem at hand, necessitating a task formulation that is chosen carefully for each benchmark instead of for each model. To address this issue, we propose a new task formulation that utilizes a fixed prediction time window, thereby solving the first problem. Using a fixed time window rather than a fixed number of links also mitigates the second problem by providing an equal amount of temporal information in each batch, thus facilitating a fair model comparison.

(a) Continuous-time temporal graphs: The batch size $b$ determines the average time window length. However, a single batch size creates time windows with various lengths within and across datasets.
(b) Discrete-time temporal graphs: Fixed-size batches fall mostly within snapshots when the batches are much smaller than the snapshots. Depending on the dataset, larger batches can also span many snapshots.
Figure 3: Scatter plots visualising the time window durations of all batches of size $b$, for different values of $b$.
Link forecasting: task definition. The study of temporal information is at the center of time series forecasting and, therefore, we relate our task definition to a fixed temporal quantity to solve the identified problems [2]. We can interpret the temporal edges $E$ as $n^2$ Boolean time series, each of which takes the value 1 at those times when an edge occurs. Standard multivariate models output a value for each timestamp over a forecasting horizon $h$. In large-scale temporal graphs, it is computationally infeasible to forecast the existence of all $n^2$ possible links; thus, only a sample of negative edges is considered instead. In continuous-time dynamic graphs, observations are available at high resolution, e.g., seconds; however, for many practical applications, predicting at a lower granularity suffices. For example, it is typically enough to predict whether a customer purchases a certain product within the next day or week. Therefore, we consider forecasting for all timestamps in $[t+1, t+h]$ at once instead of for each of them individually, and define the link forecasting task as follows:
Definition 3.1 (Dynamic link forecasting). Let $G = (V, E)$ be a temporal graph with node features $H_V$ and edge features $H_E$. Let $h$ be the time horizon and $W_i^+ := \{(u, v) \mid \exists (u, v, t) \in E \text{ with } i \cdot h \le t < (i+1) \cdot h\}$ the set of edges in the $i$-th time window. We use $W_i^-$ to denote a sample of negative edges that do not occur in the time window $[i \cdot h, (i+1) \cdot h)$. We further use $\hat{E}_i = \{(u, v, t) \mid (u, v, t) \in E : t < i \cdot h\}$ to denote the set of past edges for time window $i$. The goal of dynamic link forecasting is to find a model $f_\theta(u, v \mid \hat{E}_i, H_V, H_{\hat{E}_i})$ with parameters $\theta$ that, for each time window $i$, forecasts whether $(u, v) \in W_i^+$ or $(u, v) \in W_i^-$.
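A minimal sketch of the window construction in Definition 3.1 (helper names are our own, for illustration):

```python
# Minimal sketch: group edges (u, v, t) into fixed-duration windows W_i with
# i = t // h, following Definition 3.1. Helper names are illustrative.
from collections import defaultdict

def time_windows(edges, h):
    """Every window spans exactly h time units; edge counts per window vary."""
    windows = defaultdict(set)
    for (u, v, t) in edges:
        windows[t // h].add((u, v))
    return {i: windows[i] for i in sorted(windows)}

edges = [(0, 1, 1), (1, 2, 2), (0, 2, 7), (2, 3, 8), (0, 3, 14)]
for i, w in time_windows(edges, h=6).items():
    print(f"window {i}: {sorted(w)}")
```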
[Figure 4a: NMI (y-axis) vs. batch size (x-axis, log scale from 10^1 to 10^3) for the continuous-time datasets Enron, UCI, MOOC, Wiki., LastFM, Myket, Social, and Reddit.]
(a) For continuous-time temporal graphs, the temporal ordering of edges within batches is discarded. With increasing batch size, more edges with different timestamps are assigned to the same time window, thus losing more information.
[Figure 4b: NMI (y-axis) vs. batch size (x-axis, log scale from 10^0 to 10^4) for the discrete-time datasets UN V., US L., UN Tr., Can. P., Flights, and Cont.]
(b) Small batches for discrete-time temporal graphs implicitly define an edge ordering within snapshots that is not present in the data, consequently losing the information that edges of the same snapshot occur at the same time. The NMI has its maximum near the average snapshot size (refer to Table 1), after which the values decrease again, similar to the continuous-time datasets.
Figure 4: Temporal information loss in terms of Normalized Mutual Information (NMI, y-axis) for
different batch sizes (x-axis), where smaller NMI scores indicate more information loss.
Crucially, this definition makes the evaluation independent of the batch size $b$ and instead introduces a time horizon $h$ that defines the forecasting time window. Note that we define the past edges $\hat{E}$ as a temporal graph to keep temporal information, e.g., to enable sampling of recent neighbors [28]. The forecasting time window, on the other hand, is a snapshot that discards temporal information inside each window, since we forecast whether a link exists at any of the timestamps $t \in [t+1, t+h]$. For parallel processing, batches can be defined either as $E_{t+1:t+h} \cup E^-_{t+1:t+h}$, in which case they can have different sizes, or as a subset $B \subseteq E_{t+1:t+h} \cup E^-_{t+1:t+h}$, as long as other time-dependent parts of the optimization, e.g., negative sampling or updating a memory, are based on $E_{t+1:t+h}$.
Implementation. We provide implementations of our evaluation procedure for commonly used PyTorch [26] libraries to simplify the adoption of our approach. Specifically, we implement a new DataLoader called SnapshotLoader¹ that replaces the widely used TemporalDataLoader in PyTorch Geometric [11]. We extend DyGLib [43] with a command line argument horizon that can be used in the evaluation pipeline.² The latter was used for the experiments in this work and can be used to reproduce our results.

¹ https://github.com/M-Lampert/pytorch_geometric/blob/snapshot-loader/torch_geometric/loader/snapshot_loader.py
² https://github.com/M-Lampert/DyGLib
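For illustration, the sketch below contrasts the two chunking schemes in PyTorch Geometric. TemporalData and TemporalDataLoader are existing PyG classes; the SnapshotLoader call is shown commented out because its exact signature is defined in the linked fork, and the horizon keyword used here is our assumption for illustration.

```python
# Sketch: fixed-size batches vs. fixed-duration windows in PyTorch Geometric.
# TemporalData and TemporalDataLoader are existing PyG classes; the
# SnapshotLoader lines are commented out, and its `horizon` keyword is an
# assumption -- see the linked repository for the actual interface.
import torch
from torch_geometric.data import TemporalData
from torch_geometric.loader import TemporalDataLoader

data = TemporalData(
    src=torch.tensor([0, 1, 0, 2]),
    dst=torch.tensor([1, 2, 2, 3]),
    t=torch.tensor([1, 2, 7, 8]),
)

# Batch-based evaluation: every chunk has b = 2 edges but a data-dependent span.
for batch in TemporalDataLoader(data, batch_size=2):
    print("batch spans t =", batch.t.min().item(), "to", batch.t.max().item())

# Window-based evaluation (sketch): every chunk spans h = 6 time units.
# from torch_geometric.loader import SnapshotLoader  # from the authors' fork
# for window in SnapshotLoader(data, horizon=6):
#     ...
```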
4 Link prediction vs. forecasting in state-of-the-art TGNNs
We now experimentally evaluate nine state-of-the-art models [18, 36, 41, 28, 40, 27, 39, 5, 43], assessing their performance both on the (usual) dynamic link prediction task and on the proposed dynamic link forecasting task. We use the implementations and model configurations provided by DyGLib [43] (cf. Appendix C) and repeat each experiment five times to obtain averages.
Table 2: Average links per window $|E_{t_i:t_j}|$ and standard deviation for the horizon $h$ used in the evaluation (left). We chose $h$ for each dataset to obtain $|E_{t_i:t_j}| \approx 200$. The average time window duration per batch is given in hours and in seconds (center). The NMI uses the window and batch IDs of the test set as random variables, quantifying how much the chosen chunks differ between the two approaches (right).

Dataset  h  |E_{t_i:t_j}|  b  Avg Duration (h)  Avg Duration (s)  NMI
Enron  172 800 s (48 h)  214.1 ± 274.1  200  50.1 ± 165.38  180 395.2 ± 595 358.23  0.80
UCI  57 600 s (16 h)  208.5 ± 335.5  200  15.4 ± 31.67  55 542.5 ± 114 021.12  0.83
MOOC  1200 s (1/3 h)  199.3 ± 167.2  200  0.3 ± 0.72  1242.8 ± 2597.53  0.88
Wikipedia  3600 s (1 h)  211.7 ± 56.3  200  0.9 ± 0.26  3382.2 ± 921.97  0.89
LastFM  21 600 s (6 h)  204.4 ± 120.2  200  5.9 ± 5.37  21 099.5 ± 19 331.76  0.91
Myket  5400 s (3/2 h)  220.1 ± 133.6  200  1.4 ± 1.18  4879.2 ± 4236.54  0.91
Social Evo.  1800 s (1/2 h)  186.1 ± 165.3  200  0.6 ± 1.26  1984.1 ± 4533.72  0.91
Reddit  900 s (1/4 h)  226.0 ± 54.5  200  0.2 ± 0.06  792.5 ± 199.35  0.92
We use historical negative sampling [27] and train each model using batch-based training and validation with batch size $b = 200$, which was found "to be a good trade-off between speed and update granularity" [28]
and adopted in similar works [43, 27]. Afterwards, we evaluate each model trained with batch-based
training using our proposed time-window-based evaluation method as well as the common batch-based evaluation approach with $b = 200$. We choose the forecasting horizon $h$ such that we obtain average window sizes of approximately 200 edges for all continuous-time datasets, making the results of our new evaluation method comparable to those of the batch-oriented approach (Table 2). For discrete-time temporal graphs, we set $h = 1$ to obtain one time window per snapshot, predicting for time intervals ranging from five minutes to a year, depending on the dataset (Table 1).
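The underlying arithmetic for choosing $h$ can be sketched as follows (our own helper; the paper's chosen values, rounded to natural units, are listed in Table 2):

```python
# Sketch: choose a horizon h so that windows contain ~200 edges on average,
# by scaling the dataset's temporal density T/m by the target edge count.
def horizon_for_target(T_seconds, m_edges, target=200):
    return target * T_seconds / m_edges  # expected seconds per window

# e.g., Wikipedia: T = 31 days, m = 157474 edges (Table 1)
print(horizon_for_target(31 * 24 * 3600, 157_474))  # ~3402 s, rounded to 3600 s (1 h)
```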
The results are presented in Table 3 for continuous-time and in Table 4 for discrete-time temporal graphs.
The tables show AUC-ROC scores for time window-based link forecasting and the relative change
compared to the batch-based evaluation commonly used for dynamic link prediction. For continuous-
time temporal graphs, the change in performance between our window-based and the batch-based
approach largely depends on the dataset: Datasets with a similar window duration for all fixed-size batches (quantified by high NMI scores in Table 2), such as Wikipedia, Reddit, or Myket, exhibit only small performance differences. This is expected since we chose the horizon $h$ to produce windows of the same average size as the fixed-size batches. Nevertheless, the time windows in datasets with inhomogeneously distributed temporal activity, such as Enron or UCI, do not fit the fixed-size batches well (lower NMI in Table 2), showing substantial performance changes across
models. This highlights that the performance scores of batch-based evaluation are skewed and may
not reflect the models’ performance in a real-world setting on inhomogeneous temporal datasets.
Table 3: Test AUC-ROC scores for link forecasting and the relative change compared to link prediction for continuous-time graphs on the same trained models (standard deviations in Appendix D). We compute the AUC-ROC score per time window and average by weighing each time window equally, regardless of the number of edges (Appendix E discusses additional weighting schemes). The last row/column provides the mean $\mu$ and standard deviation $\sigma$ of the absolute relative change per column/row.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
Enron 84.0(8.6%) 80.3(9.2%) 67.9(0.2%) 69.0(17.6%) 75.7(13.9%) 82.7(3.6%) 75.1(11.1%) 88.6(9.0%) 84.5(10.6%) 9.3%±5.1%
UCI 86.8(4.2%) 60.2(17.1%) 62.1(1.5%) 55.2(7.4%) 56.5(3.0%) 72.5(4.9%) 56.3(6.2%) 80.2(0.5%) 75.7(0.6%) 5.0%±5.1%
MOOC 83.1(2.1%) 79.0(2.1%) 87.4(1.2%) 79.9(2.9%) 68.8(2.2%) 59.8(3.4%) 68.4(5.8%) 70.3(5.5%) 80.0(1.5%) 3.0%±1.7%
Wiki. 81.5(0.4%) 78.3(0.1%) 83.7(0.6%) 82.9(0.7%) 71.3(0.4%) 77.2(0.1%) 84.6(0.6%) 87.3(0.6%) 79.8(0.3%) 0.4%±0.2%
LastFM 76.3(2.2%) 69.0(3.7%) 79.2(1.9%) 65.2(4.7%) 66.3(2.6%) 78.0(0.2%) 62.5(2.7%) 59.9(9.2%) 78.2(1.0%) 3.1%±2.6%
Myket 64.4(0.6%) 64.1(0.1%) 61.2(0.1%) 57.8(0.4%) 33.5(3.1%) 52.6(1.3%) 58.2(0.3%) 59.8(0.5%) 33.8(3.0%) 1.0%±1.2%
Social 92.1(0.8%) 92.2(0.5%) 92.2(0.5%) 92.5(0.1%) 86.5(1.4%) 84.9(1.1%) 94.7(0.6%) 94.6(0.6%) 97.3(0.0%) 0.6%±0.4%
Reddit 80.6(0.0%) 79.5(0.0%) 80.4(0.0%) 78.6(0.1%) 80.2(0.0%) 78.6(0.1%) 76.2(0.1%) 77.1(0.1%) 80.2(0.0%) 0.0%±0.1%
µ±σ  2.4%±2.9%  4.1%±6.1%  0.8%±0.7%  4.2%±6.0%  3.3%±4.4%  1.8%±1.9%  3.4%±4.0%  3.2%±4.0%  2.1%±3.6%
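As a sketch of the per-window metric described in the caption of Table 3, assuming scikit-learn (how the actual pipeline handles windows containing only one class is not specified here):

```python
# Sketch: AUC-ROC computed per time window and averaged with equal weight
# per window, assuming scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score

def windowed_auc(labels_per_window, scores_per_window):
    aucs = [
        roc_auc_score(y, s)
        for y, s in zip(labels_per_window, scores_per_window)
        if len(set(y)) > 1  # AUC is undefined if a window has only one class
    ]
    return float(np.mean(aucs))
```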
We further observe that, for dynamic link forecasting, the performance of memory-based models
on discrete-time temporal graphs tends to decrease more than for other methods. This is expected
since these models incorporate information about the present snapshot by updating their memory
based on prior batches, which means using part of the snapshot’s edges to predict its remaining edges.
Our evaluation method prevents this information leakage, which explains the substantial drop in
performance. For Contacts, the discrete-time dataset with the highest NMI score, we see the smallest
changes in model performance. This demonstrates that the models’ performance obtained through
batch-oriented evaluation reflects the time-window-based performance more closely when a given
batch size defines more homogeneous time windows. However, this is often not the case in real-world
discrete-time temporal graphs with low granularity and large snapshots.
Table 4: Test AUC-ROC scores as in Table 3 but for discrete-time graphs.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
UN V. 54.0(26.7%) 52.2(28.2%) 51.3(27.1%) 54.4(3.0%) 53.7(7.1%) 89.6(0.0%) 53.4(0.6%) 56.9(1.1%) 65.2(3.5%) 10.8%±12.6%
US L. 52.5(6.8%) 61.8(22.6%) 57.7(31.2%) 78.6(0.2%) 82.0(0.2%) 68.4(1.3%) 75.4(0.3%) 90.4(0.2%) 89.4(0.0%) 7.0%±11.7%
UN Tr. 57.7(12.8%) 50.3(20.4%) 54.3(14.0%) 64.1(3.9%) 67.6(4.5%) 85.6(1.0%) 63.7(4.5%) 68.6(3.4%) 70.7(3.4%) 7.5%±6.6%
Can. P. 63.6(0.5%) 67.5(1.2%) 73.2(0.2%) 72.7(1.5%) 70.0(2.9%) 63.2(0.4%) 69.5(2.0%) 80.7(0.6%) 85.5(12.5%) 2.4%±3.9%
Flights 67.4(3.1%) 66.0(4.3%) 68.1(1.0%) 72.6(0.0%) 65.2(0.3%) 74.6(0.0%) 70.6(0.0%) 70.7(0.0%) 68.6(0.5%) 1.0%±1.6%
Cont. 95.6(0.1%) 94.9(0.5%) 96.6(0.5%) 95.9(0.6%) 86.7(4.1%) 93.0(0.9%) 95.7(1.7%) 95.2(1.1%) 97.7(0.6%) 1.1%±1.2%
µ±σ  8.3%±10.2%  12.9%±12.2%  12.3%±14.1%  1.5%±1.6%  3.2%±2.7%  0.6%±0.5%  1.5%±1.7%  1.1%±1.2%  3.4%±4.7%
5 Conclusion
In this work, we considered issues associated with current evaluation practices for dynamic link
prediction in temporal graphs. To address computational limitations, edges in the test set are split into
fixed-size batches, causing several issues: In continuous-time temporal graphs, fixed-size batches
create time windows of varying length that depend on the temporal activity patterns. For edges within a batch, we further lose information on their temporal ordering. In discrete-time temporal graphs, where snapshots are typically larger than the batch size, batches impose an ordering of edges which is not
present in the data. Moreover, state-of-the-art approaches for dynamic link prediction have treated the
batch size as a tunable parameter. However, changing the batch size actually changes the prediction
task, resulting in incomparable results between different batch sizes.
We solve these issues by formulating the dynamic link forecasting task. Dynamic link forecasting
acknowledges the resolution at which temporal interaction data is recorded and explicitly considers a
forecasting horizon corresponding to a prediction time window of a fixed duration. Depending on the
dataset and problem setting, the horizon may span seconds, minutes, hours, or longer, but crucially,
time windows always span the same length. We evaluated dynamic link forecasting performance
of nine state-of-the-art temporal graph learning approaches on 14 real-world datasets, comparing it
to the common dynamic link prediction evaluation. We find substantial differences, especially for
memory-based TGNNs. We provide data loader implementations for commonly used evaluation
frameworks to facilitate practical applications of our evaluation approach.
Limitations and Open Issues. Limitations of our work include that our reformulation of the dynamic link prediction task suggests time-window-based approaches for model training, which, however, go beyond the scope of our paper. After completing our experiments, we became aware of recent work proposing a correction technique that could account for some of the issues addressed by our work during training [34]. Future work could thus evaluate whether this correction approach can mitigate some of the differences observed in our results.
Despite these open issues, we believe that our work is an important and timely contribution that can
help to improve evaluation practices in temporal graph learning.
Acknowledgments and Disclosure of Funding
Moritz Lampert and Christopher Blöcker acknowledge funding from the German Federal Ministry of
Education and Research, Grant No. 100582863 (TissueNet). Ingo Scholtes and Christopher Blöcker
acknowledge funding through the Swiss National Science Foundation, Grant No. 176938.
References
[1] Marco Bardoscia et al. "The Physics of Financial Networks". In: Nat Rev Phys 3.7 (July 2021), pp. 490–507. DOI: 10.1038/s42254-021-00322-5.
[2] Konstantinos Benidis et al. "Deep Learning for Time Series Forecasting: Tutorial and Literature Survey". In: ACM Comput. Surv. 55.6 (2023), 121:1–121:36. DOI: 10.1145/3533382.
[3] Michael M. Bronstein et al. "Geometric Deep Learning: Going beyond Euclidean Data". In: IEEE Signal Process. Mag. 34.4 (2017), pp. 18–42. DOI: 10.1109/MSP.2017.2693418.
[4] O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.
[5] Weilin Cong et al. "Do We Really Need Complicated Model Architectures For Temporal Networks?" In: ICLR. OpenReview.net, 2023.
[6] Gabriele Corso et al. "Graph Neural Networks". In: Nat Rev Methods Primers 4.1 (Mar. 7, 2024), pp. 1–13. DOI: 10.1038/s43586-024-00294-7.
[7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Second Edition. John Wiley & Sons, 2006. ISBN: 978-0-471-24195-9.
[8] Laurianne David et al. "Molecular representations in AI-driven drug discovery: a review and practical guide". In: Journal of Cheminformatics 12.1 (2020), p. 56.
[9] Eric H. Davidson et al. "A Genomic Regulatory Network for Development". In: Science 295.5560 (Mar. 2002), pp. 1669–1678. DOI: 10.1126/science.1069883.
[10] ZhengZhao Feng et al. "A Comprehensive Survey of Dynamic Graph Neural Networks: Models, Frameworks, Benchmarks, Experiments and Challenges". In: CoRR abs/2405.00476 (2024).
[11] Matthias Fey and Jan Eric Lenssen. "Fast Graph Representation Learning with PyTorch Geometric". In: CoRR abs/1903.02428 (2019). URL: https://www.pyg.org/ (visited on 12/05/2022).
[12] James H. Fowler. "Legislative cosponsorship networks in the US House and Senate". In: Social Networks 28.4 (2006), pp. 454–465. ISSN: 0378-8733. DOI: 10.1016/j.socnet.2005.11.003.
[13] Julia Gastinger et al. "Comparing Apples and Oranges? On the Evaluation of Methods for Temporal Knowledge Graph Forecasting". In: ECML/PKDD (3). Vol. 14171. Springer, 2023, pp. 533–549. DOI: 10.1007/978-3-031-43418-1_32.
[14] Weihua Hu et al. "Open Graph Benchmark: Datasets for Machine Learning on Graphs". In: NeurIPS. 2020.
[15] Shenyang Huang et al. "Laplacian Change Point Detection for Dynamic Graphs". In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20). ACM, 2020, pp. 349–358. ISBN: 9781450379984. DOI: 10.1145/3394486.3403077.
[16] Shenyang Huang et al. "Temporal Graph Benchmark for Machine Learning on Temporal Graphs". In: NeurIPS. 2023.
[17] Seyed Mehran Kazemi et al. "Representation Learning for Dynamic Graphs: A Survey". In: J. Mach. Learn. Res. 21 (2020), 70:1–70:73.
[18] Srijan Kumar, Xikun Zhang, and Jure Leskovec. "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks". In: KDD. ACM, 2019, pp. 1269–1278.
[19] David Lazer et al. "Computational Social Science". In: Science 323.5915 (Feb. 6, 2009), pp. 721–723. DOI: 10.1126/science.1167742.
[20] Erfan Loghmani and MohammadAmin Fazli. "Effect of Choosing Loss Function When Using T-batching for Representation Learning on Dynamic Networks". In: CoRR abs/2308.06862 (2023).
[21] Antonio Longa et al. "Graph Neural Networks for Temporal Graphs: State of the Art, Open Challenges, and Opportunities". In: CoRR abs/2302.01018 (2023).
[22] Graham K. MacDonald et al. "Rethinking Agricultural Trade Relationships in an Era of Globalization". In: BioScience 65.3 (Feb. 2015), pp. 275–289. ISSN: 0006-3568. DOI: 10.1093/biosci/biu225.
[23] Anmol Madan et al. "Sensing the "Health State" of a Community". In: IEEE Pervasive Computing 11.4 (2012), pp. 36–45. DOI: 10.1109/MPRV.2011.79.
[24] Pietro Panzarasa, Tore Opsahl, and Kathleen M. Carley. "Patterns and dynamics of users' behavior and interaction: Network analysis of an online community". In: Journal of the American Society for Information Science and Technology 60.5 (2009), pp. 911–932. DOI: 10.1002/asi.21015.
[25] Aldo Pareja et al. "EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs". In: Proceedings of the AAAI Conference on Artificial Intelligence 34.04 (Apr. 2020), pp. 5363–5370. DOI: 10.1609/aaai.v34i04.5984.
[26] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: NeurIPS. 2019, pp. 8024–8035. URL: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
[27] Farimah Poursafaei et al. "Towards Better Evaluation for Dynamic Link Prediction". In: NeurIPS. 2022.
[28] Emanuele Rossi et al. "Temporal Graph Networks for Deep Learning on Dynamic Graphs". In: CoRR abs/2006.10637 (2020).
[29] Aravind Sankar et al. "DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks". In: Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM '20). ACM, 2020, pp. 519–527. ISBN: 9781450368223. DOI: 10.1145/3336191.3371845.
[30] Piotr Sapiezynski et al. "Interaction data from the Copenhagen Networks Study". In: Scientific Data 6.1 (Dec. 2019), p. 315. ISSN: 2052-4463. DOI: 10.1038/s41597-019-0325-x.
[31] Matthias Schäfer et al. "Bringing up OpenSky: A large-scale ADS-B sensor network for research". In: IPSN-14 Proceedings of the 13th International Symposium on Information Processing in Sensor Networks. 2014, pp. 83–94. DOI: 10.1109/IPSN.2014.6846743.
[32] Jitesh Shetty and Jafar Adibi. "The Enron email dataset database schema and brief statistical report". In: Information Sciences Institute Technical Report, University of Southern California 4.1 (2004), pp. 120–128.
[33] Joakim Skarding, Bogdan Gabrys, and Katarzyna Musial. "Foundations and Modeling of Dynamic Networks Using Dynamic Graph Neural Networks: A Survey". In: IEEE Access 9 (2021), pp. 79143–79168. DOI: 10.1109/ACCESS.2021.3082932.
[34] Junwei Su, Difan Zou, and Chuan Wu. "PRES: Toward Scalable Memory-Based Dynamic Graph Neural Networks". In: ICLR (Poster). OpenReview.net, 2024.
[35] Aynaz Taheri, Kevin Gimpel, and Tanya Berger-Wolf. "Learning to Represent the Evolution of Dynamic Graphs with Recurrent Models". In: Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). ACM, 2019, pp. 301–307. ISBN: 9781450366755. DOI: 10.1145/3308560.3316581.
[36] Rakshit Trivedi et al. "DyRep: Learning Representations over Dynamic Graphs". In: ICLR (Poster). OpenReview.net, 2019.
[37] Philippe Vanhems et al. "Estimating Potential Infection Transmission Routes in Hospital Wards Using Wearable Proximity Sensors". In: PLOS ONE 8.9 (Sept. 11, 2013), e73970. DOI: 10.1371/journal.pone.0073970.
[38] Erik Voeten, Anton Strezhnev, and Michael Bailey. United Nations General Assembly Voting Data. Version V32. 2009. DOI: 10.7910/DVN/LEJUQZ.
[39] Lu Wang et al. "TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning". In: CoRR abs/2105.07944 (2021).
[40] Yanbang Wang et al. "Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks". In: ICLR. OpenReview.net, 2021.
[41] Da Xu et al. "Inductive Representation Learning on Temporal Graphs". In: ICLR. OpenReview.net, 2020.
[42] Guotong Xue et al. "Dynamic Network Embedding Survey". In: Neurocomputing 472 (2022), pp. 212–223. DOI: 10.1016/j.neucom.2021.03.138.
[43] Le Yu et al. "Towards Better Dynamic Graph Learning: New Architecture and Unified Library". In: NeurIPS. 2023.
[44] Hongkuan Zhou et al. "DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training". In: SC. ACM, 2023, 39:1–39:12.
[45] Hongkuan Zhou et al. "TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs". In: CoRR abs/2203.14883 (2022).
A Negative sampling approaches

Dynamic link prediction is typically framed as a binary classification problem: predict class 1 for links that exist during a certain time window and 0 otherwise. Due to the sparsity of most real-world graphs, it usually suffices to train and evaluate using all existing (positive) edges and a sample of non-existing (negative) edges out of all possible edges $V^2$. In static link prediction, negative edges are typically sampled randomly from $V^2$ without replacement, but Poursafaei et al. [27] showed that this technique is ill-suited for dynamic link prediction. One reason is rooted in the characteristics of temporal graphs, where already-seen interactions tend to repeat several times during the observation period. To address this issue, Poursafaei et al. [27] introduced the negative sampling strategies which we cover in the following.

Given a training set $E_{train}$ and test set $E_{test}$, each containing a sequence of edges $E_{t_{min}:t_{max}}$ in the temporal graph $G$, we can define the following commonly used sampling strategies for drawing negative samples $B_i^-$ for batch $B_i$ with $|B_i| = |B_i^-|$ [27, 43].

• Random (rnd): Sample $B_i^-$ from $V^2$ without replacement. The subgraph corresponding to $B_i$ is assumed to be sparse, making it unlikely to sample a positive edge $e \in B_i$ as negative.

• Historic (his): Sample $B_i^-$ without replacement from all training edges $E_{hist} = E_{train} \setminus \{(v_j, u_j) \mid t_{i \cdot b} \le t_j \le t_{(i+1) \cdot b}\}$, i.e., excluding the ones appearing at the same time as the edges in $B_i$. If $|E_{hist}| < |B_i|$, draw the remaining edges randomly as described above.

• Inductive (ind): Sample $B_i^-$ without replacement from all unseen test edges $E_{ind} = E_{test} \setminus (E_{train} \cup \{(v_j, u_j) \mid t_{i \cdot b} \le t_j \le t_{(i+1) \cdot b}\})$, i.e., excluding the ones appearing at the same time as the edges in $B_i$. If $|E_{ind}| < |B_i|$, draw the remaining edges randomly as described above.

Note that we leave out the validation set $E_{val}$ for simplicity. Negative edges for $E_{val}$ can be sampled as for $E_{test}$.
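To make the strategies concrete, a minimal sketch of the random and historic variants (helper names are our own; DyGLib's actual implementation differs in details such as vectorization):

```python
# Minimal sketch of random and historic negative sampling for one batch.
# Helper names are our own; DyGLib's implementation differs in details.
import random

def random_negatives(nodes, k):
    """Sample k distinct node pairs uniformly from V^2 (sparse graph assumed)."""
    pairs = set()
    while len(pairs) < k:
        pairs.add((random.choice(nodes), random.choice(nodes)))
    return list(pairs)

def historic_negatives(train_edges, batch_times, nodes, k):
    """Sample k previously seen edges not occurring at the batch's timestamps;
    fall back to random sampling if the historical pool is too small."""
    pool = list({(u, v) for (u, v, t) in train_edges if t not in batch_times})
    if len(pool) >= k:
        return random.sample(pool, k)
    return pool + random_negatives(nodes, k - len(pool))
```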
B Datasets
In this work, we use eight continuous-time and six discrete-time datasets, listed in Table 1. Here,
we describe the systems from which the datasets were collected and plot the datasets' link count
histograms in Figure 5.
[Figure 5a: link count histograms (log-scale y-axis) for the continuous-time datasets Enron, UCI, MOOC, Wiki., LastFM, Myket, Social, and Reddit.]
(a) In the continuous-time temporal graphs, at most one edge per timestamp is the most common case; how often more than one edge occurs per timestamp varies between datasets.
[Figure 5b: link count histograms for the discrete-time datasets UN V., US L., UN Tr., Can. P., Flights, and Cont.]
(b) The snapshots in discrete-time temporal graphs contain large numbers of edges, typically much larger than
commonly utilized batch sizes. The Contacts dataset has fewer links per snapshot due to its much higher
resolution than the remaining discrete-time datasets.
Figure 5: The link count histograms show how many edges occur per timestamp in continuous-time
temporal graphs and per snapshot in discrete-time temporal graphs, respectively.
Enron [32] is a bipartite continuous-time graph where nodes are users and the temporal
edges represent emails sent between users. Emails with multiple recipients are recorded as
separate and simultaneously occurring edges, one per recipient. The temporal edges are
resolved at the second level and the dataset spans approximately 3.6 years.
UCI [24] is a unipartite continuous-time social network dataset from an online platform at
the University of California at Irvine. The nodes represent students and the timestamped
edges represent communication between the students. The dataset spans approximately six
and a half months.
MOOC (massive open online course) [18] is a bipartite continuous-time graph where nodes
represent users and units in an online course, such as problems or videos. Temporal edges
are resolved at the second level and encode when a user interacts with a unit of the online
course. The dataset spans approximately one month.
Wikipedia (Wiki.) [18] is a bipartite continuous-time graph where nodes represent editors
and Wikipedia articles. The timestamped edges are resolved at the second level and represent
when an editor has edited an article. The dataset spans approximately one month.
LastFM is a bipartite continuous-time graph where nodes represent users and songs. Tem-
poral edges are resolved at the second level and model the users’ listening behavior and
represent when a user has listened to a song. The dataset was originally published by Celma
[4] and later filtered by Kumar et al. [18] for use in a temporal graph learning context.
Myket [20] is a bipartite continuous-time graph where nodes represent users and Android
applications. The timestamped edges represent when a user installed an application. The
dataset spans approximately six and a half months.
Social Evolution (Social) [23] is a unipartite continuous-time graph of the proximity
between the students in a dormitory, collected between October 2008 and May 2009 using
mobile phones. Temporal edges connect students when they are in proximity and are
resolved at the second level.
Reddit [18] is a bipartite continuous-time graph where nodes represent Reddit users and
their posts. The timestamped edges are resolved at the second level and represent when a
user has made a post on Reddit. The dataset spans approximately one month.
UN Vote (UN V.)[38, 27] is a weighted unipartite discrete-time graph of votes in the United
Nations General Assembly between 1946 and 2020. Nodes represent countries and edges
connect countries if they both vote “yes”. The dataset is resolved at the year level and edge
weights represent how many times the two connected countries have both voted “yes” in the
same vote.
US Legislators (US L.) [15, 12, 27] is a weighted unipartite discrete-time graph of interac-
tions between legislators in the US Senate. Nodes represent legislators and edges represent
co-sponsorship, i.e., edges connect legislators who co-sponsor the same bill. The dataset
is resolved at the congress level and edge weights encode the number of co-sponsorships
during a congress.
UN Trade (UN Tr.) [22, 27] is a directed and weighted unipartite discrete-time graph of food
and agricultural trade between countries where nodes represent countries. The dataset spans
30 years and is resolved at the year level. Weighted edges encode the sum of normalized
agriculture imports or exports between two countries during a given year.
Canadian Parliament (Can. P.) [15, 27] is a weighted unipartite discrete-time political
network where nodes represent Members of the Canadian Parliament (MPs) and an edge
between two MPs means that they have both voted “yes” on a bill. The dataset is resolved
at the year level and the edges’ weights represent how often the two connected MPs voted
“yes” on the same bill during a year.
Flights [31, 27] is a directed and weighted unipartite discrete-time graph where nodes
represent airports and edges represent flights during the COVID-19 pandemic. The edges
are resolved at the day level and their weights are given by the number of flights between
two airports during the respective day.
Contacts (Cont.) [30, 27] is a unipartite discrete-time proximity network between university
students. Nodes represent students who are connected by an edge if they were in close
proximity during a time window. The dataset is resolved at the 5-minute level and spans one
month.
C Experimental details
For reproducibility, we provide a Python package extending the dynamic graph learning library DyGLib³ [43] as a supplement, including a bash script to run the experiments. The code will be made publicly available on GitHub after acceptance of the paper.
We use the best hyperparameters reported by Yu et al. [43] and, for completeness, list these hyperpa-
rameters for the 13 datasets used by Yu et al. [43] below. However, the Myket dataset [20] was not
included in the study. Therefore, for Myket, we use each method’s default parameters as suggested
by the respective authors.
We use 9 state-of-the-art dynamic graph learning models and baselines (JODIE [18], DyRep [36],
TGAT [41], TGN [28], CAWN [40], EdgeBank [27], TCL [39], GraphMixer [5] and DyGFormer
[43]). The neural-network-based approaches (all except EdgeBank) are trained five times for 100
epochs using the Adam optimizer with a learning rate of 0.0001. An early-stopping strategy with
a patience of 5 is employed to avoid overfitting. For training and validation, a batch size of 200 is
used. The training, validation and test sets of each dataset contain 70%, 15% and 15% of the edges,
respectively. The sets are split based on time, i.e., the training set contains the edges that occurred
first while the test set comprises the most recent edges.
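A minimal sketch of this chronological split (our own helper, for illustration):

```python
# Sketch of the chronological 70/15/15 split: the earliest edges form the
# training set, the most recent ones the test set.
def temporal_split(edges, val_ratio=0.15, test_ratio=0.15):
    edges = sorted(edges, key=lambda e: e[2])  # (u, v, t), sorted by timestamp
    n = len(edges)
    n_test, n_val = int(n * test_ratio), int(n * val_ratio)
    train = edges[: n - n_val - n_test]
    val = edges[n - n_val - n_test : n - n_test]
    test = edges[n - n_test :]
    return train, val, test
```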
The experiments were conducted on a variety of machines with different CPUs and GPUs. A list of
machine specifications is provided in Table 5.
Table 5: Hardware details of the machines used for the experiments.
(a) CPUs
CPU
AMD Ryzen Threadripper PRO 5965WX 24 Cores
AMD Ryzen 9 7900X 12 Cores
11th Gen Intel(R) Core(TM) i9-11900K 8 Cores
AMD Ryzen 9 7950X 16 Cores
13th Gen Intel(R) Core(TM) i9-13900H 14 Cores
(b) GPUs
GPU
NVIDIA GeForce RTX 3090 Ti
NVIDIA GeForce RTX 4080
NVIDIA GeForce RTX 3090
NVIDIA GeForce RTX 4090
NVIDIA GeForce RTX 4060 (Laptop)
NVIDIA A100
NVIDIA GeForce RTX 2080 Ti
NVIDIA TITAN Xp
NVIDIA TITAN X
NVIDIA Quadro RTX 8000
For all model architectures, time-related representations use a size of 100 dimensions while all other
non-time-related representations are set to 172. An exception is DyGFormer where the neighbor
co-occurrence encoding and the aligned encoding each have 50 dimensions. We use eight attention
heads for CAWN, and two attention heads for all other attention-based methods. The memory-based
models either use a vanilla recurrent neural network (JODIE and DyRep), or a gated recurrent unit
(GRU) to update their memory. Other model-specific parameters are provided in Table 6.
³ https://github.com/yule-BUAA/DyGLib (MIT License)
Table 6: Specific hyperparameters for different models and datasets.
(a) Hyperparameters for neighborhood sampling-based models. nNeighbors is the number of sampled neighbors using the specified neighbor sampling strategy. nLayers is the number of transformer layers (TCL), the number of MLP-Mixer layers (GraphMixer), or the number of GNN layers otherwise.
Dataset Model Neigh. Sampling nNeighbors nLayers Dropout
Wikipedia
DyRep recent 10 1 0.1
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL recent 20 2 0.1
GraphMixer recent 30 2 0.5
Reddit
DyRep recent 10 1 0.1
TGAT uniform 20 2 0.1
TGN recent 10 1 0.1
TCL uniform 20 2 0.1
GraphMixer recent 10 2 0.5
MOOC
DyRep recent 10 1 0.0
TGAT recent 20 2 0.1
TGN recent 10 1 0.2
TCL recent 20 2 0.1
GraphMixer recent 20 2 0.4
LastFM
DyRep recent 10 1 0.0
TGAT recent 20 2 0.1
TGN recent 10 1 0.3
TCL recent 20 2 0.1
GraphMixer recent 10 2 0.0
Enron
DyRep recent 10 1 0.0
TGAT recent 20 2 0.2
TGN recent 10 1 0.0
TCL recent 20 2 0.1
GraphMixer recent 20 2 0.5
Social Evo.
DyRep recent 10 1 0.1
TGAT recent 20 2 0.1
TGN recent 10 1 0.0
TCL recent 20 2 0.0
GraphMixer recent 20 2 0.3
UCI
DyRep recent 10 1 0.0
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL recent 20 2 0.0
GraphMixer recent 20 2 0.4
Myket
DyRep recent 10 1 0.1
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL recent 20 2 0.1
GraphMixer recent 20 2 0.1
Flights
DyRep recent 10 1 0.1
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL recent 20 2 0.1
GraphMixer recent 20 2 0.2
Can. Parl.
DyRep uniform 10 1 0.0
TGAT uniform 20 2 0.2
TGN uniform 10 1 0.3
TCL uniform 20 2 0.2
GraphMixer uniform 20 2 0.2
US Legis.
DyRep recent 10 1 0.0
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL uniform 20 2 0.3
GraphMixer recent 20 2 0.4
UN Trade
DyRep recent 10 1 0.1
TGAT uniform 20 2 0.1
TGN recent 10 1 0.2
TCL uniform 20 2 0.0
GraphMixer uniform 20 2 0.1
UN Vote
DyRep recent 10 1 0.1
TGAT recent 20 2 0.2
TGN uniform 10 1 0.1
TCL uniform 20 2 0.0
GraphMixer uniform 20 2 0.0
Contacts
DyRep recent 10 1 0.0
TGAT recent 20 2 0.1
TGN recent 10 1 0.1
TCL recent 20 2 0.0
GraphMixer recent 20 2 0.1
(b) Hyperparameters for DyGFormer.
Dataset Model Sequence Length Patch Size Dropout
Wikipedia DyGFormer 32 1 0.1
Reddit DyGFormer 64 2 0.2
MOOC DyGFormer 256 8 0.1
LastFM DyGFormer 512 16 0.1
Enron DyGFormer 256 8 0.0
Social Evo. DyGFormer 32 1 0.1
UCI DyGFormer 32 1 0.1
Myket DyGFormer 32 1 0.1
Flights DyGFormer 256 8 0.1
Can. Parl. DyGFormer 2048 64 0.1
US Legis. DyGFormer 256 8 0.0
UN Trade DyGFormer 256 8 0.0
UN Vote DyGFormer 128 4 0.2
Contacts DyGFormer 32 1 0.0
(c) Hyperparameters for CAWN.
Dataset Model Walk Length Time Scale Dropout
Wikipedia CAWN 1 0.000001 0.1
Reddit CAWN 1 0.000001 0.1
MOOC CAWN 1 0.000001 0.1
LastFM CAWN 1 0.000001 0.1
Enron CAWN 1 0.000001 0.1
Social Evo. CAWN 1 0.000001 0.1
UCI CAWN 1 0.000001 0.1
Myket CAWN 1 0.000001 0.1
Flights CAWN 1 0.000001 0.1
Can. Parl. CAWN 1 0.000001 0.0
US Legis. CAWN 1 0.000001 0.1
UN Trade CAWN 1 0.000001 0.1
UN Vote CAWN 1 0.000001 0.1
Contacts CAWN 1 0.000001 0.1
(d) Hyperparameters for EdgeBank.
Dataset Model Neg. Sampling Memory Mode Time Window
Wikipedia EdgeBank
random unlimited -
historical repeat threshold -
inductive repeat threshold -
Reddit EdgeBank
random unlimited -
historical repeat threshold -
inductive repeat threshold -
MOOC EdgeBank
random time window fixed proportion
historical time window repeat interval
inductive repeat threshold -
LastFM EdgeBank
random time window fixed proportion
historical time window repeat interval
inductive repeat threshold -
Enron EdgeBank
random time window fixed proportion
historical time window repeat interval
inductive repeat threshold -
Social Evo. EdgeBank
random repeat threshold -
historical repeat threshold -
inductive repeat threshold -
UCI EdgeBank
random unlimited -
historical time window fixed proportion
inductive time window repeat interval
Myket EdgeBank
random unlimited -
historical repeat threshold -
inductive repeat threshold -
Flights EdgeBank
random unlimited -
historical repeat threshold -
inductive repeat threshold -
Can. Parl. EdgeBank
random time window fixed proportion
historical time window fixed proportion
inductive repeat threshold -
US Legis. EdgeBank
random time window fixed proportion
historical time window fixed proportion
inductive time window fixed proportion
UN Trade EdgeBank
random time window repeat interval
historical time window repeat interval
inductive repeat threshold -
UN Vote EdgeBank
random time window repeat interval
historical time window repeat interval
inductive time window repeat interval
Contacts EdgeBank
random time window repeat interval
historical time window repeat interval
inductive repeat threshold -
D Detailed AUC-ROC and average precision results
Here, we provide detailed tabulated results for all models’ AUC-ROC and average precision perfor-
mance across five runs, including standard deviations.
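Throughout this appendix, the relative changes reported in parentheses (e.g., in Tables 9 and 10) compare a link forecasting score to the corresponding link prediction score; under this reading, the aggregate rows and columns report means over absolute relative changes:

\Delta_{\mathrm{rel}} = \frac{s_{\mathrm{forec}} - s_{\mathrm{pred}}}{s_{\mathrm{pred}}}, \qquad \mu = \frac{1}{n} \sum_{i=1}^{n} \left| \Delta_{\mathrm{rel},i} \right|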
AUC-ROC
Table 7: Average AUC-ROC performance over five runs for the test set of the continuous-time datasets
from [27, 43], including standard deviations.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
Enron 84.0 ± 5.1 80.3 ± 1.4 67.9 ± 7.1 69.0 ± 1.6 75.7 ± 0.5 82.7 ± 0.0 75.1 ± 5.2 88.6 ± 0.5 84.5 ± 0.6
UCI 86.8 ± 1.0 60.2 ± 2.8 62.1 ± 1.3 55.2 ± 1.4 56.5 ± 0.5 72.5 ± 0.0 56.3 ± 1.0 80.2 ± 1.0 75.7 ± 0.5
MOOC 83.1 ± 4.2 79.0 ± 4.5 87.4 ± 1.9 79.9 ± 0.8 68.8 ± 1.6 59.8 ± 0.0 68.4 ± 1.4 70.3 ± 1.2 80.0 ± 9.0
Wiki. 81.5 ± 0.4 78.3 ± 0.4 83.7 ± 0.6 82.9 ± 0.3 71.3 ± 0.8 77.2 ± 0.0 84.6 ± 0.5 87.3 ± 0.3 79.8 ± 1.6
LastFM 76.3 ± 0.8 69.0 ± 1.4 79.2 ± 2.7 65.2 ± 0.9 66.3 ± 0.3 78.0 ± 0.0 62.5 ± 6.4 59.9 ± 1.4 78.2 ± 0.6
Myket 64.4 ± 2.2 64.1 ± 2.9 61.2 ± 2.6 57.8 ± 0.5 33.5 ± 0.4 52.6 ± 0.0 58.2 ± 2.2 59.8 ± 0.4 33.8 ± 0.9
Social 92.1 ± 1.9 92.2 ± 0.7 92.2 ± 2.6 92.5 ± 0.5 86.5 ± 0.0 84.9 ± 0.0 94.7 ± 0.5 94.6 ± 0.2 97.3 ± 0.1
Reddit 80.6 ± 0.1 79.5 ± 0.8 80.4 ± 0.4 78.6 ± 0.7 80.2 ± 0.3 78.6 ± 0.0 76.2 ± 0.4 77.1 ± 0.4 80.2 ± 1.1
Pred.
Enron 77.4 ± 3.6 73.5 ± 2.4 68.0 ± 2.9 58.7 ± 1.2 66.4 ± 0.4 79.8 ± 0.0 67.6 ± 5.5 81.3 ± 0.8 76.4 ± 0.5
UCI 83.3 ± 1.4 51.4 ± 7.8 63.0 ± 1.3 59.6 ± 1.5 58.2 ± 0.6 69.1 ± 0.0 60.0 ± 0.9 80.6 ± 0.8 76.2 ± 0.6
MOOC 84.8 ± 3.1 80.7 ± 3.2 88.5 ± 1.6 82.3 ± 0.6 70.4 ± 1.3 61.9 ± 0.0 72.6 ± 0.6 74.4 ± 1.4 81.2 ± 8.9
Wiki. 81.8 ± 0.4 78.4 ± 0.4 84.1 ± 0.6 83.5 ± 0.2 71.6 ± 0.8 77.1 ± 0.0 85.2 ± 0.5 87.8 ± 0.3 80.0 ± 1.6
LastFM 78.0 ± 0.7 71.7 ± 1.1 80.7 ± 2.4 68.4 ± 0.7 68.1 ± 0.3 78.2 ± 0.0 64.3 ± 6.1 65.9 ± 1.7 78.9 ± 0.6
Myket 64.0 ± 2.1 64.2 ± 2.7 61.1 ± 2.6 57.6 ± 0.4 32.5 ± 0.4 51.9 ± 0.0 58.4 ± 2.0 59.5 ± 0.4 32.8 ± 1.0
Social 91.4 ± 2.1 92.7 ± 0.5 91.7 ± 3.3 92.6 ± 0.5 87.7 ± 0.1 85.8 ± 0.0 95.2 ± 0.2 94.1 ± 0.2 97.3 ± 0.1
Reddit 80.6 ± 0.1 79.5 ± 0.8 80.4 ± 0.4 78.7 ± 0.6 80.2 ± 0.3 78.6 ± 0.0 76.2 ± 0.4 77.1 ± 0.4 80.2 ± 1.1
Table 8: Average AUC-ROC performance over five runs for the test set of the discrete-time datasets from [27, 43], including standard deviations.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
UN V. 54.0 ± 1.8 52.2 ± 2.0 51.3 ± 7.1 54.4 ± 3.6 53.7 ± 2.1 89.6 ± 0.0 53.4 ± 1.0 56.9 ± 1.6 65.2 ± 1.1
US L. 52.5 ± 1.8 61.8 ± 3.5 57.7 ± 1.8 78.6 ± 7.9 82.0 ± 4.0 68.4 ± 0.0 75.4 ± 5.3 90.4 ± 1.5 89.4 ± 0.9
UN Tr. 57.7 ± 3.3 50.3 ± 1.4 54.3 ± 1.5 64.1 ± 1.3 67.6 ± 1.2 85.6 ± 0.0 63.7 ± 1.6 68.6 ± 2.6 70.7 ± 2.6
Can. P. 63.6 ± 0.8 67.5 ± 8.5 73.2 ± 1.1 72.7 ± 2.2 70.0 ± 1.4 63.2 ± 0.0 69.5 ± 3.1 80.7 ± 0.9 85.5 ± 3.5
Flights 67.4 ± 2.0 66.0 ± 1.9 68.1 ± 1.7 72.6 ± 0.2 65.2 ± 1.8 74.6 ± 0.0 70.6 ± 0.1 70.7 ± 0.3 68.6 ± 1.3
Cont. 95.6 ± 0.8 94.9 ± 0.3 96.6 ± 0.3 95.9 ± 0.2 86.7 ± 0.1 93.0 ± 0.0 95.7 ± 0.5 95.2 ± 0.2 97.7 ± 0.0
Pred.
UN V. 73.7 ± 2.4 72.6 ± 1.5 70.3 ± 4.3 52.8 ± 3.6 50.1 ± 1.6 89.5 ± 0.0 53.0 ± 1.6 56.2 ± 2.0 63.0 ± 1.1
US L. 56.3 ± 1.9 79.9 ± 1.1 84.0 ± 2.2 78.5 ± 7.8 81.8 ± 4.0 67.5 ± 0.0 75.6 ± 5.4 90.2 ± 1.6 89.4 ± 0.9
UN Tr. 66.1 ± 3.0 63.2 ± 2.1 63.1 ± 1.2 61.7 ± 1.3 64.7 ± 1.3 86.4 ± 0.0 60.9 ± 1.3 66.3 ± 2.5 68.3 ± 2.3
Can. P. 63.9 ± 0.7 66.6 ± 2.5 73.4 ± 3.5 71.6 ± 2.6 68.0 ± 1.0 62.9 ± 0.0 68.2 ± 3.6 81.2 ± 1.0 97.7 ± 0.7
Flights 69.5 ± 2.2 69.0 ± 1.0 68.8 ± 1.6 72.6 ± 0.2 65.0 ± 1.4 74.6 ± 0.0 70.6 ± 0.1 70.7 ± 0.3 68.9 ± 1.1
Cont. 95.5 ± 0.6 95.4 ± 0.2 96.1 ± 0.8 95.4 ± 0.3 83.3 ± 0.0 92.2 ± 0.0 94.1 ± 0.8 94.1 ± 0.2 97.1 ± 0.0
Average precision
Table 9: Mean average precision performance for dynamic link forecasting (window-based) over five runs for the continuous-time datasets. Values in parentheses show the relative change compared to the average precision performance for dynamic link prediction (batch-based).
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
Enron 80.8(12.0%) 78.3(12.4%) 68.6(5.0%) 71.7(13.5%) 76.0(15.1%) 81.1(5.5%) 78.2(11.3%) 89.8(9.2%) 85.3(11.6%) 10.6%±3.4%
UCI 87.0(7.2%) 59.5(21.4%) 69.2(2.8%) 64.4(6.2%) 64.0(1.6%) 68.6(5.4%) 65.2(5.5%) 85.3(0.6%) 80.5(0.2%) 5.7%±6.4%
MOOC 82.4(1.1%) 76.9(0.3%) 86.3(0.8%) 82.7(2.1%) 72.3(1.7%) 59.1(2.7%) 74.9(4.6%) 74.4(4.5%) 82.1(0.4%) 2.0%±1.6%
Wiki. 84.1(0.1%) 80.9(0.1%) 88.5(0.4%) 87.5(0.4%) 75.1(0.1%) 73.3(0.2%) 89.2(0.4%) 90.8(0.4%) 83.1(0.0%) 0.2%±0.2%
LastFM 76.7(1.1%) 69.4(2.7%) 78.8(1.9%) 72.1(3.9%) 68.2(2.3%) 73.4(0.3%) 70.3(1.8%) 70.0(5.4%) 80.5(0.7%) 2.2%±1.6%
Myket 64.5(1.6%) 63.1(0.9%) 62.8(1.2%) 57.9(1.4%) 46.6(3.3%) 51.9(1.3%) 58.9(1.1%) 60.0(1.5%) 46.1(3.3%) 1.7%±0.9%
Social 89.4(1.2%) 91.9(0.3%) 93.9(0.8%) 95.0(0.2%) 85.6(0.6%) 79.7(1.1%) 96.1(0.1%) 95.8(0.4%) 97.7(0.3%) 0.6%±0.4%
Reddit 80.1(0.0%) 79.2(0.0%) 80.6(0.0%) 78.6(0.1%) 81.3(0.2%) 73.5(0.2%) 76.5(0.0%) 77.5(0.0%) 82.8(0.1%) 0.1%±0.1%
µ±σ 3.0%±4.3% 4.8%±7.9% 1.6%±1.6% 3.5%±4.6% 3.1%±5.0% 2.1%±2.2% 3.1%±3.9% 2.8%±3.3% 2.1%±4.0%
Table 10: Mean average precision performance for dynamic link forecasting (window-based) over five runs for the discrete-time datasets. Values in parentheses show the relative change compared to the average precision performance for dynamic link prediction (batch-based).
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
UN V. 52.6(22.4%) 49.6(26.8%) 49.7(24.8%) 52.7(0.9%) 52.4(3.4%) 84.2(0.7%) 52.4(1.7%) 54.0(0.1%) 62.4(4.0%) 9.4%±11.6%
US L. 46.0(4.5%) 62.5(14.5%) 58.6(27.8%) 71.0(0.3%) 80.7(0.1%) 63.2(0.2%) 77.5(0.1%) 86.5(0.6%) 86.1(0.4%) 5.4%±9.6%
UN Tr. 52.7(10.6%) 49.4(16.6%) 53.2(9.7%) 59.1(2.0%) 59.2(2.4%) 79.0(2.6%) 57.5(2.3%) 65.8(3.3%) 67.1(4.4%) 6.0%±5.1%
Can. P. 52.1(1.3%) 61.0(1.3%) 69.9(2.0%) 70.8(4.5%) 68.3(6.9%) 59.4(6.8%) 68.2(6.2%) 80.9(4.9%) 83.2(14.3%) 5.4%±4.0%
Flights 65.2(2.2%) 63.9(4.3%) 68.3(0.0%) 73.5(1.1%) 64.7(1.3%) 70.4(0.2%) 71.0(0.4%) 71.9(1.0%) 68.9(0.1%) 1.2%±1.4%
Cont. 94.0(0.2%) 95.8(0.4%) 97.0(0.8%) 96.8(0.8%) 88.2(4.4%) 89.4(0.6%) 96.6(2.1%) 95.7(1.6%) 98.3(0.6%) 1.3%±1.3%
µ±σ 6.9%±8.5% 10.6%±10.4% 10.8%±12.5% 1.6%±1.5% 3.1%±2.4% 1.8%±2.6% 2.1%±2.2% 1.9%±1.8% 4.0%±5.4%
Table 11: Mean average precision performance over five runs for the test set of the continuous-time
datasets from [27, 43], including standard deviations.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
Enron 80.8 ± 5.3 78.3 ± 2.3 68.6 ± 5.6 71.7 ± 1.2 76.0 ± 0.7 81.1 ± 0.0 78.2 ± 2.9 89.8 ± 0.4 85.3 ± 0.6
UCI 87.0 ± 1.9 59.5 ± 2.3 69.2 ± 1.0 64.4 ± 1.1 64.0 ± 0.7 68.6 ± 0.0 65.2 ± 0.9 85.3 ± 0.6 80.5 ± 0.9
MOOC 82.4 ± 4.9 76.9 ± 4.3 86.3 ± 2.3 82.7 ± 0.7 72.3 ± 1.3 59.1 ± 0.0 74.9 ± 0.7 74.4 ± 0.6 82.1 ± 8.7
Wiki. 84.1 ± 0.5 80.9 ± 0.4 88.5 ± 0.4 87.5 ± 0.2 75.1 ± 1.0 73.3 ± 0.0 89.2 ± 0.3 90.8 ± 0.2 83.1 ± 1.2
LastFM 76.7 ± 0.6 69.4 ± 1.8 78.8 ± 3.5 72.1 ± 0.8 68.2 ± 0.5 73.4 ± 0.0 70.3 ± 6.5 70.0 ± 1.1 80.5 ± 0.9
Myket 64.5 ± 1.8 63.1 ± 1.5 62.8 ± 2.2 57.9 ± 0.4 46.6 ± 0.2 51.9 ± 0.0 58.9 ± 2.6 60.0 ± 0.2 46.1 ± 1.7
Social 89.4 ± 4.7 91.9 ± 1.0 93.9 ± 1.7 95.0 ± 0.3 85.6 ± 0.1 79.7 ± 0.0 96.1 ± 0.4 95.8 ± 0.2 97.7 ± 0.1
Reddit 80.1 ± 0.3 79.2 ± 0.9 80.6 ± 0.6 78.6 ± 1.0 81.3 ± 0.4 73.5 ± 0.0 76.5 ± 0.6 77.5 ± 0.5 82.8 ± 0.8
Pred.
Enron 72.1 ± 3.0 69.7 ± 3.7 65.3 ± 3.2 63.2 ± 0.5 66.0 ± 0.5 76.9 ± 0.0 70.2 ± 3.4 82.3 ± 0.6 76.4 ± 0.4
UCI 81.1 ± 3.3 49.0 ± 4.5 71.2 ± 1.1 68.6 ± 1.1 65.1 ± 0.6 65.0 ± 0.0 69.0 ± 0.8 85.9 ± 0.5 80.7 ± 1.1
MOOC 83.4 ± 4.3 77.1 ± 3.8 87.0 ± 2.1 84.5 ± 0.7 73.5 ± 1.0 60.7 ± 0.0 78.5 ± 0.5 77.9 ± 0.8 82.4 ± 9.3
Wiki. 84.1 ± 0.5 80.9 ± 0.3 88.8 ± 0.3 87.9 ± 0.2 75.0 ± 1.2 73.1 ± 0.0 89.5 ± 0.3 91.2 ± 0.2 83.1 ± 1.1
LastFM 77.6 ± 0.6 71.4 ± 1.7 80.3 ± 3.2 75.0 ± 0.7 69.8 ± 0.5 73.2 ± 0.0 71.6 ± 6.1 74.1 ± 1.3 81.1 ± 0.9
Myket 63.4 ± 1.7 62.5 ± 1.4 62.1 ± 2.3 57.1 ± 0.4 45.1 ± 0.2 51.3 ± 0.0 58.3 ± 2.2 59.1 ± 0.2 44.7 ± 1.6
Social 88.3 ± 4.8 91.6 ± 0.7 93.2 ± 2.4 94.8 ± 0.3 86.2 ± 0.2 80.6 ± 0.0 96.2 ± 0.2 95.4 ± 0.1 97.3 ± 0.1
Reddit 80.1 ± 0.3 79.2 ± 0.9 80.5 ± 0.5 78.6 ± 1.0 81.1 ± 0.4 73.7 ± 0.0 76.5 ± 0.6 77.5 ± 0.5 82.7 ± 0.8
Table 12: Mean average precision performance over five runs for the test set of the discrete-time
datasets from [27, 43], including standard deviations.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
UN V. 52.6 ± 1.8 49.6 ± 1.9 49.7 ± 3.9 52.7 ± 2.6 52.4 ± 2.0 84.2 ± 0.0 52.4 ± 0.9 54.0 ± 1.4 62.4 ± 1.7
US L. 46.0 ± 0.9 62.5 ± 3.6 58.6 ± 2.4 71.0 ± 8.9 80.7 ± 3.7 63.2 ± 0.0 77.5 ± 4.3 86.5 ± 1.9 86.1 ± 1.0
UN Tr. 52.7 ± 3.0 49.4 ± 0.9 53.2 ± 1.5 59.1 ± 2.7 59.2 ± 1.7 79.0 ± 0.0 57.5 ± 1.9 65.8 ± 1.9 67.1 ± 2.7
Can. P. 52.1 ± 0.5 61.0 ± 7.6 69.9 ± 0.8 70.8 ± 1.6 68.3 ± 2.3 59.4 ± 0.0 68.2 ± 1.6 80.9 ± 0.5 83.2 ± 2.9
Flights 65.2 ± 2.7 63.9 ± 2.8 68.3 ± 2.2 73.5 ± 0.3 64.7 ± 0.9 70.4 ± 0.0 71.0 ± 0.4 71.9 ± 0.8 68.9 ± 2.0
Cont. 94.0 ± 2.6 95.8 ± 0.4 97.0 ± 0.5 96.8 ± 0.2 88.2 ± 0.2 89.4 ± 0.0 96.6 ± 0.4 95.7 ± 0.2 98.3 ± 0.0
Pred.
UN V. 67.8 ± 1.9 67.8 ± 1.7 66.1 ± 3.9 52.3 ± 2.5 50.7 ± 1.4 84.8 ± 0.0 53.3 ± 1.3 53.9 ± 1.7 60.0 ± 1.4
US L. 48.2 ± 1.0 73.1 ± 2.2 81.2 ± 2.1 71.2 ± 8.2 80.8 ± 3.5 63.3 ± 0.0 77.4 ± 4.5 86.0 ± 2.0 85.8 ± 1.0
UN Tr. 58.9 ± 3.1 59.3 ± 1.8 58.9 ± 1.5 57.9 ± 2.4 57.9 ± 2.1 81.1 ± 0.0 56.2 ± 1.5 63.8 ± 1.6 64.3 ± 2.2
Can. P. 52.8 ± 0.5 61.8 ± 1.1 68.5 ± 2.1 67.7 ± 1.6 63.9 ± 1.3 63.8 ± 0.0 64.2 ± 2.0 77.1 ± 0.4 97.1 ± 0.7
Flights 66.7 ± 3.3 66.8 ± 1.6 68.3 ± 1.8 72.7 ± 0.2 63.9 ± 0.9 70.5 ± 0.0 70.8 ± 0.5 71.2 ± 0.7 68.9 ± 1.8
Cont. 94.2 ± 1.3 95.5 ± 0.3 96.3 ± 1.1 96.0 ± 0.3 84.5 ± 0.2 88.8 ± 0.0 94.6 ± 0.8 94.2 ± 0.1 97.7 ± 0.1
E Global performance scores
The results presented in Table 3 and Table 4 assign the same weight to the score of each time window and compute the mean over all windows to obtain the final score. This score measures model performance across time, i.e., it is equally important for a model to perform well in periods with only a few edge occurrences and in periods where many edges occur. In some scenarios, however, the goal might not be to forecast the existence of edges in all time windows equally well but instead to forecast all edges equally well. In the following, we compare link forecasting and link prediction from this perspective.
The results are presented in Table 13 for continuous-time temporal graphs and in Table 14 for discrete-time temporal graphs (corresponding average precision results in Table 17 and Table 18). In contrast to the results presented in the main part of this work, the scores are computed once over all edges instead of per time window and then averaged. The changes between link forecasting and link prediction are less pronounced if every edge is weighted equally instead of every time window. Nevertheless, we can still observe the patterns discussed above, although they are less distinct.
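The difference between the two aggregation schemes can be summarized in a short sketch. This is a minimal illustration using scikit-learn, assuming numpy arrays of binary labels y_true, model scores y_score, and window assignments window_ids for each evaluated edge sample; it is not the exact evaluation code of our implementation.

import numpy as np
from sklearn.metrics import roc_auc_score

def windowed_auc(y_true, y_score, window_ids):
    # Per-window evaluation (Tables 3 and 4): compute AUC-ROC within each
    # time window and average, so every window contributes equally.
    scores = []
    for w in np.unique(window_ids):
        mask = window_ids == w
        if len(np.unique(y_true[mask])) == 2:  # AUC needs both classes
            scores.append(roc_auc_score(y_true[mask], y_score[mask]))
    return float(np.mean(scores))

def global_auc(y_true, y_score):
    # Global evaluation (Tables 13 and 14): one AUC-ROC over all edges,
    # so every edge contributes equally.
    return roc_auc_score(y_true, y_score)

Averaging per-window scores weights sparse and dense periods equally, whereas the global score is dominated by time windows that contain many edges.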
Table 13: Test AUC-ROC scores for link forecasting and the relative change compared to link prediction for continuous-time graphs on the same trained models (standard deviations in Table 15). We compute the AUC-ROC score over all edges instead of per time window or batch as in Table 3. The last row/column provides the mean µ and standard deviation σ of the absolute relative change per column/row.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
Enron 76.8(1.1%) 73.4(0.2%) 68.4(0.2%) 58.9(0.2%) 66.7(0.5%) 78.5(1.6%) 68.0(0.5%) 81.1(0.8%) 76.8(0.5%) 0.6%±0.5%
UCI 85.4(3.5%) 60.7(18.1%) 64.0(1.5%) 59.5(0.2%) 57.9(0.5%) 71.3(3.1%) 59.9(0.1%) 80.5(0.1%) 76.4(0.2%) 3.0%±5.8%
MOOC 83.9(0.9%) 79.5(1.3%) 88.1(0.5%) 82.2(0.1%) 70.5(0.2%) 59.9(3.2%) 72.3(0.0%) 74.2(0.1%) 81.0(0.0%) 0.7%±1.0%
Wiki. 81.6(0.2%) 78.3(0.1%) 84.1(0.0%) 83.4(0.0%) 71.6(0.1%) 77.3(0.3%) 85.1(0.0%) 87.7(0.0%) 80.2(0.2%) 0.1%±0.1%
LastFM 76.6(0.6%) 70.2(1.4%) 78.2(0.3%) 68.5(0.0%) 68.0(0.0%) 78.0(0.2%) 64.3(0.0%) 66.1(0.0%) 78.9(0.0%) 0.3%±0.5%
Myket 64.0(0.1%) 64.0(0.0%) 60.7(0.1%) 57.4(0.3%) 32.6(0.3%) 52.0(0.0%) 58.4(0.1%) 59.4(0.1%) 32.9(0.3%) 0.1%±0.1%
Social 90.4(0.6%) 91.1(1.4%) 91.5(0.1%) 92.7(0.0%) 87.8(0.1%) 86.0(0.2%) 95.3(0.1%) 94.1(0.0%) 97.5(0.0%) 0.3%±0.5%
Reddit 80.5(0.1%) 79.5(0.1%) 80.3(0.1%) 78.6(0.1%) 80.2(0.1%) 78.5(0.2%) 76.2(0.1%) 77.1(0.1%) 80.1(0.0%) 0.1%±0.1%
µ±σ 0.9%±1.1% 2.8%±6.2% 0.4%±0.5% 0.1%±0.1% 0.2%±0.2% 1.1%±1.4% 0.1%±0.2% 0.1%±0.3% 0.1%±0.2%
Table 14: Test AUC-ROC scores for discrete-time temporal graphs, computed as in Table 13. All results with standard deviations are listed in Table 16.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
UN V. 56.3(25.5%) 53.3(28.7%) 52.0(25.9%) 54.3(2.8%) 53.8(7.4%) 89.7(0.1%) 53.4(0.7%) 57.1(1.4%) 63.9(3.2%) 10.6%±12.3%
US L. 52.5(7.1%) 61.8(22.1%) 57.7(31.2%) 78.6(0.1%) 82.0(0.1%) 68.4(1.3%) 75.4(0.1%) 90.4(0.3%) 89.4(0.2%) 6.9%±11.6%
UN Tr. 57.6(13.1%) 50.4(20.3%) 54.4(14.0%) 64.1(3.9%) 67.6(4.6%) 85.6(1.0%) 63.7(4.5%) 68.6(3.4%) 70.7(3.5%) 7.6%±6.5%
Can. P. 64.0(0.5%) 64.6(3.2%) 72.7(1.4%) 72.3(0.5%) 68.1(0.0%) 61.5(2.6%) 68.3(0.1%) 81.7(0.1%) 83.7(14.3%) 2.5%±4.6%
Flights 67.3(3.0%) 65.6(4.7%) 68.1(1.0%) 72.6(0.0%) 65.2(0.3%) 74.6(0.0%) 70.5(0.1%) 70.6(0.1%) 68.5(0.5%) 1.1%±1.7%
Cont. 93.3(1.0%) 94.1(1.4%) 95.6(0.5%) 95.3(0.0%) 83.4(0.1%) 92.2(0.0%) 94.7(0.5%) 93.7(0.0%) 97.2(0.0%) 0.4%±0.5%
µ±σ 8.4%±9.6% 13.4%±11.7% 12.3%±13.6% 1.2%±1.7% 2.1%±3.2% 0.8%±1.0% 1.0%±1.7% 0.9%±1.3% 3.6%±5.5%
Table 15: Test AUC-ROC scores for link forecasting and link prediction averaged over five runs with standard deviations on continuous-time temporal graphs.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
Enron 76.8 ± 3.9 73.4 ± 2.7 68.4 ± 3.4 58.9 ± 1.3 66.7 ± 0.5 78.5 ± 0.0 68.0 ± 5.7 81.1 ± 0.8 76.8 ± 0.5
UCI 85.4 ± 1.0 60.7 ± 2.7 64.0 ± 1.1 59.5 ± 1.5 57.9 ± 0.6 71.3 ± 0.0 59.9 ± 0.8 80.5 ± 0.8 76.4 ± 0.5
MOOC 83.9 ± 3.4 79.5 ± 4.1 88.1 ± 2.0 82.2 ± 0.6 70.5 ± 1.2 59.9 ± 0.0 72.3 ± 0.6 74.2 ± 1.4 81.0 ± 9.0
Wiki. 81.6 ± 0.4 78.3 ± 0.4 84.1 ± 0.6 83.4 ± 0.2 71.6 ± 0.8 77.3 ± 0.0 85.1 ± 0.5 87.7 ± 0.3 80.2 ± 1.6
LastFM 76.6 ± 0.5 70.2 ± 1.2 78.2 ± 3.0 68.5 ± 0.8 68.0 ± 0.3 78.0 ± 0.0 64.3 ± 6.0 66.1 ± 1.7 78.9 ± 0.6
Myket 64.0 ± 2.1 64.0 ± 2.7 60.7 ± 2.3 57.4 ± 0.5 32.6 ± 0.4 52.0 ± 0.0 58.4 ± 2.0 59.4 ± 0.4 32.9 ± 1.0
Social 90.4 ± 2.6 91.1 ± 1.0 91.5 ± 3.3 92.7 ± 0.5 87.8 ± 0.1 86.0 ± 0.0 95.3 ± 0.2 94.1 ± 0.2 97.5 ± 0.1
Reddit 80.5 ± 0.2 79.5 ± 0.8 80.3 ± 0.4 78.6 ± 0.7 80.2 ± 0.3 78.5 ± 0.0 76.2 ± 0.4 77.1 ± 0.5 80.1 ± 1.1
Pred.
Enron 76.0 ± 3.0 73.2 ± 2.3 68.3 ± 2.9 58.8 ± 1.2 66.4 ± 0.4 79.8 ± 0.0 67.6 ± 5.6 80.5 ± 0.8 76.4 ± 0.5
UCI 82.5 ± 1.3 51.4 ± 7.7 63.0 ± 1.3 59.6 ± 1.5 58.2 ± 0.6 69.1 ± 0.0 60.0 ± 0.9 80.7 ± 0.8 76.2 ± 0.6
MOOC 84.6 ± 3.1 80.5 ± 3.2 88.5 ± 1.6 82.1 ± 0.6 70.3 ± 1.3 61.9 ± 0.0 72.3 ± 0.6 74.1 ± 1.4 81.0 ± 9.0
Wiki. 81.7 ± 0.4 78.4 ± 0.4 84.1 ± 0.6 83.4 ± 0.2 71.5 ± 0.8 77.1 ± 0.0 85.2 ± 0.5 87.8 ± 0.3 80.0 ± 1.6
LastFM 77.1 ± 0.7 71.2 ± 1.1 78.4 ± 2.7 68.5 ± 0.7 68.1 ± 0.3 78.2 ± 0.0 64.3 ± 6.0 66.1 ± 1.7 78.9 ± 0.6
Myket 63.9 ± 2.1 64.0 ± 2.7 60.8 ± 2.3 57.5 ± 0.4 32.5 ± 0.4 51.9 ± 0.0 58.4 ± 2.0 59.5 ± 0.4 32.8 ± 1.0
Social 91.0 ± 2.4 92.4 ± 0.4 91.5 ± 3.5 92.7 ± 0.5 87.7 ± 0.1 85.8 ± 0.0 95.2 ± 0.2 94.1 ± 0.2 97.4 ± 0.1
Reddit 80.6 ± 0.1 79.5 ± 0.8 80.4 ± 0.4 78.7 ± 0.6 80.2 ± 0.3 78.6 ± 0.0 76.2 ± 0.4 77.1 ± 0.5 80.2 ± 1.1
Table 16: Test AUC-ROC scores for link forecasting and link prediction averaged over five runs with standard deviations on discrete-time temporal graphs.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
UN V. 56.3 ± 1.4 53.3 ± 0.8 52.0 ± 7.2 54.3 ± 1.4 53.8 ± 2.1 89.7 ± 0.0 53.4 ± 1.1 57.1 ± 1.6 63.9 ± 1.7
US L. 52.5 ± 1.8 61.8 ± 3.5 57.7 ± 1.8 78.6 ± 7.9 82.0 ± 4.0 68.4 ± 0.0 75.4 ± 5.3 90.4 ± 1.5 89.4 ± 0.9
UN Tr. 57.6 ± 3.3 50.4 ± 1.2 54.4 ± 1.5 64.1 ± 1.3 67.6 ± 1.2 85.6 ± 0.0 63.7 ± 1.6 68.6 ± 2.6 70.7 ± 2.6
Can. P. 64.0 ± 0.8 64.6 ± 7.5 72.7 ± 2.7 72.3 ± 2.6 68.1 ± 1.0 61.5 ± 0.0 68.3 ± 3.6 81.7 ± 0.9 83.7 ± 3.9
Flights 67.3 ± 2.0 65.6 ± 1.8 68.1 ± 1.7 72.6 ± 0.2 65.2 ± 1.7 74.6 ± 0.0 70.5 ± 0.1 70.6 ± 0.3 68.5 ± 1.3
Cont. 93.3 ± 1.9 94.1 ± 0.5 95.6 ± 0.5 95.3 ± 0.3 83.4 ± 0.1 92.2 ± 0.0 94.7 ± 0.5 93.7 ± 0.1 97.2 ± 0.0
Pred.
UN V. 75.6 ± 1.9 74.8 ± 1.2 70.2 ± 5.8 52.8 ± 1.6 50.1 ± 1.6 89.5 ± 0.0 53.0 ± 1.6 56.3 ± 2.0 61.9 ± 1.6
US L. 56.5 ± 1.9 79.3 ± 1.0 84.0 ± 2.2 78.5 ± 7.8 81.9 ± 4.0 67.5 ± 0.0 75.4 ± 5.5 90.2 ± 1.5 89.3 ± 0.9
UN Tr. 66.3 ± 3.0 63.2 ± 1.9 63.2 ± 1.2 61.7 ± 1.3 64.7 ± 1.3 86.4 ± 0.0 60.9 ± 1.3 66.3 ± 2.5 68.3 ± 2.3
Can. P. 63.6 ± 0.7 66.8 ± 2.4 73.7 ± 3.5 72.0 ± 2.6 68.1 ± 1.0 63.1 ± 0.0 68.4 ± 3.6 81.6 ± 1.0 97.7 ± 0.6
Flights 69.4 ± 2.3 68.9 ± 1.0 68.7 ± 1.6 72.6 ± 0.2 65.0 ± 1.4 74.6 ± 0.0 70.6 ± 0.1 70.7 ± 0.3 68.9 ± 1.1
Cont. 94.3 ± 1.2 95.4 ± 0.3 96.1 ± 0.7 95.3 ± 0.3 83.3 ± 0.1 92.2 ± 0.0 94.3 ± 1.0 93.7 ± 0.1 97.3 ± 0.0
Table 17: Average precision scores on continuous-time temporal graphs, computed analogously to the AUC-ROC scores in Table 13. For a full list of results with standard deviations, see Table 19.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
Enron 69.8(1.1%) 68.2(0.4%) 65.3(0.5%) 62.7(0.1%) 66.8(0.4%) 75.7(1.2%) 70.7(0.9%) 81.5(1.2%) 77.1(0.8%) 0.7%±0.4%
UCI 85.1(6.8%) 56.6(18.1%) 71.9(0.4%) 69.0(0.1%) 65.6(0.0%) 66.9(3.1%) 69.6(0.1%) 85.9(0.2%) 81.2(0.2%) 3.2%±6.0%
MOOC 82.8(0.5%) 76.2(0.6%) 86.3(0.7%) 84.4(0.0%) 73.6(0.1%) 58.7(3.2%) 78.3(0.0%) 77.7(0.0%) 82.1(0.0%) 0.6%±1.0%
Wiki. 84.1(0.0%) 80.8(0.2%) 88.8(0.0%) 87.9(0.0%) 75.2(0.3%) 73.4(0.5%) 89.5(0.0%) 91.1(0.0%) 83.3(0.3%) 0.1%±0.2%
LastFM 76.4(1.9%) 70.3(2.5%) 78.5(0.7%) 76.0(0.0%) 72.2(0.0%) 73.0(0.2%) 72.5(0.0%) 75.1(0.0%) 82.1(0.0%) 0.6%±1.0%
Myket 63.1(0.4%) 61.7(0.0%) 61.3(0.0%) 56.4(0.3%) 44.8(0.0%) 51.1(0.0%) 57.6(0.1%) 58.7(0.1%) 44.4(0.0%) 0.1%±0.2%
Social 87.0(1.0%) 90.0(1.5%) 93.0(0.0%) 94.9(0.0%) 86.6(0.2%) 80.8(0.3%) 96.4(0.1%) 95.5(0.0%) 97.6(0.1%) 0.4%±0.5%
Reddit 79.7(0.1%) 78.9(0.0%) 80.3(0.1%) 78.3(0.1%) 81.0(0.1%) 73.4(0.2%) 76.3(0.0%) 77.2(0.0%) 82.6(0.0%) 0.1%±0.1%
µ±σ 1.5%±2.2% 2.9%±6.2% 0.3%±0.3% 0.1%±0.1% 0.1%±0.2% 1.1%±1.3% 0.1%±0.3% 0.2%±0.4% 0.2%±0.3%
Table 18: Average precision scores for discrete-time temporal graphs, computed as in Table 17. All results with standard deviations are listed in Table 20.
Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer µ±σ
UN V. 53.3(23.6%) 50.6(28.2%) 49.9(22.6%) 52.5(1.7%) 52.6(4.8%) 84.1(0.4%) 52.4(0.9%) 54.1(1.4%) 60.0(3.0%) 9.6%±11.6%
US L. 46.0(4.2%) 62.5(15.3%) 58.6(28.5%) 71.0(0.2%) 80.7(0.0%) 63.2(0.0%) 77.5(0.1%) 86.5(0.7%) 86.1(0.6%) 5.5%±9.9%
UN Tr. 52.8(10.0%) 49.6(15.7%) 53.3(9.6%) 59.1(3.0%) 59.2(3.3%) 79.0(2.5%) 57.5(3.5%) 65.8(2.7%) 67.1(3.3%) 6.0%±4.7%
Can. P. 52.3(0.4%) 59.9(6.4%) 69.7(2.5%) 70.5(0.1%) 66.6(0.1%) 58.0(2.8%) 67.0(0.1%) 81.4(0.2%) 82.2(16.1%) 3.2%±5.3%
Flights 65.2(2.3%) 63.4(5.3%) 68.3(0.8%) 73.5(0.0%) 64.7(0.7%) 70.3(0.0%) 71.0(0.1%) 71.9(0.0%) 68.8(0.6%) 1.1%±1.7%
Cont. 90.2(2.2%) 95.1(0.7%) 95.7(0.7%) 96.0(0.0%) 85.2(0.0%) 88.7(0.1%) 95.4(0.4%) 93.5(0.1%) 97.9(0.0%) 0.5%±0.7%
µ±σ 7.1%±8.7% 11.9%±9.9% 10.8%±12.0% 0.8%±1.2% 1.5%±2.1% 1.0%±1.3% 0.8%±1.3% 0.8%±1.0% 3.9%±6.1%
Table 19: Test average precision scores for link forecasting and link prediction averaged over five runs with standard deviations on continuous-time temporal graphs.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
Enron 69.8 ± 3.6 68.2 ± 4.0 65.3 ± 2.8 62.7 ± 0.8 66.8 ± 0.5 75.7 ± 0.0 70.7 ± 3.7 81.5 ± 0.6 77.1 ± 0.7
UCI 85.1 ± 1.8 56.6 ± 2.4 71.9 ± 0.9 69.0 ± 1.0 65.6 ± 0.7 66.9 ± 0.0 69.6 ± 0.8 85.9 ± 0.4 81.2 ± 0.9
MOOC 82.8 ± 5.1 76.2 ± 4.3 86.3 ± 2.7 84.4 ± 0.7 73.6 ± 0.9 58.7 ± 0.0 78.3 ± 0.6 77.7 ± 0.7 82.1 ± 9.8
Wiki. 84.1 ± 0.5 80.8 ± 0.3 88.8 ± 0.4 87.9 ± 0.2 75.2 ± 1.1 73.4 ± 0.0 89.5 ± 0.3 91.1 ± 0.2 83.3 ± 1.2
LastFM 76.4 ± 0.5 70.3 ± 2.0 78.5 ± 3.9 76.0 ± 0.7 72.2 ± 0.4 73.0 ± 0.0 72.5 ± 5.9 75.1 ± 1.2 82.1 ± 0.8
Myket 63.1 ± 1.8 61.7 ± 1.5 61.3 ± 2.1 56.4 ± 0.4 44.8 ± 0.3 51.1 ± 0.0 57.6 ± 2.3 58.7 ± 0.2 44.4 ± 1.7
Social 87.0 ± 6.1 90.0 ± 1.4 93.0 ± 2.4 94.9 ± 0.3 86.6 ± 0.1 80.8 ± 0.0 96.4 ± 0.2 95.5 ± 0.2 97.6 ± 0.1
Reddit 79.7 ± 0.4 78.9 ± 0.9 80.3 ± 0.6 78.3 ± 1.0 81.0 ± 0.5 73.4 ± 0.0 76.3 ± 0.6 77.2 ± 0.5 82.6 ± 0.8
Pred.
Enron 69.0 ± 2.1 68.5 ± 4.3 65.0 ± 3.9 62.6 ± 0.6 66.6 ± 0.5 76.7 ± 0.0 70.1 ± 3.6 80.5 ± 0.6 76.5 ± 0.6
UCI 79.7 ± 3.0 48.0 ± 4.3 71.7 ± 1.1 69.1 ± 1.0 65.6 ± 0.7 64.9 ± 0.0 69.6 ± 0.8 86.0 ± 0.5 81.0 ± 1.0
MOOC 83.2 ± 4.5 76.6 ± 4.0 86.9 ± 2.2 84.3 ± 0.7 73.5 ± 1.0 60.6 ± 0.0 78.3 ± 0.5 77.7 ± 0.7 82.1 ± 9.8
Wiki. 84.1 ± 0.6 80.6 ± 0.3 88.8 ± 0.3 87.9 ± 0.2 75.0 ± 1.1 73.0 ± 0.0 89.5 ± 0.3 91.1 ± 0.2 83.1 ± 1.2
LastFM 77.9 ± 0.7 72.1 ± 1.9 79.1 ± 3.1 76.0 ± 0.6 72.2 ± 0.4 73.1 ± 0.0 72.4 ± 5.9 75.1 ± 1.2 82.1 ± 0.8
Myket 62.9 ± 1.8 61.7 ± 1.5 61.3 ± 2.1 56.6 ± 0.4 44.8 ± 0.3 51.1 ± 0.0 57.6 ± 2.2 58.7 ± 0.2 44.4 ± 1.7
Social 87.8 ± 5.3 91.3 ± 0.8 93.0 ± 2.6 94.9 ± 0.3 86.4 ± 0.1 80.5 ± 0.0 96.3 ± 0.2 95.4 ± 0.1 97.6 ± 0.1
Reddit 79.8 ± 0.4 78.9 ± 0.9 80.4 ± 0.6 78.4 ± 1.0 80.9 ± 0.5 73.6 ± 0.0 76.3 ± 0.6 77.2 ± 0.5 82.6 ± 0.8
Table 20: Test average precision scores for link forecasting and link prediction averaged over five runs with standard deviations on discrete-time temporal graphs.
Eval Dataset JODIE DyRep TGN TGAT CAWN EdgeBank TCL GraphMixer DyGFormer
Forec.
UN V. 53.3 ± 1.2 50.6 ± 1.5 49.9 ± 4.4 52.5 ± 1.4 52.6 ± 1.9 84.1 ± 0.0 52.4 ± 0.9 54.1 ± 1.4 60.0 ± 2.0
US L. 46.0 ± 0.9 62.5 ± 3.6 58.6 ± 2.4 71.0 ± 8.9 80.7 ± 3.7 63.2 ± 0.0 77.5 ± 4.3 86.5 ± 1.9 86.1 ± 1.0
UN Tr. 52.8 ± 3.1 49.6 ± 0.8 53.3 ± 1.7 59.1 ± 2.7 59.2 ± 1.7 79.0 ± 0.0 57.5 ± 1.9 65.8 ± 1.9 67.1 ± 2.7
Can. P. 52.3 ± 0.6 59.9 ± 6.5 69.7 ± 1.5 70.5 ± 1.8 66.6 ± 2.1 58.0 ± 0.0 67.0 ± 1.9 81.4 ± 0.5 82.2 ± 3.2
Flights 65.2 ± 2.6 63.4 ± 2.6 68.3 ± 2.2 73.5 ± 0.3 64.7 ± 0.8 70.3 ± 0.0 71.0 ± 0.4 71.9 ± 0.8 68.8 ± 2.0
Cont. 90.2 ± 5.2 95.1 ± 0.6 95.7 ± 1.0 96.0 ± 0.4 85.2 ± 0.2 88.7 ± 0.0 95.4 ± 0.6 93.5 ± 0.1 97.9 ± 0.1
Pred.
UN V. 69.7 ± 1.5 70.5 ± 1.1 64.4 ± 6.5 51.6 ± 1.3 50.2 ± 1.4 84.5 ± 0.0 52.9 ± 1.4 53.3 ± 1.7 58.2 ± 1.5
US L. 48.0 ± 1.0 73.8 ± 2.4 81.9 ± 2.3 70.9 ± 8.7 80.7 ± 3.6 63.2 ± 0.0 77.6 ± 4.4 85.8 ± 2.0 85.6 ± 1.2
UN Tr. 58.7 ± 3.7 58.9 ± 1.7 58.9 ± 1.5 57.4 ± 2.6 57.4 ± 2.3 81.0 ± 0.0 55.6 ± 1.5 64.1 ± 1.7 64.9 ± 2.6
Can. P. 52.1 ± 0.4 64.0 ± 1.7 71.5 ± 1.8 70.4 ± 1.7 66.5 ± 2.4 59.7 ± 0.0 67.0 ± 1.9 81.2 ± 0.4 98.0 ± 0.5
Flights 66.7 ± 3.6 66.9 ± 1.9 68.9 ± 2.0 73.5 ± 0.3 64.2 ± 1.0 70.3 ± 0.0 71.0 ± 0.4 71.9 ± 0.8 69.3 ± 2.0
Cont. 92.2 ± 2.3 95.8 ± 0.4 96.4 ± 0.9 96.0 ± 0.4 85.2 ± 0.2 88.7 ± 0.0 95.0 ± 1.0 93.5 ± 0.1 98.0 ± 0.0