All content in this area was uploaded by Alejandro Vaisman on May 08, 2021
A Model and Query Language for Temporal
Graph Databases
Preprint of the VLDB Journal paper.
DOI: https://doi.org/10.1007/s00778-021-00675-4
Ariel Debrouvier1, Matías Perazzo1, Eliseo Parodi1, Valeria Soliani1,2, and
Alejandro Vaisman1
1Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina
{ndebrouvier,maperazzo,eparodi,avaisman,vsoliani}@itba.edu.ar
2Hasselt University, Data Science Institute, Databases and Theoretical Computer
Science Research Group
Abstract. Graph databases are becoming increasingly popular for modeling
different kinds of networks for data analysis. Graph databases are
built over the property graph data model. In this model, nodes and edges
are annotated with property-value pairs. Although the work of researchers
and practitioners is based on graphs where the temporal dimension is not
considered, time is present in most real-world problems. Many different
kinds of changes may occur in a graph as the world it represents
evolves across time; for instance, edges, nodes, and properties can be
added and/or deleted, and property values can be updated. This paper
addresses the problem of modeling, storing, and querying temporal property
graphs, which allows keeping the history of a graph database.
The work in this paper is based on a model where nodes and relationships
contain attributes (properties), which are timestamped with a validity
interval, and graphs are heterogeneous, that is, relationships may
be of different kinds. Thus, queries like “Give me the friends of the
friends of Mary, who lived in Brussels at the same time as her, and
also give me the periods when this happened” can be expressed. Associated with this model,
a powerful graph query language, denoted T-GQL, is presented, together
with a collection of algorithms for computing different kinds of temporal
paths in a graph, capturing different temporal path semantics. A Neo4j-
based implementation of the above is also presented, and a client-side
interface allows submitting queries in T-GQL to a Neo4j server. Finally,
experiments were carried out over synthetic and real-world datasets.
Keywords: Temporal Graph Databases · Neo4j · Query Languages · Shortest
Path · Cypher Query Language.
1 Introduction and Motivation
Property Graphs [3, 17, 26] are becoming increasingly popular, particularly for
modeling different kinds of networks for data analysis. The property graph data
2 A. Debouvier et al.
model underlies most graph databases in the marketplace [1]. Examples of graph
databases and graph processing frameworks based on this model are Neo4j3,
Janusgraph4, and GraphFrames [10]. Typically, the work of researchers and
practitioners is based on graphs where the temporal dimension is not considered,
called static graphs from here on. However, time is present in most real-world
applications, and graphs are no exception. Many different kinds of changes may
occur in a property graph as the world it represents evolves over time: edges,
nodes, and properties can be added and/or deleted, and property values can be
updated, to mention the most relevant ones. For instance:
– (a) In phone call networks, where each vertex represents a person (or a
phone number), and an edge (u, v, t, λ) tells that u calls v at time t, with
duration λ, new nodes and edges are added frequently, and the properties
of u or v may also change over time.
– (b) In social networks (e.g., Facebook, Twitter), each vertex models a person
(or an organization, etc.), and an edge (u, v, t, λ) represents a relationship
between two persons u and v (e.g., u follows v, u is a friend of v) starting at time t
and lasting λ (u was a friend of v during an interval whose duration is λ).
– (c) In transportation networks, each vertex in a graph represents a location,
and an edge (u, v, t, λ) represents a road segment, a street, or a highway
segment, existing from u to v since time t, whose interval of existence is λ.
– (d) In transportation schedules, each vertex in a graph represents a location,
and an edge (u, v, t, λ) is a trip (flight, bus, etc.) from u to v departing at
time t, whose duration is λ.
Ignoring the time dimension could lead to incorrect results, or prevent in-
teresting analysis possibilities. For example, in case (b), it may be relevant to
know the interval of the relationships that occur in a social network, to weight
their strength, or to know chains of relationships that occurred simultaneously.
As another example, a user may be interested in asking for “People who were
still being Nutella fans while they were living outside Italy,” or “Friends of Mary
while she was working at the University of Antwerp.” Those are queries that
could not be answered without accounting for time. As another kind of problem,
note that in case (c) above, the shortest (or fastest) way to reach one city from
another varies with time, since a segment belonging to the shortest path
may not have existed in the past. Thus, for example, a transportation analyst
may ask for “Time saved for going from Buenos Aires to Pinamar after the con-
struction of Highway Number 11.” Further, this can be stated as a hypothetical
query [6, 14], asking for the fastest way to reach a city in case a new highway is
built.
The literature on temporal graphs is relatively limited, and mostly addresses
path problems, particularly for scenarios like (a) and (d) above (see
Section 2 for a discussion). As far as the authors are aware, problems along
the lines of temporal database theory [28] have not been addressed yet.
3 http://www.neo4j.com
4 http://janusgraph.org/
Temporal property graph-based data models, query languages, algorithms, and
even a study of the problems that can be solved with this approach are still
open fields of study, and this work tackles them.
1.1 Contributions
This paper studies how temporal databases concepts can be applied to graph
databases, in order to be able to model, store, and query temporal graphs, in
other words, to keep the history of a graph database. The work presented here
is based on the property graph data model. This is not the case of most existing
work on the topic (e.g., [30, 32]), where only edges are timestamped with the
initial time of the relationship they represent, and the duration of such
relationship. Further, those works account just for homogeneous graphs (i.e., graphs
where only one kind of relationship exists). In the model presented here, nodes
and relationships contain attributes (properties), which are timestamped with a
validity interval, and graphs are heterogeneous, that is, relationships may
be of different kinds. These graphs are called Interval-labeled Property Graphs
(ILPG) in this paper. This allows richer queries, like “Give me the friends of the
friends of Mary, who lived in Brussels at the same time as her”. Nevertheless,
the model presented in this paper also captures the semantics of the mentioned
works. For this, two path semantics are supported: Continuous path semantics,
defined along the lines of the work by Rizzolo and Vaisman [25], and Consec-
utive Path Semantics. Both semantics, and their implementation, are discussed
in detail. More concretely, the contributions of this work are:
– A temporal graph data model for property graphs, which allows keeping the
history of nodes, edges, and properties.
– A high-level graph query language, denoted T-GQL, based on GQL [15]
(standing for Graph Query Language), the standard language for property
graph databases being defined by the graph database community at the time
of writing this paper.
– A collection of algorithms for computing different kinds of temporal paths
in a graph, capturing different temporal path semantics.
– A Neo4j-based implementation of the above, together with a client interface
for querying Neo4j graphs.
– A collection of experiments over the implementation, over two use cases
which capture the two semantics studied in this work: a synthetic dataset of
a social network, and a real-world dataset of flights between airports.
1.2 Paper Organization
This paper is organized as follows. Section 2 reviews related work, in order to put
the present work in context. Section 3 introduces the temporal property graph
data model that will be used in the paper. Section 4 presents and discusses
T-GQL, the high-level query language proposed for the model, and Section 5
presents the implementation details. Section 6 reports preliminary experimental
results, and Section 7 concludes the paper.
2 Related Work
This section reviews related work, starting from traditional (non-temporal)
property graphs, and then moving on to the few existing works on temporal graphs.
These existing proposals are compared against the work presented here.
2.1 Graph database models
There is an extensive bibliography on graph database models, comprehensively
studied in [1, 4]. The interested reader is referred to these works for details. Mul-
tiple native graph indexing methods and query languages (e.g., GraphQL [19])
were developed to efficiently answer graph-oriented queries. In real-world prac-
tice, two graph database models are used:
(a) Models based on RDF,5 oriented to the Semantic Web.
(b) Models based on Property Graphs.
Models of type (a) represent data as sets of triples where each triple consists
of three elements that are referred to as the subject, the predicate, and the
object of the triple. These triples allow describing arbitrary objects in terms of
their attributes and their relationships to other objects. Informally, a collection
of RDF triples is an RDF graph. Although the models in (a) have a general
scope, RDF graphs aim at representing metadata on the Web. Therefore, an
important feature of RDF-based graph models is that they follow a standard,
which is not yet the case for the other graph databases. In the property graph
data model [3, 2], nodes and edges are labeled with a sequence of (attribute,
value)-pairs. Extending traditional graph models, Property Graphs are the usual
choice in modern graph databases used in real-world practice. Hartig [17, 18]
proposes a formal way of reconciling both models, through a collection of well-
defined transformations between Property Graphs and RDF graphs. He shows
that Property Graphs could, in the end, be queried using SPARQL, the standard
query language for the Semantic Web. This is also studied in [5, 29].
2.2 Data models for temporal graphs
Data models in the temporal graphs literature can be classified in three groups:
– (a) Duration-labeled temporal graphs (DLTG)
– (b) Interval-labeled temporal graphs (ILTG)
– (c) Snapshot-based temporal graphs (SBTG)
Graphs of type (a) are typically proposed for the phone calls and travel
scheduling problems described above. Graphs of type (b) are more appropri-
ate than the former ones, to capture the history of the relationships in social
networks. Graphs of type (c) are based on the notion of snapshot temporal
databases, where a temporal database is seen either as a sequence of snapshots,
or a sequence composed of an initial database and a sequence of incremental
updates. These models are discussed next.
5 https://www.w3.org/RDF/
[Figure 1 image: two drawings of the same graph over nodes a, b, c, f, g, h, i, j, k, l; edges carry timestamps in the left drawing and intervals in the right drawing.]
Fig. 1. Left: A duration-labeled temporal graph (cf. [30]); Right: An Interval-labeled
graph for the graph on the left.
2.2.1 Duration-labeled temporal graphs These kinds of graphs are stud-
ied by Wu et al. [30]. In these graphs, a node is represented as a string (i.e.,
nodes are not annotated with properties), and the edges are labeled with a value
representing a duration of the relationship between two nodes. Based on this
work, the same authors have elaborated different proposals [20, 31–34]. All of
them address the previously mentioned kinds of graphs. Definition 1 formally
explains the above description.
Definition 1 (Duration-labeled graphs (cf. [30])). Let $G_d = (V, E)$ be a
temporal graph, where $V$ is the set of vertices, and $E$ is the set of edges in $G_d$.
– Each edge $e = (u, v, t, \lambda) \in E$ is a temporal edge representing a relationship
from a vertex $u$ to another vertex $v$ starting at time $t$, with a duration $\lambda$.
For any two temporal edges $(u, v, t_1, \lambda_1)$ and $(u, v, t_2, \lambda_2)$, $t_1 \leq t_2$.
– Each node $v \in V$ is active when there is a temporal edge that starts or ends
at $v$.
– $d(u, v)$: the number of temporal edges from $u$ to $v$ in $G_d$.
– $E(u, v)$: the set of temporal edges from $u$ to $v$ in $G_d$, i.e.,
$E(u, v) = \{(u, v, t_1), (u, v, t_2), \ldots, (u, v, t_{d(u,v)})\}$.
– $N_{out}(v)$ or $N_{in}(v)$: the set of out-neighbours or in-neighbours of $v$ in $G_d$,
i.e., $N_{out}(v) = \{u : (v, u, t) \in E\}$ and $N_{in}(v) = \{u : (u, v, t) \in E\}$.
– $d_{out}(v)$ or $d_{in}(v)$: the temporal out-degree or in-degree of $v$ in $G_d$, defined as
$d_{out}(v) = \sum_{u \in N_{out}(v)} d(v, u)$ and $d_{in}(v) = \sum_{u \in N_{in}(v)} d(u, v)$.
Graphs defined in this way are called Duration Labeled. The left-hand side of
Figure 1 shows an example, where, for simplicity, $\lambda = 1$. □
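The quantities in Definition 1 can be illustrated with a minimal Python sketch. The edge list and the function name below are assumptions made for illustration only; they do not come from [30] or from Figure 1.

```python
from collections import defaultdict

# Hypothetical duration-labeled edges (u, v, t, lam):
# a relationship from u to v starting at time t and lasting lam.
EDGES = [("a", "b", 1, 1), ("a", "b", 2, 1), ("a", "c", 4, 1), ("b", "g", 3, 1)]

def temporal_out_degrees(edges):
    """Return (d, n_out, d_out) in the sense of Definition 1."""
    d = defaultdict(int)      # d[(u, v)]: number of temporal edges from u to v
    n_out = defaultdict(set)  # n_out[v]: out-neighbours of v
    for u, v, _t, _lam in edges:
        d[(u, v)] += 1
        n_out[u].add(v)
    # d_out(v) is the sum of d(v, u) over all out-neighbours u of v
    d_out = {v: sum(d[(v, u)] for u in nbrs) for v, nbrs in n_out.items()}
    return dict(d), dict(n_out), d_out

d, n_out, d_out = temporal_out_degrees(EDGES)
print(d[("a", "b")], sorted(n_out["a"]), d_out["a"])
```

Note that the temporal out-degree counts temporal edges, not distinct neighbours, so a node with two neighbours can have an out-degree of three.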
As mentioned, the main use of this kind of temporal graphs is, for example, for
scheduling problems, where usually some sort of shortest path must be computed.
Therefore, the works around this model propose “temporal” variants of the well-
known Dijkstra’s algorithm [11]. In [30] (and the sequels of this work, referred
above), the authors also define four different forms of “shortest” paths. These
are called here minimum temporal paths, and account for different measures: (1)
Earliest-arrival path, defined as a path that results in the earliest arrival time
starting from a source x to a target y; (2) Latest-departure path, defined as a
path that gives the latest departure time starting from x in order to reach y
at a given time; (3) Fastest path, defined as the path that goes from x to y in
the minimum elapsed time; and (4) Shortest path, defined as the path that is
shortest from x to y in terms of overall traversal time along the edges.
2.2.2 Interval-labeled temporal graphs There are two main approaches
in the temporal databases literature [28] for keeping the history of a database:
tuple- or attribute-timestamping, where a temporal label is defined over the
database objects; and database versioning, where different versions of a database
are created at different time instants. The latter is described below in this section.
The former is discussed next. Definition 2 below characterizes interval-labeled
temporal graphs (ILTG). From this general definition, different constraints can
be stated, leading to different models, such as the one introduced later in this paper,
based on the work by Campos et al. [7] (see Section 3), the first approach to
apply the ILTG notion to property graphs. Valid time is considered in the
remainder, that is, the times when the edges are valid in the real world, as opposed
to transaction time, which reflects the time when the information is stored in
the database.
Definition 2 (Interval-labeled temporal graphs). Let $G_d = (V, E)$ be a
temporal graph, where $V$ is the set of vertices, and $E$ is the set of edges in
$G_d$. An Interval-labeled Temporal Graph is a temporal graph where each edge
$e = (u, v, I) \in E$ is a temporal edge representing a relationship from a vertex $u$
to another vertex $v$, valid during a time interval $I = [t_s, t_e]$. □
The right-hand side of Figure 1 shows a graph equivalent to the one on the
left of the figure, but where edges are labeled with their validity interval instead
of a timestamp with an associated duration. In the ILTG on the right-hand side of
Figure 1, for example, the edge between nodes b and g is labeled with the interval
[3,4]. This is because the same edge, on the left-hand side of the
same figure, is labeled 3, representing the initial time of the edge, with a duration
of 1. That is, if the graph represents a bus schedule, the bus leaves from b
at time instant 3, and the trip between b and g takes one time unit.
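The relabeling described above is a simple arithmetic mapping, sketched below in Python. The function names are hypothetical; the only fact taken from the text is that an edge starting at time 3 with duration 1 becomes the interval [3,4].

```python
def to_interval_edge(u, v, t, lam):
    """Map a duration-labeled edge (u, v, t, lam) to an interval-labeled edge."""
    return (u, v, (t, t + lam))

def to_duration_edge(u, v, interval):
    """Inverse mapping: recover the start time and duration from the interval."""
    ts, te = interval
    return (u, v, ts, te - ts)

# The edge between b and g discussed above: start time 3, duration 1.
print(to_interval_edge("b", "g", 3, 1))
```

Applying one function after the other returns the original edge, which is the intuition behind treating both encodings as carrying the same information.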
Example 1. The path traversal times defined in Section 2.2.1 are also valid in
this representation. Consider, for example, the computation of the earliest arrival
time from node a to every node in the graph, in the interval [1,4]. The algorithm
proposed in [30] gives as a result eat(b) = 2, eat(g) = 4, eat(h) = 4, and
eat(f) = 4. Obviously, this can also be computed with the interval-labeled graph.
It is easy to see that, for instance, eat(g) = 4, with path ⟨(a, b, [1,2]), (b, g, [3,4])⟩,
since ⟨(a, b, [2,3]), (b, g, [3,4])⟩ cannot be used because the arrival time at b is equal
to the departure time from b to g. □
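As a hedged illustration of how earliest arrival times can be obtained from interval-labeled edges, the sketch below scans the edges once in order of departure time. It is written in the spirit of the one-pass approach of [30], not as a reproduction of it. Two assumptions are made: every edge has a positive duration, and, following Example 1, an edge can only be taken if it departs strictly after the arrival at its origin node (at the source, any departure at or after the start of the query interval is allowed). The edge list is hypothetical and is not the graph of Figure 1.

```python
import math

# Hypothetical interval-labeled edges (u, v, (ts, te)) with te > ts.
EDGES = [
    ("a", "b", (1, 2)), ("a", "b", (2, 3)), ("b", "g", (3, 4)),
    ("b", "h", (2, 3)), ("g", "k", (5, 6)),
]

def earliest_arrival(edges, source, t_start, t_end):
    """Earliest arrival time from `source` to each reachable node in [t_start, t_end]."""
    eat = {source: t_start}
    # Edges are scanned by increasing departure time, so eat[u] is final
    # for every arrival that happens before the current departure.
    for u, v, (ts, te) in sorted(edges, key=lambda e: e[2][0]):
        if u not in eat or te > t_end:
            continue
        # Assumption from Example 1: departure must be strictly after arrival.
        ready = ts >= t_start if u == source else ts > eat[u]
        if ready and te < eat.get(v, math.inf):
            eat[v] = te
    return eat

print(earliest_arrival(EDGES, "a", 1, 4))
```

In this data, node h is not reached: its only incoming edge leaves b at time 2, which is exactly the arrival time at b, the same situation that rules out the second path in Example 1.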
The discussion above gives the intuition that ILTGs and DLTGs are equiv-
alent, in the sense that both allow representing the same information using
different encodings for time (the proof is outside the scope of this paper). ILTGs
appear to be, at first sight, more appropriate than DLTGs to support classic
temporal queries, for example, the ones asking for the history of relationships in
a social network. At the same time, travel schedules and mobility data can also
be modeled in this way, as Example 1 shows. Moreover, current graph databases
are based on the property graph data model, which is not supported in the work
by Wu et al. Therefore, the data model in the present paper, along with its
accompanying query language, is based on interval-labeled property graphs.
2.2.3 Temporal graphs as a sequence of snapshots The work by Se-
mertzidis and Pitoura [27] aims at finding the most persistent matches of an
input pattern in the evolution of graph networks. The authors assume that the
history of a node-labeled graph is given in the form of graph snapshots
corresponding to the state of the graph at different time instants. Given a query graph
pattern P, the work addresses the problem of efficiently finding those matches of
P in the graph history that persist over time, that is, those matches that exist
for the longest time, either contiguously (i.e., in consecutive graph snapshots)
or collectively (i.e., in the largest number of graph snapshots). These queries
are called graph pattern queries. Locating durable matches in the evolution of
large graphs has many applications, for example, finding long-term collaborations
between researchers or durable relationships in social networks. In [27],
a temporal graph is defined as follows, which characterizes the third category of
temporal graphs introduced above.
Definition 3 (Sequence-snapshot temporal graph). A temporal graph
$G[t_i, t_j]$ in a time interval $[t_i, t_j]$ is a sequence $\{G_{t_i}, G_{t_{i+1}}, \ldots, G_{t_j}\}$ of graph
snapshots.
Using graphs in this category, Huo and Tsotras [21] study the problem of
efficiently computing shortest-paths on evolving social graphs. In this work,
shortest-path queries are “temporal” in the sense that they can refer to any
time-point or time-interval in the graph’s evolution, and corresponding valid an-
swers are returned. The authors extend the traditional Dijkstra’s algorithm [11]
to compute shortest-path distance(s) for a time-point or a time-interval. Dif-
ferent from traditional studies of shortest-path queries on a single graph, their
main goal is to efficiently answer temporal shortest-path queries within the social
graph’s evolving history. Such temporal queries can be viewed as being issued on
certain historical graph snapshot(s). For example, temporal shortest-path queries
in a social network can discover how close two given users were in the past and
how their closeness evolved over time. The authors define a temporal graph as
an initial snapshot, followed by updates. Finally, several different kinds of path
queries are defined. A time point shortest path query $TPSP(TEG, t_q, v_s, v_t)$
returns the shortest path $p$ from a source node $v_s$ to a target node $v_t$, such that
both nodes are temporally valid at query time $t_q$, and all edges in $p$ are valid at
query time $t_q$. Analogously, a time interval shortest path ‘all’ query returns a set
of paths $P$ such that each path $p_i \in P$ is associated with a time interval $\Delta p_i$,
and there is no other path shorter than $p_i$ from $v_s$ to $v_t$ during $\Delta p_i$.
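The idea behind a time point shortest path query can be sketched as follows: keep only the edges valid at the query time, then search the resulting static graph. The sketch below is an illustration under stated assumptions, not the method of [21]: it uses breadth-first search over unweighted edges instead of an extended Dijkstra's algorithm, it assumes that every node is valid at the query time, and the edge list and function name are hypothetical.

```python
from collections import deque

# Hypothetical directed edges (u, v, (ts, te)) with closed validity intervals.
EDGES = [("s", "x", (1, 5)), ("x", "t", (1, 5)), ("s", "t", (6, 9))]

def time_point_shortest_path(edges, source, target, tq):
    """Fewest-edges path from source to target using only edges valid at tq."""
    adj = {}
    for u, v, (ts, te) in edges:
        if ts <= tq <= te:                 # keep the edge only if valid at tq
            adj.setdefault(u, []).append(v)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:        # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None                            # target unreachable at tq

print(time_point_shortest_path(EDGES, "s", "t", 3))
print(time_point_shortest_path(EDGES, "s", "t", 7))
```

The same pair of nodes gets a two-edge path at time 3 and a direct edge at time 7, which is the kind of evolution in closeness that such queries are meant to expose.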
2.2.4 Other work on temporal graphs In the temporal graph model pre-
sented in [8, 9], temporal data are organized in so-called frames, namely the finest
unit of temporal aggregation. A frame is associated with a time interval and
allows retrieving the status of the social network during such interval. This model
does not allow registering changes in attributes of the nodes. Also, frame nodes
may become associated with a large number of edges. Redundant data are also a
problem since each frame is connected to all the existing data, so a frequently
changing graph would become full of redundant connections.
Khurana and Deshpande [22, 23] study methods to efficiently query historical
graphs. They focus on the particular problem of querying the state of a network
as of a certain point (snapshot) in time. The work is based on versioning. Ba-
sically, the current graph and a series of deltas containing the graph variation
over time are stored.
Among other works related to temporal graphs, Han et al. [16] present an
engine for temporal graph mining, and Kostakos [24] shows the use of temporal
graphs to represent dynamic events.
3 Data Model for Interval-labeled Property Graphs
The problem of extending property graphs in order to keep the history of the
components of the model is addressed next. Property graphs are graphs such that
their nodes and edges are labeled with a collection of (property,value) pairs.
These properties can evolve over time, therefore, in order to keep the history
of the graph, the data model must not only account for the changes in the
relationships and the nodes, but also for the changes in the properties. A first
approach for this was presented by Campos et al. [7]. The definition below is
based on such work.
Definition 4 (Temporal property graph). A temporal property graph is a
structure $G(N_o, N_a, N_v, E)$ where $G$ is the name of the graph, $E$ is a set of edges,
and $N_o$, $N_a$, and $N_v$ are sets of nodes, denoted object nodes, attribute nodes,
and value nodes, respectively. Every object and attribute node, and every edge
in the graph is associated with a tuple (name, interval). The name represents
the content of the node (or the name of the relationship), and the interval
represents the period(s) in which the node is (was) valid and it is a temporal
element. Analogously, value nodes are associated with a (value, interval) pair.
For any node n, the elements in its associated pair are referred to as n.name,
n.interval, and (for value nodes) n.value. In addition, nodes and edges in G
satisfy the constraints in Definition 6 below. As usual in temporal databases, a
special value Now is used to represent that the node is currently valid. Object
nodes also have an identifier, denoted id. □
In the definition above, object nodes represent entities (e.g., Person), edges
represent relationships between object nodes (e.g., LivesIn, FriendOf), and attribute
nodes describe entities (e.g., Name). Finally, value nodes represent the value
of an attribute (e.g., Mary). To illustrate this in more detail, the first running
example that will be used in this paper is presented next.
Example 2 (Data model). The data model in Definition 4 is used to represent the
social network depicted in Figure 2. There are three kinds of object nodes, namely
Person, City, and Brand. There are also three types of temporal relationships:
LivedIn, Friend, and Fan. The first one is labeled with the periods when someone
lived somewhere. The second one is labeled with the periods when two people
were friends. The temporal semantics of the relationship Fan is similar. For
example, there is an edge of type Fan, joining nodes 14 (a Person node) and
70 (a Brand node), indicating that Mary Smith has been a Samsung fan since 1982. The
attribute node Name represents the name associated with a Person node, and
it is also temporal. The actual value of the attribute node is represented as a
value node, e.g., the node in green with id=34 and value “Mary Smith”. Note
that this value changes to “Mary Smith-Taylor”, showing the temporality of
the attribute node Name. Finally, for clarity, if a node is valid throughout the
complete history, the temporal labels are omitted. □
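The three kinds of nodes in Definition 4 can be mirrored by a small in-memory structure. The Python sketch below is only an illustration: the class and function names are assumptions, the years are hypothetical (the intervals of Figure 2 are not reproduced here), and it is unrelated to the Neo4j-based implementation described later in the paper.

```python
from dataclasses import dataclass, field

NOW = float("inf")  # stand-in for the special value Now

@dataclass
class ValueNode:
    value: str
    interval: list  # temporal element: list of (start, end) pairs

@dataclass
class AttributeNode:
    name: str
    interval: list
    values: list = field(default_factory=list)

@dataclass
class ObjectNode:
    id: int
    name: str
    interval: list
    attributes: list = field(default_factory=list)

def value_at(attribute, t):
    """Return the value an attribute node held at instant t, or None."""
    for node in attribute.values:
        if any(ts <= t <= te for ts, te in node.interval):
            return node.value
    return None

# Hypothetical years; only the two names and the id 14 come from Example 2.
name = AttributeNode("Name", [(1960, NOW)], [
    ValueNode("Mary Smith", [(1960, 1990)]),
    ValueNode("Mary Smith-Taylor", [(1991, NOW)]),
])
mary = ObjectNode(14, "Person", [(1960, NOW)], [name])
print(value_at(name, 1985), "|", value_at(name, 2000))
```

Keeping each value in its own node with its own interval is what lets the history of the attribute be read back at any instant.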
Before introducing the temporal graph’s constraints, some notation is needed.
In Definition 6 below, an edge is denoted by $e\{n_a, n_b\}$ where $n_a$ and $n_b$ are nodes
connected by the edge $e$. An attribute node will be represented as $n_a\{n\}$ where
$n$ is the object node connected to $n_a$. A value node is denoted $n_v\{n_a\}$ where $n_a$
is the attribute node connected to $n_v$. Also, the following definition is needed.
Definition 5 (Lifespan of an edge). Consider a node $n$, and a collection of
$k$ edges outgoing from $n$, $E_{out_i}$, $i = 1, \ldots, k$, such that $E_{out_i}.name$ is the same for
all $E_{out_i}$. Also, let $E_{in_j}$, $j = 1, \ldots, m$, be the set of $m$ edges with the same name
incoming to node $n$. The union of the temporal labels of all these edges is called
the lifespan of $n$, denoted $l(n)$. □
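Fig. 2. A temporal graph and its different kinds of nodes.

Definition 6 (Constraints). For the graph in Definition 4, the following
constraints hold:
1. $\forall n, n' \in N_o,\ n = n' \lor n.id \neq n'.id$
2. $\forall n, n' \in N_a,\ n = n' \lor n.id \neq n'.id$
3. $\forall n, n' \in N_v,\ n = n' \lor n.id \neq n'.id$
4. $\forall n_v\{n_a\}, n'_v\{n_a\} \in N_v,\ n_v = n'_v \lor n_v.value \neq n'_v.value$
5. $\forall e_i\{n, n'\}, e_j\{n, n'\} \in E \land e_i.name = e_j.name,\ e_i = e_j \lor e_i.name \neq e_j.name$
6. $\forall n \in N_o,\ e\{n, n'\} \in E \Rightarrow n' \in N_o \cup N_a$
7. $\forall n \in N_a,\ e\{n, n'\} \in E \Rightarrow n' \in N_o \cup N_v$
8. $\forall n \in N_v,\ e\{n, n'\} \in E \Rightarrow n' \in N_a$
9. $\forall n \in N_a\ (\exists n_o \in N_o\ \exists e \in E\ (e(n_o, n) \land (\nexists n' \in (N_a \cup N_v \cup N_o) \land e' \in E \land e'\{n', n\})))$
10. $\forall n \in N_v\ (e\{n', n\} \land n' \in N_a) \Rightarrow \nexists n'' \in (N_a \cup N_v \cup N_o)\ (e''\{n'', n\} \in E \lor e''\{n, n''\} \in E)$
11. $\forall n_e\{n, n'\} \in N_e,\ n_e.interval \subset n.interval \cap n'.interval$
12. $\forall n_a\{n\} \in N_a,\ n_a.interval \subset n.interval$
13. $\forall n_v\{n_a\} \in N_v,\ n_v.interval \subset n_a.interval$
14. $\forall n_v\{n_a\}, n'_v\{n_a\},\ n_v \neq n'_v,\ n_v.interval \cap n'_v.interval = \emptyset$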
Constraints 1 through 3 state that all nodes in the graph have a different id.
Constraint 4 requires coalescing all nodes with the same value associated with the
same attribute node; thus, the interval becomes a temporal element which includes
all periods where the node had such value. Analogously, Constraint 5 applies to
edges: all edges with the same name (i.e., representing the same relationship type),
between the same pair of nodes, are coalesced. Constraints 6 through 8 state how
the nodes must be connected, namely: (a) An object node can only be connected
to an attribute node or to another object node; (b) Attribute nodes can only be
connected to non-attribute nodes; and (c) Value nodes can only be connected to
attribute nodes. The cardinalities of these connections are stated by Constraints 9
and 10: attribute nodes must be connected by only one edge to an
object node, and value nodes must only be connected to one attribute node with
one edge. Constraints 11 to 14 restrict the values of the interval property. □

Fig. 3. Continuous paths.
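The coalescing required by Constraints 4 and 5, and the disjointness required by Constraint 14, can be sketched with two small Python helpers. They are illustrative only: the function names are hypothetical, intervals are closed pairs of integers, and adjacent intervals are merged under the assumption of a discrete time domain with granularity 1.

```python
def coalesce(intervals):
    """Merge overlapping or adjacent closed integer intervals into a temporal element."""
    merged = []
    for ts, te in sorted(intervals):
        # Assumption: granularity 1, so [2,3] and [4,7] are adjacent and merge.
        if merged and ts <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], te))
        else:
            merged.append((ts, te))
    return merged

def pairwise_disjoint(intervals):
    """Check, in the style of Constraint 14, that no two intervals share an instant."""
    ordered = sorted(intervals)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))

print(coalesce([(1, 3), (2, 5), (8, 9)]))
print(pairwise_disjoint([(1, 3), (4, 6)]), pairwise_disjoint([(1, 3), (3, 6)]))
```

A coalesced temporal element may still contain several disjoint periods, which is why the interval of a node or edge is treated as a temporal element rather than a single interval.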
3.1 Continuous Path
In ILTGs, queries often ask for paths that are valid
continuously during a certain interval. This requirement is captured by the notions
of continuous path and maximum continuous path [25]. Definition 7 introduces
these concepts.
Definition 7 (Continuous Path). Given a temporal property graph $G$
(interval-labeled), a continuous path (cp) with interval $T$ from node $n_1$ to node $n_k$,
traversing a relationship $r$, is a sequence $(n_1, \ldots, n_k, r, T)$ of $k$ nodes and an interval
$T$ such that there is a sequence of consecutive edges of the form $e_1(n_1, n_2, r, T_1)$,
$e_2(n_2, n_3, r, T_2)$, $\ldots$, $e_k(n_{k-1}, n_k, r, T_k)$, and $T = \bigcap_{i=1,k} T_i$. □
Example 3 (Continuous Path). Consider the graph depicted in Figure 3, where
$e_1(n_1, n_2, friend, [1,9])$, $e_2(n_2, n_3, friend, [2,3])$, $e_3(n_3, n_4, friend, [1,10])$,
$e_4(n_1, n_5, friend, [2,8])$, and $e_5(n_5, n_4, friend, [4,7])$. There are two continuous paths,
namely $(n_1, n_2, n_3, n_4, friend, [2,3])$ and $(n_1, n_5, n_4, friend, [4,7])$. That is, $n_4$
can be reached traversing the edges labeled friend from $n_1$ during the interval
[2,3] with a path of length 3, and during the interval [4,7] with a path of length
2. The interval when $n_4$ is continuously reachable from $n_1$ is obtained by taking
the union of both intervals, that is [2,7]. □
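The interval of a continuous path is the intersection of the intervals of its edges, which can be checked on the data of Example 3 with a few lines of Python. The function names are hypothetical, and intervals are represented as closed pairs.

```python
def intersect(a, b):
    """Intersection of two closed intervals, or None if they do not meet."""
    ts, te = max(a[0], b[0]), min(a[1], b[1])
    return (ts, te) if ts <= te else None

def continuous_interval(edge_intervals):
    """Interval T of a continuous path: the intersection of all edge intervals."""
    result = edge_intervals[0]
    for interval in edge_intervals[1:]:
        result = intersect(result, interval)
        if result is None:
            return None  # the edges are never all valid at the same time
    return result

# Edge intervals of the two paths from n1 to n4 in Example 3.
print(continuous_interval([(1, 9), (2, 3), (1, 10)]))  # path through n2 and n3
print(continuous_interval([(2, 8), (4, 7)]))           # path through n5
```

The two results are the intervals [2,3] and [4,7] reported in the example; their union over a discrete time domain gives [2,7].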
3.2 Pairwise Continuous Path
Requiring a path to be valid throughout a time interval is a strong condition for
a graph query. In many cases, querying temporal graphs requires a weaker notion
of temporal path. Consider for example the case of a social network like the one in
Figure 2. Also assume that there is a friendship relationship between a person $p_1$
and a person $p_2$ in an interval [2,7]. Also, $p_2$ was a friend of $p_3$ during the interval
[6,12], and $p_3$ was a friend of $p_4$ during the interval [10, Now]. It can be seen that
there is no continuous path from $p_1$ to $p_4$. However, the user may be interested
in a transitive friendship relationship such that there is an intersection in the
interval of two consecutive edges. In the example above, note that pairwise, such
intersection exists, e.g., there is an overlap between $(p_1, p_2, friend, [2,7])$ and
$(p_2, p_3, friend, [6,12])$, and between the latter and $(p_3, p_4, friend, [10, Now])$.
That is, although there is not a continuous path between $p_1$ and $p_4$, there
is a consecutive chain of pairwise temporal relationships in the path. This is
formalized by the notion of pairwise continuous path.
Definition 8 (Pairwise Continuous Path). Given a temporal property graph
$G$, there is a pairwise continuous path between two nodes $n_1$, $n_k$, through a
relationship $r$, if there is a sequence of edges $e_1(n_1, n_2, r, [ts_1, tf_1]), \ldots, e_k(n_{k-1}, n_k, r, [ts_k, tf_k])$,
such that $(ts_1 \leq ts_2 \leq tf_1 \lor ts_2 \leq tf_1 \leq tf_2) \land \ldots \land (ts_{k-1} \leq ts_k \leq tf_{k-1} \lor ts_k \leq tf_{k-1} \leq tf_k)$. □
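The condition in Definition 8 only compares each edge with the next one, so it can be checked with a single pass over the edge intervals. The Python sketch below transcribes the condition literally; the function names are hypothetical, and the special value Now is represented by infinity as an assumption.

```python
NOW = float("inf")  # stand-in for the special value Now

def pair_ok(first, second):
    """Condition of Definition 8 for two consecutive edge intervals."""
    (ts1, tf1), (ts2, tf2) = first, second
    return ts1 <= ts2 <= tf1 or ts2 <= tf1 <= tf2

def is_pairwise_continuous(edge_intervals):
    """True if every pair of consecutive edge intervals satisfies the condition."""
    return all(pair_ok(a, b) for a, b in zip(edge_intervals, edge_intervals[1:]))

# Friendship chain p1 - p2 - p3 - p4 from the discussion above.
print(is_pairwise_continuous([(2, 7), (6, 12), (10, NOW)]))
```

For the friendship chain the check succeeds even though the three intervals have no instant in common, which is exactly the difference with respect to a continuous path.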
3.3 Consecutive Paths
Figure 1 shows that DLTGs can also be represented as ILTGs. Therefore, the
queries in Section 2.2.1, e.g., asking for earliest or fastest arrival times in a
DLTG, require a different temporal semantics than the ones in Sections 3.1 and
3.2. Definition 9 introduces the notion of consecutive path.
Definition 9. A consecutive path $P_c$ traversing a relationship $r$ in a temporal
property graph $G$ is a sequence of temporal edges $P = ((n_1, n_2, r, [t_1, t_2]), \ldots, (n_{k-1}, n_k, r, [t_{k-1}, t_k]))$
where $(n_i, n_{i+1}, r, [t_i, t_{i+1}])$ is the $i$-th temporal edge in $P$ for $1 \leq i \leq k$, and
$t_{i-1} < t_i$ for $1 \leq i \leq k$. Instant $t_k$ is the ending time of $P$, denoted $end(P)$, and
$t_1$ is the starting time of $P$, denoted $start(P)$. The duration of $P$ is defined as
$dura(P) = end(P) - start(P)$, and the distance of $P$ as $dist(P) = k$. □
With the notion of consecutive path, several different temporal paths can be
defined, analogously to the paths for DLTGs described by Wu et al. in [30]. The
ones that will be studied in this paper are introduced by Definition 10.
Definition 10 (Types of consecutive paths). Let Gbe a temporal property
graph G, a relationship rin G, a source node ns, and a target node nt, both
in G;there is also a time interval [ts, te]. Let P(ns, nt, r, [ts, te]) = {P|Pis a
consecutive path from x to y such that start(P)≥ts, end(P)≤te}.The follow-
ing paths can be defined:
The earliest-arrival path (EAP) is the path that can be completed in a given
interval such that the ending time of the path is minimum. Formally,
EAP: P∈ P(ns, nt, r, [ts, te]) such that end(P) = min{end(P0) : P0∈ P(ns, nt,
r, [ts, te])}.
13
The latest-departure path (LDP) is the path that can be completed in a given
interval such that the starting time of the path is maximum. Formally,
LDP: $P \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])$ such that
$start(P) = \max\{start(P') : P' \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])\}$.
The fastest path (FP) is the path that can be completed in a given interval such that
its duration is minimum. Formally,
FP: $P \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])$ such that
$dura(P) = \min\{dura(P') : P' \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])\}$.
The shortest path (SP) is the path that can be completed in a given interval
such that its length is minimum. Formally,
SP: $P \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])$ such that
$dist(P) = \min\{dist(P') : P' \in \mathcal{P}(n_s, n_t, r, [t_s, t_e])\}$. □
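The four path types of Definition 10 can be illustrated with a minimal Python sketch. It is not the implementation described in this paper: it assumes that each temporal edge carries its own interval (start, end), that two edges are consecutive when the first one ends no later than the next one starts, and that the distance is the number of edges. The toy paths and all function names are hypothetical.

```python
from typing import List, Tuple

# A temporal edge as (source, target, start, end); times are plain integers here.
Edge = Tuple[str, str, int, int]
Path = List[Edge]

def is_consecutive(path: Path) -> bool:
    """Edges must chain node-to-node, and each must end no later than the next starts."""
    for (_, dst, _, arrive), (src, _, depart, _) in zip(path, path[1:]):
        if dst != src or arrive > depart:
            return False
    return bool(path)  # an empty path is not considered a valid path

def start(path: Path) -> int:
    return path[0][2]

def end(path: Path) -> int:
    return path[-1][3]

def dura(path: Path) -> int:
    return end(path) - start(path)

def dist(path: Path) -> int:
    return len(path)  # assumption: distance counted as the number of edges

def candidates(paths, ns, nt, ts, te):
    """The set P(ns, nt, r, [ts, te]) restricted to an explicit list of paths."""
    return [p for p in paths
            if is_consecutive(p) and p[0][0] == ns and p[-1][1] == nt
            and start(p) >= ts and end(p) <= te]

# Hypothetical toy data: four ways of going from A to C.
paths = [
    [("A", "B", 1, 3), ("B", "C", 4, 9)],
    [("A", "C", 5, 8)],
    [("A", "D", 2, 3), ("D", "C", 3, 6)],
    [("A", "E", 6, 7), ("E", "C", 8, 10)],
]
P = candidates(paths, "A", "C", 0, 10)
eap = min(P, key=end)    # earliest-arrival path
ldp = max(P, key=start)  # latest-departure path
fp = min(P, key=dura)    # fastest path
sp = min(P, key=dist)    # shortest path
```

Note that the minimum or maximum may be attained by several paths; `min` and `max` return only the first one found.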
Based on Definition 10, more kinds of paths can be defined to address practi-
cal problems. For example, for scheduling, a fastest path can be defined restricted
to the paths such that there is a minimum ‘waiting’ time between two consecu-
tive edges. Or, for phone fraud analysis, a path such that the time between two
consecutive edges is below a given threshold, can be computed.
4 T-GQL Syntax and Semantics
This section introduces T-GQL, a high-level query language for graph databases.
The language syntax is based on GQL [15], a standard being defined at the time
of writing this paper. The language has an SQL flavor, although it is based on
Cypher⁶, Neo4j’s high-level query language. The formal semantics of Cypher can
be found in [12, 13]. T-GQL also extends Cypher with a collection of functions,
whose implementation is explained in Section 5.
4.1 Basic Statements
The syntax of the language has the typical SELECT-MATCH-WHERE form. The
SELECT clause performs a selection over variables defined in the MATCH clause
(aliases are allowed). The MATCH clause may contain one or more path patterns
(of fixed or variable length) and function calls. The result of the query is always
a temporal graph (analogous to relational temporal databases theory), although
the query may not mention temporal attributes. This can be modified by the
SNAPSHOT operator, which allows retrieving the state of the graph at a certain
point in time. The basic syntax and semantics will be introduced using the
social network in Figure 2. Path functions implementing the consecutive path
semantics will be covered using a flight scheduling example.
⁶ https://neo4j.com/docs/cypher-manual/current/
Consider the query: “List the friends of the friends of Mary Smith-Taylor”.
This does not include temporal features, but allows introducing the basic T-GQL
syntax. Temporal capabilities are addressed in the next sections.
SELECT p2
MATCH (p1:Person) - [:Friend*2] -> (p2:Person)
WHERE p1.Name = 'Mary Smith-Taylor'
Note that this query just returns the object nodes (recall the model of Defini-
tion 4), which, for a final user, would not be useful. A variant to the query above
would select the name of the friends of friends of Mary as follows (an alias is
used in the query):
SELECT p2.Name as friend_name
MATCH (p1:Person) - [:Friend*2] -> (p2:Person)
WHERE p1.Name = 'Mary Smith-Taylor'
For returning all the paths, the wildcard operator ‘*’ is used. The expression
below returns the three paths of length 2 from the node representing Mary.
SELECT *
MATCH (p1:Person) - [:Friend*2] -> (:Person)
WHERE p1.Name = 'Mary Smith-Taylor'
The T-GQL language supports the three path semantics explained in pre-
vious sections: (a) Continuous path semantics; (b) Pairwise continuous path
semantics; (c) Consecutive path semantics. These semantics are implemented by
means of functions, which are included in a library of Neo4j plugins. To compute
temporal paths, two types of functions are defined: Coexisting and Consecutive.
Both receive two nodes as arguments. These are explained in the following sec-
tions.
Remark 1. Functions computing continuous and pairwise continuous paths do not
accept the wildcard ‘*’. That is, the length of the paths must be constrained by the
user. On the contrary, temporal functions computing consecutive paths (earliest,
fastest, etc.) do not support a limited search; therefore, ‘*’ must be used.
4.2 Continuous Path Queries
Query 1 requires the computation of all continuous paths of length 2, over the
social network running example. As Remark 1 mentions, the length of the con-
tinuous paths in a query must be explicitly specified.
Query 1 Compute the friends of the friends of each person, and the period during
which the relationship held throughout the whole path.
In Figure 2, for example, Cathy (person node 12) was a friend of Pauline (person
node 11) between 2002 and 2017. Also, Pauline was a friend of Mary (person
node 14) between 2010 and 2018. Thus, the answer to Query 1 will include the
path Mary → Pauline → Cathy, [2010, 2017], since the whole path was valid in
this interval (Definition 7). The query reads in T-GQL:
SELECT path
MATCH (n:Person), path = cPath((n) - [:Friend*2] -> (:Person))
In this case, a record is returned for each path. The modifiers SKIP and LIMIT
can be used, as in Cypher, to get a specific path or a range. For example, to get
the third path in the answer:
SELECT path
MATCH (n:Person), path = cPath((n) - [:Friend*2] -> (:Person))
SKIP 2
LIMIT 1
A continuous path search between two specific persons can also be performed,
as Query 2 shows.
Query 2 Find the continuous paths between Mary Smith-Taylor and Peter Burton
with a minimum length of two and a maximum length of three.
SELECT paths
MATCH (p1:Person), (p2:Person),
paths = cPath((p1) - [:Friend*2..3] -> (p2))
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
The cPath function computes the continuous path. The result is a single path of
length three (the other possible path, with length one, is discarded). The path
is an array of the object nodes traversed together with their interval, attributes,
id and title. The interval of the result is the intersection of the intervals of the
object nodes in the path.
The figure below shows the format of the result. It can be seen that the
attribute and value nodes are embedded in the answer in an inline fashion, to
facilitate their search (as mentioned previously, object nodes are not likely to be
useful for a final user). Note that the value node “Mary Smith” is ignored since
its interval [1937, 1959] does not intersect with the continuous path’s interval
[2010, 2017]. Also note that the value node returned has the interval [2010,
2017], which is the intersection of the intervals [1960, Now] (the interval of
the value) and [2010, 2017]. Finally, the interval of the continuous path is
[2010, 2017], which is the result of the intersection between the traversed edges
([2010, 2018], [2002, 2017], [1995, Now]).
paths
{
"path": [
{ "interval": ["1937-Now" ],
"attributes": {
"Name": [
{"value": "Mary Smith-Taylor",
"interval": "[2010 - 2017]" }]
},
"id": 8,
"title": "Person"
},
{
...
}
],
"interval": "2010-2017"
}
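The interval arithmetic behind this result can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: intervals are closed pairs at Year granularity, the open-ended instant Now is modeled as positive infinity, and the helper names `intersect` and `path_interval` are hypothetical.

```python
import math
from typing import Iterable, Optional, Tuple

NOW = math.inf  # assumption: the open-ended instant "Now" is modeled as +infinity
Interval = Tuple[float, float]

def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersection of two closed intervals, or None when they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def path_interval(edge_intervals: Iterable[Interval]) -> Optional[Interval]:
    """Interval during which every edge of a path is valid at the same time."""
    result: Optional[Interval] = (-math.inf, math.inf)
    for interval in edge_intervals:
        result = intersect(result, interval)
        if result is None:
            return None
    return result

# Edges traversed by the path of Query 2, as listed in the text.
path = path_interval([(2010, 2018), (2002, 2017), (1995, NOW)])

# Value nodes of the Name attribute, clipped to the interval of the path.
old_name = intersect((1937, 1959), path)  # "Mary Smith": disjoint, so ignored
new_name = intersect((1960, NOW), path)   # "Mary Smith-Taylor": clipped to the path
```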
The cPath function is overloaded to return a Boolean value, as Query 3
shows.
Query 3 Find the names of the persons such that there is a continuous path
from them to Peter Burton.
SELECT p1.Name
MATCH (p1:Person), (p2:Person)
WHERE p2.Name = 'Peter Burton'
and cPath((p1) - [:Friend*2..3] -> (p2))
In this case the function call is located in the WHERE clause, and the parser
decides from the context that the Boolean procedure must be used.
Pairwise continuous paths (Section 3.2) can also be computed. Since they ask
for pairwise intersection intervals, the function computing these paths is
denoted pairCPath. An example is shown below.
Query 4 Find the pairwise continuous paths between Mary Smith-Taylor and
Peter Burton with a minimum length of two and a maximum length of three.
SELECT paths
MATCH (p1:Person), (p2:Person),
paths = pairCPath((p1) - [:Friend*2..3] -> (p2))
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
The intermediate results of a query can be filtered by an interval I, provided by
the user. This will filter out the paths whose interval does not intersect with I.
The granularity of the starting and the ending instants of the interval must be
the same. Query 5 below illustrates this.
Query 5 Compute all the continuous paths of friends between Mary Smith-Taylor
and Peter Burton, in the interval [2018, 2020], with a minimum length of two
and a maximum length of three.
In the running example, there are two possible paths between Mary and Peter:
one of length 3 and the other of length 1 (which is thus discarded). Therefore,
the only continuous path obtained would be Mary → Pauline → Cathy →
Peter, [2010, 2017]. However, this path will be filtered out of the result set, since
[2010, 2017] ∩ [2018, 2020] = ∅. The query is expressed as:
SELECT paths
MATCH (p1:Person), (p2:Person),
paths = cPath((p1) - [:Friend*2..3] -> (p2), '2018', '2020')
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
The properties of the returned structure can also be retrieved. For example, if
only the interval of the path is needed in Query 2, the query would read:
SELECT paths.interval as interval
MATCH (p1:Person), (p2:Person),
paths = cPath((p1) - [:Friend*2..3] -> (p2))
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
Furthermore, the attributes in the path can be retrieved as in the following query,
where the names of the persons at the start of the path (index 0) and at index 3
of the resulting paths are requested.
SELECT paths.path[0].attributes.Name as start_node,
paths.path[3].attributes.Name as end_node
MATCH (p1:Person), (p2:Person),
paths = cPath((p1) - [:Friend*2..3] -> (p2))
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
The head() and last() path methods can be used as follows.
SELECT head(paths.path).attributes.Name as start_node,
last(paths.path).attributes.Name as end_node
MATCH (p1:Person), (p2:Person),
paths = cPath((p1) - [:Friend*2..3] -> (p2))
WHERE p1.Name = 'Mary Smith-Taylor' and p2.Name = 'Peter Burton'
If more than one path is returned, the head() and last() functions are
applied to each one.
[Figure: three Airport nodes (codes BRC, GRU, LHR), each linked by a LocatedAt
relationship to a City node (Bariloche, Sao Paolo, London), and Flight edges
BA246 (2020/03/07 15:30, 2020/03/08 06:55) and AR2020 (2020/03/07 17:00,
2020/03/07 21:35).]
Fig. 4. A temporal graph for flight scheduling analysis.
4.3 Consecutive Path Queries
To illustrate consecutive path semantics (Definitions 9 and 10), a second running
example is introduced, depicted in Figure 4. In this example, there are two object
nodes, namely Airport and City. There are also two temporal relationships, Flight
and LocatedAt. The former is labeled with the interval $[t_d, t_a]$, where $t_d$ is the
departure time of a flight from an airport, and $t_a$ is the arrival time at the
destination airport. Airport nodes are labeled with the period during which
an airport belongs to a city (not shown in the figure, for clarity). Note here
the flexibility that the ILTG model provides, allowing representing cases which
are typically modeled using DLTGs. It is worth remarking that, of course, this
does not intend to be a real-world example of a flight scheduling graph, but a
simplified portion of it.
Consecutive path semantics is implemented through functions that are called
from T-GQL. Four functions are currently supported: fastestPath, earliestPath,
shortestPath, and latestDeparturePath. The first three receive two nodes as
arguments (recall, ‘*’ is mandatory for these path functions, unlike the case of
continuous paths). The last one also receives a time instant. The queries below
illustrate their syntax and semantics, starting with an example of an earliest-
arrival path query.
Query 6 How can we go from Tokyo to Buenos Aires as soon as possible?
Recalling Definition 10, Query 6 refers to the earliest-arrival path from Tokyo
to Buenos Aires. Note that this query uses the consecutive path semantics of
Definition 9. Here, the difference with the continuous path semantics is clear: a
path in the solution must be such that the intervals of the edges are pairwise
disjoint. The T-GQL query is written as follows:
SELECT path
MATCH (c1:City) - [:LocatedAt] -> (a1:Airport),
(c2:City) - [:LocatedAt] -> (a2:Airport),
path = earliestPath((a1)-[:Flight*]->(a2))
WHERE c1.Name = 'Tokyo' AND c2.Name = 'Buenos Aires'
Unlike the earliest-arrival path function, the latestDeparturePath function
needs a threshold parameter as argument. As an example, consider Query 7
below.
Query 7 How can we go from Tokyo to Buenos Aires, leaving as late as possible
and arriving before July 15 at 8 pm?
SELECT path
MATCH (c1:City) - [:LocatedAt] -> (a1:Airport),
(c2:City) - [:LocatedAt] -> (a2:Airport),
path = latestDeparturePath((a1)-[:Flight*]->(a2),'2019-07-15 20:00')
WHERE c1.Name = 'Tokyo' AND c2.Name = 'Buenos Aires'
4.4 Handling Temporal Granularity
The reader may have noticed that all time intervals in the social network example
are given in the Year time granularity; for the flight example, granularity is
Datetime. However, queries may mention a granularity different to the one in
the graph’s objects. This time granularity problem has been extensively studied
in temporal database theory, and it is common to all kinds of queries. When
a query includes a temporal condition with a temporal granularity $t_g$ different
from the granularity $o_g$ of an object in the graph, two cases may occur:
– $t_g$ is finer than $o_g$. In this case, both granularities are unified by
expressing the coarser one in terms of the finer one. For example, if
$o_g$.interval = [2010, 2012], and the condition is t IN $o_g$.interval, where t =
2/10/2012, then the interval is transformed into $o_g$.interval =
[1/1/2010, 31/12/2012].
– $t_g$ is coarser than $o_g$. In this case, one time instant in the granularity of
$o_g$ is chosen. For example, if $o_g$.interval = [15/10/2010, 23/12/2010], and
the condition is 2010 IN $o_g$.interval, the semantics would imply that the
condition is satisfied.
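The two cases can be sketched in Python for the Year and Date granularities only. This is an illustrative reading of the semantics above, not the code of the T-GQL implementation: the function names are hypothetical, and the second case is interpreted as "the condition holds if some instant of the given year falls within the interval".

```python
from datetime import date
from typing import Tuple

def years_to_dates(first_year: int, last_year: int) -> Tuple[date, date]:
    """Express a Year-granularity interval at Date granularity."""
    return date(first_year, 1, 1), date(last_year, 12, 31)

def date_in_year_interval(t: date, interval: Tuple[int, int]) -> bool:
    """Case 1: the condition (a date) is finer than the object's interval (years)."""
    lo, hi = years_to_dates(*interval)
    return lo <= t <= hi

def year_in_date_interval(year: int, interval: Tuple[date, date]) -> bool:
    """Case 2: the condition (a year) is coarser than the object's interval (dates).

    Assumption: it suffices that one instant of the year lies in the interval.
    """
    lo, hi = interval
    return lo.year <= year <= hi.year

# Examples taken from the two cases described in the text (dates are day/month/year).
case1 = date_in_year_interval(date(2012, 10, 2), (2010, 2012))
case2 = year_in_date_interval(2010, (date(2010, 10, 15), date(2010, 12, 23)))
```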
As mentioned, in the social network example, the granularity used is Year,
for all data. Also, the example queries are given using this granularity, so no
problem arises in this sense. However, if a query asks for Cathy’s friends on
October 10th, 2018, it would not be possible to give a precise answer, and the
query must use the semantics explained above. T-GQL supports the following
granularities and formats:
– Year: yyyy
– YearMonth: yyyy-MM
– Date: yyyy-MM-dd
– Datetime: yyyy-MM-dd HH:mm
Examples will be presented in the next sections.
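As an illustration of how the granularity of a temporal literal could be recognized from the four formats listed above, the following Python sketch tries each format in turn. The function `parse_instant` and the mapping to `strptime` patterns are assumptions made for this example; they do not describe the actual T-GQL parser.

```python
from datetime import datetime
from typing import Tuple

# Granularity names from the text, paired with equivalent strptime patterns.
FORMATS = [
    ("Year", "%Y"),
    ("YearMonth", "%Y-%m"),
    ("Date", "%Y-%m-%d"),
    ("Datetime", "%Y-%m-%d %H:%M"),
]

def parse_instant(text: str) -> Tuple[str, datetime]:
    """Return (granularity, parsed instant) for the first format that matches."""
    for name, fmt in FORMATS:
        try:
            return name, datetime.strptime(text, fmt)
        except ValueError:
            continue  # the literal does not follow this format; try the next one
    raise ValueError(f"unsupported temporal literal: {text!r}")

def same_granularity(first: str, second: str) -> bool:
    """Check that both endpoints of an interval use the same granularity."""
    return parse_instant(first)[0] == parse_instant(second)[0]
```

For instance, `parse_instant('2019-07-15 20:00')` yields the Datetime granularity, while `same_granularity('2018', '2020-01')` is false. Note that `strptime` tolerates missing zero padding, so a stricter check would be needed to enforce the formats exactly.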
4.5 Temporal Operators
Some kinds of T-GQL queries require temporal operators and filters, explained
in this section. To begin with, the SNAPSHOT operator returns the state of the
graph at a certain point in time. Therefore, along the lines of temporal database
notions, the answer is a non-temporal graph, like in Query 8 below.
Query 8 Who were the friends of the friends of Cathy in 2018?
SELECT p2.Name as friend_name
MATCH (p1:Person) - [:Friend*2] -> (p2:Person)
WHERE p1.Name = 'Cathy Van Bourne'
SNAPSHOT '2018'
Exactly one value is allowed to be used in the SNAPSHOT clause. The following
non-temporal result is returned:
p2.Name
{
"value": "Mary Smith Taylor"
}
The relationship with Pauline is filtered out since it was valid during the interval
[2002, 2017]. Therefore, there is only one object node reached, which has two
possible values for the Name attribute. The value “Mary Smith” is discarded
because it was not valid in 2018.
The BETWEEN operator performs an intersection of the graph intervals with
a given interval. Exactly one interval is allowed. The granularity of both intervals
must be the same, like in Query 9.
Query 9 Where did the friends of Pauline live between 2000 and 2004?
This query returns the cities where the friends of Pauline lived during the
given interval. The temporal semantics adopted also applies the condition on
the relationship interval. That means, for example, that the relationship with
Sandra will not be considered, since the interval of the relationship is [2005,
Now], thus, it does not intersect with the given interval. The T-GQL query is
written as follows:
SELECT c.Name
MATCH (p1:Person) - [:Friend] -> (p2:Person),
(p2) - [:LivedIn] -> (c:City)
WHERE p1.Name = 'Pauline Boutler'
BETWEEN '2000' and '2004'
Only the Friend relationship with Cathy Van Bourne was valid during the
interval used above, and the query returns Brussels and Paris, the cities where she
lived during the intervals [1980,2000], and [2001, Now], the ones that intersect
[2000,2004].
Finally, the WHEN clause is useful for answering parallel-period queries,
and follows the SQL inner query idea. The syntax has the form MATCH-
WHERE-WHEN, and the inner query can have references to variables in the
outer query. Function calls are not allowed within this clause, and it can only
handle exactly one two-node path in the inner MATCH clause, as in
Query 10.
Query 10 Who were friends of Mary while she was living in Antwerp?
Mary lived in Antwerp during the interval [1990, Now]. The adopted semantics
for this query is that the answer would be any person that was a friend of Mary
at any instant of that interval.
SELECT p2.Name as friend_name
MATCH (p1:Person) - [:Friend] -> (p2:Person)
WHERE p1.Name = 'Mary Smith-Taylor'
WHEN
MATCH (p1) - [e:LivedIn] -> (c:City)
WHERE c.Name = 'Antwerp'
For WHEN queries, the wildcard selection can only be performed on the nodes
of the outer query (the MATCH clause). In a nutshell, the inner query returns a
collection of intervals. Then, the WHEN clause performs a BETWEEN operation
with these intervals. Query 11 shows an even more involved example.
Query 11 Where did Cathy live when she and Sandra followed the same brands?
Cathy and Sandra both followed the brand LG: Sandra during the interval
[1995, 2000], and Cathy during the interval [1998, 2000]. The query language
allows expressing a graph traversal to the node that indicates where Cathy lived
from 1998 to 2000. In this case, it would be the city of Brussels. For this, the
query must compute the intersection of the intervals. It can be noticed that
the former two queries would be much more difficult and unnatural to express
with a duration-labeled representation.
SELECT c.Name as city_name, b1.Name as brand_name
MATCH (p1:Person) - [:LivedIn] -> (c:City),
(p1) - [:Fan] -> (b1:Brand)
WHERE p1.Name = 'Cathy Van Bourne'
WHEN
MATCH (p2:Person) - [f:Fan] -> (b2:Brand)
WHERE p2.Name = 'Sandra Carter' and b1.Name = b2.Name
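The reasoning behind Query 11 can be traced with a small Python sketch that uses only the intervals mentioned in the text (the Fan intervals of Cathy and Sandra, and the cities where Cathy lived). It is a worked example of the interval intersections involved, under the assumption that Now is modeled as positive infinity; it does not reproduce how the WHEN clause is actually evaluated.

```python
import math

NOW = math.inf  # assumption: "Now" is modeled as +infinity

def intersect(a, b):
    """Intersection of two closed intervals, or None when they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

# Intervals taken from the running example, as described in the text.
cathy_fan = {"LG": (1998, 2000)}
sandra_fan = {"LG": (1995, 2000)}  # intervals returned by the inner query
cathy_lived = {"Brussels": (1980, 2000), "Paris": (2001, NOW)}

answers = []
for brand, cathy_interval in cathy_fan.items():
    if brand not in sandra_fan:
        continue
    # Period during which both persons followed the same brand.
    common = intersect(cathy_interval, sandra_fan[brand])
    if common is None:
        continue
    # Cities where Cathy lived at some instant of that period.
    for city, lived in cathy_lived.items():
        if intersect(lived, common) is not None:
            answers.append((city, brand))
```

With these intervals, the common period is [1998, 2000], and the only city whose interval intersects it is Brussels.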
5 Implementation
This section describes the implementation details of this proposal. First, the gen-
eral system architecture is presented. Then, the parsing process and the trans-
lation of a T-GQL query to Cypher are explained. Finally, the algorithms for
computing the temporal operators and the different kinds of paths are discussed.
Fig. 5. General architecture.
5.1 Architecture
The model and language described in this paper were implemented over
Neo4j, an open-source Java-based graph database. Neo4j allows extending its
functionality with user-defined procedures, which can be easily added as plugins,
packed in a .jar file. These procedures can then be used in Cypher queries like
any of the built-in functions that this language offers.
As mentioned in previous sections, the T-GQL language is based on GQL, the
standard being developed. The language grammar was implemented using the
widely-known tool ANTLR.⁷ Using this tool, T-GQL queries are translated into
Cypher, Neo4j’s high-level query language, so that they can be executed over the
Neo4j database. Figure 5 sketches the system’s architecture. To edit and execute
T-GQL queries, a web application interface was developed, also coded in Java,
using the lightweight web framework Javalin.⁸ The application exposes a page
where the queries can be executed from an endpoint. This tool uses the main
parser to translate the users’ queries into Cypher and execute them on a Neo4j
server that contains the plugins to run the temporal operators and path algorithms.
In addition, for populating the social network running example database
(and also for the experiments reported in the next section), a dataset genera-
tor was developed. Parameters for this generator allow indicating the model to
be populated, as well as the number of relationships and nodes, and number of
intervals that each edge can have, among many other ones. The application com-
municates directly with a running Neo4j server through the bolt protocol, and
automatically populates the database by executing the corresponding Cypher
queries.
⁷ https://www.antlr.org/
⁸ http://javalin.io
Fig. 6. Social network metamodel.
5.2 Parsing and Query Translation
The parser was developed using ANTLR4, a parser generator that reads a gram-
mar and produces a recognizer for it. It is important to keep in mind that the
query language hides the actual data structure of the graph. Recall that the
model explained in Section 3 is composed of three kinds of nodes, namely ob-
ject, attribute, and value nodes, but the user writes her queries abstracting from
these elements. Consider, for instance, the metamodel of the social network
example, depicted in Figure 6. It can be seen that Person, City, and Brand are
object nodes, connected by different kinds of relationships. These object nodes
are associated with attribute and value nodes through a single kind of edge,
denoted Edge (also not visible to the user). Thus, in the implementation, Person
is actually a property (denoted title) of the object node, the Name of a person
is a property (also denoted title) of an attribute node, and the actual name of
the person is stored as a property of a value node, denoted value. All of these
elements, again, are not perceived by the user, but stored in the Neo4j database,
as shown in Figure 7. In the figure it can be seen that there is an edge labelled
Edge outgoing from an object node labeled Person (which is the value of the
property title of the object node). That edge reaches the attribute node Name
(again, Name is a property of the attribute node), and finally another Edge links
that node with a value node with value = ‘New York’. Note that all of these nodes
and edges are associated with intervals, not shown in the figure. The translation,
then, must not only rewrite the query in terms of the Cypher language, but also
bridge the gap between the structure exposed to the user, and the model actually
stored in Neo4j.
To illustrate the parsing process, consider the query:
SELECT p
MATCH (p:Person)
WHERE p.Name = 'John Smith'
Fig. 7. Social network model for the metamodel in Figure 6.
Figure 8 depicts the parse tree. The start rule is highlighted in blue, non-
terminal nodes are indicated in yellow, and terminal nodes in green. For the sake
of simplicity, not all the nodes needed for evaluating this query are expanded and
represented in the tree. Once the tree has been generated, it must be traversed.
ANTLR’s default method is represented in the figure with the dashed line. First,
all the tokens in the SELECT clause are recognized, followed by the MATCH
clause, and finally the WHERE clause. When the tree is fully traversed, the
Cypher query is generated. The query translation process is explained next.
The object nodes in the MATCH clause are translated as (alias:Object {title:
‘Name’}), since, as explained above, this property contains the entity type that
the user refers to. For example “(p:Person)” would be translated to (p:Object
{title: ‘Person’}). The edges do not need to be translated, since the grammar
for the edges matches Cypher’s grammar. If a function call is found, the
corresponding procedure is called, with the given arguments. An example is
shown at the end of this section.
For each attribute in the SELECT clause, a three-node path (Object - At-
tribute - Value) is produced from the object node. For example “p.Name as
name” would generate the following path:
OPTIONAL MATCH(p)-->(internal_n:Attribute{title:'Name'})
-->(name:Value)
Fig. 8. Example parse tree.
Recall that title is a property of the attribute node. In this case, OPTIONAL
MATCH is used to allow replacing the missing values in the SELECT clause with
a NULL value, and to return the row instead of discarding it. Variables starting
with ‘internal’ are generated internally by the parser and are reserved. For the
conditions in the WHERE clause, the attributes are expanded as explained above,
and the constants are translated without changing them. Finally, for each at-
tribute, the access to the value property of the value node, is added. For example,
the condition “p.Name = ‘John’ and p.Age = 18” is translated as:
MATCH (p)-->(internal_n:Attribute{title:'Name'})-->(internal_v:Value)
MATCH (p)-->(internal_a:Attribute{title:'Age'})-->(internal_v1:Value)
WHERE internal_v.value = 'John' and internal_v1.value = 18
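The expansion of WHERE conditions into three-node paths can be mimicked with a few lines of Python. This is a hypothetical sketch of the rewriting rule only: the actual translator is built on an ANTLR parse tree, the function `translate_conditions` and its variable-naming scheme (`internal_a0`, `internal_v0`, ...) are assumptions, and string escaping is deliberately ignored.

```python
from typing import List, Tuple, Union

Condition = Tuple[str, str, Union[str, int]]  # (alias, attribute, constant)

def translate_conditions(conditions: List[Condition]) -> str:
    """Rewrite conditions such as p.Name = 'John' over Object-Attribute-Value paths."""
    match_lines, predicates = [], []
    for i, (alias, attribute, constant) in enumerate(conditions):
        attr_var, value_var = f"internal_a{i}", f"internal_v{i}"
        # One MATCH per attribute: object node --> attribute node --> value node.
        match_lines.append(
            f"MATCH ({alias})-->({attr_var}:Attribute {{title: '{attribute}'}})"
            f"-->({value_var}:Value)"
        )
        # Constants are copied unchanged; strings keep their quotes.
        literal = f"'{constant}'" if isinstance(constant, str) else str(constant)
        predicates.append(f"{value_var}.value = {literal}")
    return "\n".join(match_lines) + "\nWHERE " + " and ".join(predicates)

cypher = translate_conditions([("p", "Name", "John"), ("p", "Age", 18)])
```

Printing `cypher` gives two MATCH lines followed by a single WHERE line, with the same shape as the translation shown above.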
Queries mentioning functions are explained next. Consider the following
query, asking for a continuous path:
SELECT p.path as path, p.interval as interval
MATCH (p1:Person), (p2:Person),
p = cPath((p1) - [:Friend*2..3] -> (p2), '2016', '2018')
WHERE p1.Name = 'Mary Smith-Taylor'
The query is translated into Cypher as:
MATCH (p1:Object {title: 'Person'}), (p2:Object {title: 'Person'})
MATCH (p1)-->(internal_n0:Attribute {title: 'Name'})
-->(internal_v0:Value)
WHERE internal_v0.value = 'Mary Smith-Taylor'
CALL coexisting.coTemporalPaths(p1,p2,2,3,{edgesLabel:'Friend',
nodesLabel:'Person',between:'2016-2018',direction:'outgoing'})
YIELD path as internal_p1, interval as internal_i1
WITH {path: internal_p1, interval: internal_i1} as p
RETURN p.path as 'path', p.interval as 'interval'
The temporal procedures (like coexisting.coTemporalPaths, used in this
query) are described in Section 5.3. Note that after calling these path procedures,
the query may ask for just one of the computed paths. For example, the following
query asks for the fastest path between airports located in the cities of London,
UK and Bariloche, Argentina. Both cities have more than one airport.
SELECT path
MATCH (c1:City)<-[:LocatedAt]-(a1:Airport),
(c2:City)<-[:LocatedAt]-(a2:Airport),
path=fastestPath((a1)-[:Flight*]->(a2))
WHERE c1.Name='London' AND c2.Name='Bariloche'
This is translated to:
MATCH (c1:Object {title: 'City'})<-[internal_l0:LocatedAt]-
(a1:Object {title: 'Airport'}),
(c2:Object {title: 'City'})<-[internal_l1:LocatedAt]
-(a2:Object {title: 'Airport'})
MATCH (c1)-->(internal_n0:Attribute {title: 'Name'})
-->(internal_v0:Value)
MATCH (c2)-->(internal_n1:Attribute {title: 'Name'})
-->(internal_v1:Value)
WHERE internal_v0.value='London' AND internal_v1.value='Bariloche'
CALL consecutive.fastest(a1,a2,1,
{edgesLabel:'Flight',direction:'outgoing'})
YIELD path as internal_p0, interval as internal_i0
WITH paths.intervals.fastest(
{path: internal_p0, interval: internal_i0}) as path
RETURN path
To evaluate this Cypher query, the engine will look for all the airports in Lon-
don and Bariloche, and all the combinations from airports in London to airports
in Bariloche. To retrieve the fastest path, the paths.intervals.fastest ag-
gregation function is called. It receives all the paths and returns only the fastest
ones, according to Definition 10.
5.3 Temporal Procedures Algorithms
It was already explained that the Neo4j database was extended with temporal
capabilities by means of a collection of procedures. Implementing the procedures
on the server side allows calling them directly from the Cypher language.
Besides, a client-side implementation would require retrieving a considerable
portion of the graph to execute the queries, which would not scale for
large graphs. Thus, the algorithms use fewer resources running on the server
side, since nodes and relationships are obtained directly from the database.
Procedures can be classified in three groups, depending on their functionality:
–Temporal procedures: Implement basic temporal operations. In this work,
Between and Snapshot are defined.
–Coexisting paths procedures: Implement the continuous and pairwise contin-
uous path semantics.
–Consecutive paths procedures: Implement the consecutive path semantics.
The procedures above are packed in a library which is stored in Neo4j’s
plugin folder. The Coexisting and Consecutive procedures extend a framework
defined to work on temporal graphs. This framework was based on the neo4j-
graph-algorithms library.⁹ This library contains implementations of graph
algorithms, although there are no algorithms for temporal graphs.
5.3.1 Temporal procedures The Between and Snapshot procedures receive
a Cypher query, execute it, and filter the results depending on the operation.
Neo4j returns the results of a query as a stream of records, analogously to rela-
tional databases. The operations above are thus applied to all the rows in the
stream, filtering the results that do not satisfy the temporal restrictions. In both
cases, the procedure receives a string containing the query, and another string
representing the granularity that must be applied to the operation. In addition,
the Between operation receives an interval, and keeps the records in the stream
whose intervals are inside the former one. The Snapshot operation also receives a
string that contains a specific time instant, and keeps the records whose intervals
contain that specific time instant.
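The filtering performed by these two procedures can be sketched in Python as generators over a stream of records. The sketch makes several assumptions that are not taken from the implementation: each record is a dictionary with a single closed `interval`, Now is modeled as positive infinity, and Between keeps the records whose interval intersects the given one (the behavior described for the BETWEEN operator in Section 4.5). The friendship intervals are those mentioned in the running example.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

NOW = math.inf  # assumption: "Now" is modeled as +infinity
Record = Dict[str, object]

def snapshot(records: Iterable[Record], instant: float) -> Iterator[Record]:
    """Keep the records whose interval contains the given time instant."""
    for record in records:
        lo, hi = record["interval"]
        if lo <= instant <= hi:
            yield record

def between(records: Iterable[Record], interval: Tuple[float, float]) -> Iterator[Record]:
    """Keep the records whose interval intersects the given interval."""
    q_lo, q_hi = interval
    for record in records:
        lo, hi = record["interval"]
        if max(lo, q_lo) <= min(hi, q_hi):
            yield record

# Friend relationships and intervals as described for the social network example.
friendships = [
    {"pair": ("Cathy", "Pauline"), "interval": (2002, 2017)},
    {"pair": ("Pauline", "Mary"), "interval": (2010, 2018)},
    {"pair": ("Pauline", "Sandra"), "interval": (2005, NOW)},
]

in_2018 = [r["pair"] for r in snapshot(friendships, 2018)]
in_2000_2004 = [r["pair"] for r in between(friendships, (2000, 2004))]
```

Since both functions are generators, the records are filtered lazily, one at a time, which mirrors the stream-based processing described above.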
5.3.2 Coexisting paths procedures These procedures return the continu-
ous paths of a given length, either starting from a node, or between two nodes.
In addition to this, a Boolean alternative is implemented, that can be used,
for example, for checking whether or not a continuous path exists between two
nodes.
Algorithm 1 retrieves all of the coexisting paths between two nodes, receiving
as input a graph G, a source node x, the minimum path length Lmin, the
maximum path length Lmax, a function f that returns an interval depending on
the algorithm, and, optionally, a destination node y. The algorithm returns a set
S with the results. Given two intervals, the function f returns another interval.
When computing continuous paths, f is defined as f(i1, i2) = i1 ∩ i2. That way,
only the intersection of the intervals is stored, and the algorithm keeps iterating
with them. For pairwise temporal paths, f is defined as f(i1, i2) = i2; this way
it only returns the latter interval, and the algorithm iterates only with the last
interval in the path.
⁹ https://github.com/neo4j-contrib/neo4j-graph-algorithms
Algorithm 1 Computes Coexisting Paths (Continuous and pairwise continuous
paths).
Input: A graph G, a source node x, the minimum path length Lmin, the maximum path length
Lmax, a function f depending on the type of path requested, and a destination node y (optional).
Output: A list of coexisting paths S.
Initialize a queue of paths Q and a list of solutions S.
Q.enqueue([(x, [−inf, +inf], 0)])
while not Q.isEmpty do
    current = Q.dequeue()
    z, interval, length = current.last()
    for (z, otherInterval, dest) ∈ G.edgesFrom(z) do
        if not current.containsNode(dest) and interval ∩ otherInterval ≠ ∅ then
            newTuple = (dest, f(interval, otherInterval), length + 1)
            copy = current.copy()
            copy.insert(newTuple)
            if Lmin <= length + 1 <= Lmax and (y not exists or dest == y) then
                S.insert(copy)
            end if
            if length + 1 < Lmax then
                Q.enqueue(copy)
            end if
        end if
    end for
end while
The algorithm takes the source node x and adds it to a list, in a triplet
containing an interval [−inf, +inf], whose bounds are the minimum and maximum
time instants of the node, and the length of the path, initially set to zero.
This list represents a path that starts at the source node, and is added to the
queue. The algorithm picks up paths from the queue until the queue is empty. For
each path, it takes the last triplet and looks up in the graph G the edges
associated with the node in this triplet. Then, for each edge, the algorithm
checks whether the node at the opposite end of the edge is already in the path,
or whether the interval of the edge does not intersect the interval in the
triplet. In either case, the edge cannot continue the path. The first check
prevents iterating over the same nodes. For example, given an edge between A and
B with interval [1, 2], a path A-B-A-B would be possible without this
restriction, because the interval between A and B always intersects with itself.
If the edge can continue the path, a triplet with the new node is created,
containing the result of applying the function f, and the length of the path,
which is the length in the last triplet of the path, plus 1. The path is copied
and the triplet is added to the copy. If the copy of the path (which is also a
path) has a length between Lmin and Lmax and the node of the last triplet is the
destination node (if such a node is given as input), this path is added to the
set of solutions S. Independently of this, if the length of the copy is less
than Lmax, the copy is added to the queue, so that longer paths can be explored.
When the queue is empty, the set of solutions S is returned.
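The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the library's implementation: the adjacency map `edges_from`, the helper `intersect`, the names `f_continuous` and `f_pairwise`, and the toy graph are all assumptions made for the example. Intervals are modeled as closed pairs `(start, end)`.

```python
from collections import deque

INF = float("inf")

def intersect(a, b):
    """Intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def f_continuous(interval, other):
    # Continuous paths: carry the running intersection along the path.
    return intersect(interval, other)

def f_pairwise(interval, other):
    # Pairwise continuous paths: keep only the latter (edge) interval.
    return other

def coexisting_paths(edges_from, x, l_min, l_max, f, y=None):
    """edges_from maps a node to a list of (interval, dest) pairs (assumed layout)."""
    solutions = []
    queue = deque([[(x, (-INF, INF), 0)]])
    while queue:
        current = queue.popleft()
        z, interval, length = current[-1]
        for other, dest in edges_from.get(z, []):
            visited = any(node == dest for node, _, _ in current)
            if visited or intersect(interval, other) is None:
                continue  # the edge cannot continue the path
            path = current + [(dest, f(interval, other), length + 1)]
            if l_min <= length + 1 <= l_max and (y is None or dest == y):
                solutions.append(path)
            if length + 1 < l_max:
                queue.append(path)
    return solutions

# Hypothetical toy graph: A -[1,5]- B -[4,8]- C -[6,9]- D
toy = {"A": [((1, 5), "B")], "B": [((4, 8), "C")], "C": [((6, 9), "D")]}
print(coexisting_paths(toy, "A", 3, 3, f_continuous, "D"))  # no common instant
print(coexisting_paths(toy, "A", 3, 3, f_pairwise, "D"))    # consecutive edges overlap
```

In the toy graph, the three intervals share no common instant, so the continuous variant finds nothing, whereas each pair of consecutive edges overlaps, so the pairwise variant returns the path through A, B, C and D.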
Algorithm 2 is the Boolean version of the previous one: it checks whether
a continuous path exists between two nodes. That is, if a path is found,
true is returned; otherwise, false is returned.
Algorithm 2 Checks the existence of a Continuous Path.
Input: A graph G, a source node x, the minimum path length Lmin, the maximum path
length Lmax, a function f which depends on the type of path requested (continuous
or pairwise), and a destination node y (optional).
Output: True if a Continuous Path exists. False otherwise.
  Initialize a queue of paths Q.
  Q.enqueue([(x, [−inf, +inf], 0)])
  while not Q.isEmpty do
    current = Q.dequeue()
    z, interval, length = current.last()
    for (z, otherInterval, dest) ∈ G.edgesFrom(z) do
      if not current.containsNode(dest) and interval ∩ otherInterval ≠ ∅ then
        newTuple = (dest, f(interval, otherInterval), length + 1)
        if Lmin <= length + 1 <= Lmax and (y not exists or dest == y) then
          return true
        end if
        if length + 1 < Lmax then
          copy = current.copy()
          copy.insert(newTuple)
          Q.enqueue(copy)
        end if
      end if
    end for
  end while
  return false
5.3.3 Consecutive paths procedures These procedures are based on the
graph transformation approach introduced by Wu et al. [32] for DLTGs, which
was adapted to address the temporal graph model of Definition 4. However,
unlike the approach presented in [32], the algorithm presented here avoids
building the whole transformed graph before applying the path computation
algorithms, since this would be extremely expensive. Instead, the transformed
graph is built as the iterations proceed over the original temporal graph, call
it G. The transformation creates a new graph, denoted Gt, where each node
contains either the starting time or the ending time of an interval of the
temporal graph (explained below), and the edges indicate the nodes that are
reachable from that position. Here, reachable means that both nodes are included
in the same interval, or that they start from the same node and the starting
time of the source node is prior to the one in the destination node. The weight
of an edge is the duration of the corresponding interval. This new graph is
easier to iterate over: it does not contain cycles, because it is not possible
to go from a node with a greater time to a node with a lesser time, and all the
weights of the edges are (or can be represented as) positive numbers.
Algorithm 3 sketches the process. The algorithm receives, as arguments, a
temporal graph G, the source and destination nodes of the path to be computed
(s and d, respectively), and a function f used to sort the nodes of the
transformed graph in a priority queue, in a way that depends on the algorithm
(earliest, latest, fastest, shortest paths). It returns a set of nodes S. The
following is assumed in the sequel for f:
x < y if f(x, y) < 0
x = y if f(x, y) = 0
x > y if f(x, y) > 0
Algorithm 3 Compute the minimum consecutive paths.
Input: A graph G, a source node s, a destination node d. A comparison function
f(x, y) where x < y if f(x, y) < 0.
Output: A set with the optimal solutions S.
  Initialize the transformed graph Gt and Q (priority queue of Gt nodes)
  Q.enqueue((s, −∞, 0, null))
  while not Q.isEmpty do
    current = Q.dequeue()
    for (current.node, interval, dest) ∈ G.edgesFrom(current.node) do
      if current.time > interval.start then
        continue
      end if
      vOut = (current.node, interval.start, current.length + 1, current)
      vIn = (dest, interval.end, current.length + 1, vOut)
      if Gt.containsNode(vIn.node, vIn.time) then
        othervIn = Gt.get(vIn.node, vIn.time)
        if f(othervIn, vIn) > 0 then
          continue
        end if
      end if
      if dest == d then
        if S.isEmpty then
          S.add(vIn)
        else
          s = S.getAny()
          comp = f(vIn, s)
          if comp > 0 then
            S.empty()
            S.add(vIn)
          else if comp == 0 then
            S.add(vIn)
          end if
        end if
        continue
      end if
      Q.insert(vIn)
    end for
  end while
  return S
The nodes of the transformed graph Gt have four attributes: a reference to
the node in the original graph, a time instant, the length of a path that passes
through that node (used to iterate the graph in a DFS way), and a reference to
the previous node in Gt, which allows rebuilding the paths after running the
algorithm. For clarity, these attributes are denoted (for a node n) n.node,
n.time, n.length and n.previous in Algorithm 3.
After initializing the necessary structures, the algorithm adds the initial
transformed graph node to the priority queue. This node is a quadruple that
contains the source node s, −∞ as the time instant, 0 as length, and null as the
reference to the previous node. Elements are picked up from the queue, one at a
time, until the queue is empty. Let e be such an element; there is a node vi in
the temporal graph associated with e. For each edge outgoing from vi in G, the
node is expanded, creating the nodes vout and vin in the transformed graph. The
node vout contains vi, the start time of the interval in the edge, the length of
e plus 1, and e as the previous node, that is, (vi, tstart, e.length + 1, e).
The node vin contains the destination node of the edge, the end time of the
interval in the edge, the length of e plus 1, and vout as the previous node,
that is, (vf, tend, e.length + 1, vout). If the start time of the interval is
less than the time instant of e, the path is not expanded, because this means
that the interval started before the instant associated with e. For example, for
the interval [5,8], if the time instant in e is 7, the node will not be
expanded, since it would not yield a consecutive path.
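The expansion step just described can be sketched in Python as follows. The sketch follows Algorithm 3 as printed (both new nodes take the length of e plus 1); the `TNode` type, the `expand` function, and the adjacency layout `edges_from` are illustrative assumptions, not names taken from the implementation.

```python
from typing import NamedTuple, Optional

class TNode(NamedTuple):
    """A node of the transformed graph Gt (hypothetical representation)."""
    node: str                     # node of the original temporal graph
    time: float                   # start or end instant of an interval
    length: int                   # path length, as used in Algorithm 3
    previous: Optional["TNode"]   # link used later to rebuild the path

def expand(e, edges_from):
    """Yield the (v_out, v_in) pairs for every edge that can follow e.

    edges_from maps a node to a list of ((start, end), dest) pairs.
    """
    for (start, end), dest in edges_from.get(e.node, []):
        if start < e.time:
            continue  # the interval starts before e: not a consecutive path
        v_out = TNode(e.node, start, e.length + 1, e)
        v_in = TNode(dest, end, e.length + 1, v_out)
        yield v_out, v_in

# The case from the text: e has time instant 7, so the interval [5,8] is skipped.
e = TNode("X", 7, 1, None)
edges = {"X": [((5, 8), "Y"), ((7, 9), "Z")]}
for v_out, v_in in expand(e, edges):
    print(v_out[:3], v_in[:3])
```

Only the edge with interval [7,9] is expanded here; the one with interval [5,8] is dropped because it starts before the instant stored in e.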
After creating vin and vout in the transformed graph Gt, the algorithm checks
if Gt already contains a node vin′ such that the temporal graph node and the
time instant are the same as the ones in vin. If this is the case, the two nodes
are compared with the function f. If f(vin, vin′) < 0, the path is discarded. If
f(vin, vin′) > 0, the node is replaced. Otherwise, the node is kept in the graph.
The rationale behind discarding paths is the following: if two paths P1 and P2
in Gt that end at the same node d contain the same transformation node n, and
f(P1(n), P2(n)) > 0, then f(P1(d), P2(d)) > 0, since the same nodes will be
expanded and the function f depends on the nodes already traversed (e.g., for
the shortest path, f depends on the path length; for the earliest path, it
depends on the arrival time at each node, and so on). Then, if vt, the temporal
graph node in vin, is not the destination node d, vin is added to the queue. If
vt is the destination node d and S = ∅, vin is added to S. If S ≠ ∅, then any
s ∈ S is picked up and compared with vin: if f(vin, s) < 0, vin is discarded; if
f(vin, s) = 0, vin is added to S; and if f(vin, s) > 0, S is reset to {vin}.
When Q is emptied, the set S of nodes in Gt is returned, and the algorithm
reconstructs the paths using the stored references to previous nodes. That is,
for each node in S, the algorithm follows the link to the previous node until
there is no previous node, as in implementations of Dijkstra's algorithm.
It is worth remarking again that the function f is defined differently for each
kind of consecutive path. Given a function first that returns the first node of
the path defined by the references to the previous node in a node of Gt, f is
defined as:
– Earliest-arrival path: f(x, y) = x.time − y.time
– Latest-departure path: f(x, y) = first(x).time − first(y).time
– Shortest path: f(x, y) = x.length − y.length
– Fastest path: f(x, y) = (x.time − first(x).time) − (y.time − first(y).time)
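The four definitions above translate directly into comparison functions. The following Python sketch is illustrative only: the `TNode` type and the function names are assumptions. It also assumes that `first` stops at the first node carrying a finite time, that is, it skips the initial (s, −∞, 0, null) quadruple, since otherwise the departure time would always be −∞.

```python
from typing import NamedTuple, Optional

INF = float("inf")

class TNode(NamedTuple):
    """A node of the transformed graph Gt (hypothetical representation)."""
    node: str
    time: float
    length: int
    previous: Optional["TNode"]

def first(n):
    """First node of the chain with a finite time (assumption: skip the -inf root)."""
    while n.previous is not None and n.previous.time != -INF:
        n = n.previous
    return n

def f_earliest(x, y):
    return x.time - y.time

def f_latest_departure(x, y):
    return first(x).time - first(y).time

def f_shortest(x, y):
    return x.length - y.length

def f_fastest(x, y):
    return (x.time - first(x).time) - (y.time - first(y).time)

# Two hypothetical chains reaching the same destination D from S:
root = TNode("S", -INF, 0, None)
p = TNode("D", 10, 2, TNode("S", 3, 1, root))  # departs at 3, arrives at 10
q = TNode("D", 9, 3, TNode("S", 5, 1, root))   # departs at 5, arrives at 9
for f in (f_earliest, f_latest_departure, f_shortest, f_fastest):
    print(f.__name__, f(p, q))
```

With these values, p is the shorter chain, while q arrives earlier, departs later and takes less time, which shows that the four criteria can rank the same pair of chains differently.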
The library that has been developed also contains aggregation functions.
These functions are used to filter the results obtained by executing the
consecutive paths procedures: they iterate over all the results returned by
these procedures and choose the fastest, earliest, shortest, latest departure or
latest arrival paths, depending on the function called. They are useful when the
procedures are called more than once, to prevent returning non-optimal values.
Example 4 (Consecutive Paths Computation). Figure 9 shows a graph over which
the shortest path between nodes A and E will be computed with Algorithm 3.
[Figure 9: a temporal graph with nodes A, B, C, D and E, whose edges are labeled
with the intervals [1,2], [2,4], [5,7], [6,8], [5,6], [1,3], [3,4] and [4,6].]
Fig. 9. An example for Algorithm 3.
The function f will thus be f(x, y) = x.length − y.length. Figure 10 shows the
transformed graph at the end of the execution of the algorithm.
The first node created in Gt is (A, −∞, 0) (the reference to the previous node
is omitted, for clarity), which is added to the queue. Thus, Q = [(A, −∞, 0)]
is the initial state of the queue. The node is picked up from the queue. The
edges outgoing from A in the graph of Figure 9 have intervals [2,4] and [1,2];
taking [2,4], the nodes vout = (A, 2, 0) and vin = (C, 4, 1) are created in Gt.
Then, vin is picked up; the edges outgoing from C have intervals [5,7], [1,3]
and [6,8]. Here, [1,3] cannot be expanded, since it would not yield a
consecutive path. For the other two intervals, the new nodes vin are created. Of
these, (E, 7, 2) is added to the result set, and the new state of the queue is
Q = [(B, 8, 2), (B, 2, 1)]. Since a first solution has now been obtained,
(B, 8, 2) is compared against it, and given that
f((B, 8, 2), (E, 7, 2)) = 2 − 2 = 0, this path is discarded. Then, (B, 2, 1) is
expanded, and the process continues in the same way. Finally, the two paths are:
(A, 1, 0) -> (B, 2, 1) -> (B, 4, 1) -> (E, 6, 2) and
(A, 2, 0) -> (C, 4, 1) -> (C, 5, 1) -> (E, 7, 2), which lead to the shortest
paths A, B, E and A, C, E.
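The last step of the example, going from the two chains of transformed-graph nodes to the node sequences A, B, E and A, C, E, can be sketched in Python as follows. The triples are the ones listed in the example; the `TNode` type and the helpers `chain` and `rebuild` are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple, Optional

class TNode(NamedTuple):
    """A node of the transformed graph Gt (hypothetical representation)."""
    node: str
    time: float
    length: int
    previous: Optional["TNode"]

def chain(*triples):
    """Link (node, time, length) triples through 'previous'; return the last node."""
    prev = None
    for node, time, length in triples:
        prev = TNode(node, time, length, prev)
    return prev

def rebuild(last):
    """Follow the 'previous' links back to the start, then drop repeated nodes."""
    nodes = []
    n = last
    while n is not None:
        nodes.append(n.node)
        n = n.previous
    nodes.reverse()
    return [v for i, v in enumerate(nodes) if i == 0 or v != nodes[i - 1]]

p1 = chain(("A", 1, 0), ("B", 2, 1), ("B", 4, 1), ("E", 6, 2))
p2 = chain(("A", 2, 0), ("C", 4, 1), ("C", 5, 1), ("E", 7, 2))
print(rebuild(p1))            # ['A', 'B', 'E']
print(rebuild(p2))            # ['A', 'C', 'E']
print(p1.length - p2.length)  # 0: both solutions have the same length
```

Each original node appears twice in a chain (once as the arrival node vin and once as the departure node vout), which is why consecutive duplicates are collapsed when the path is rebuilt.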
6 Evaluation
This section reports and discusses the experiments carried out in order to test
the different algorithms described and implemented in this work. These exper-
iments cover the two classes of path algorithms studied: continuous paths and
consecutive paths.
6.1 Description of the experiments
The goals of the experiments, and the experimental setup are detailed in this
section, for each of the classes of algorithms tested.
[Figure 10: the transformed graph Gt, with nodes (A,1,0), (A,2,0), (B,2,1),
(B,3,1), (B,4,1), (C,4,1), (C,5,1), (C,6,1), (D,4,2), (E,6,2), (E,7,2) and
(B,8,2).]
Fig. 10. An example for Algorithm 3.
6.1.1 Continuous paths algorithms The goal of these experiments is to
test how the length of the paths and the size of the dataset impact the
performance of the algorithm. Therefore, different tests are conducted, varying
both variables.
Typical continuous path queries are run over the social network temporal
graph, asking for continuous paths of different lengths between two specific per-
sons, the latter indicated by a property denoted id, generated during the popu-
lation of the dataset. For example, the query below asks for all the continuous
paths of length 8 between the Person nodes with id 10 and 30. This query is run
for different pairs of persons and different path lengths.
SELECT p
MATCH (n:Person), (m:Person), p = cPath((n)-[:Friend*8]-(m))
WHERE n[id] = 10 AND m[id] = 30
The same type of query was run to test the pairwise continuous path
algorithm:
SELECT p
MATCH (n:Person), (m:Person), p = pairCPath((n)-[:Friend*8]-(m))
WHERE n[id] = 10 AND m[id] = 30
6.1.2 Consecutive paths algorithms The goal of these experiments is to
evaluate how the different path algorithms behave for various graph sizes. The
tests are run over real-world flight datasets, taking a subset of the airports
in such datasets. The chosen airports are of very different sizes, to cover a
wide range of connecting flights. The queries perform a consecutive path search
between two specific airports, identified by their IATA (International Air
Transport Association) code, a three-letter code that uniquely identifies an
airport. The queries address the four kinds of consecutive path algorithms, and
are of the following form:
SELECT path
MATCH (a1:Airport), (a2:Airport),
path = fastestPath((a1)-[:Flight*]->(a2))
WHERE a1.Code = 'BOS' and a2.Code = 'HOU'
SELECT path
MATCH (a1:Airport), (a2:Airport),
path = shortestPath((a1)-[:Flight*]->(a2))
WHERE a1.Code = 'BOS' and a2.Code = 'HOU'
SELECT path
MATCH (a1:Airport), (a2:Airport),
path = earliestPath((a1)-[:Flight*]->(a2))
WHERE a1.Code = 'BOS' and a2.Code = 'HOU'
SELECT path
MATCH (a1:Airport), (a2:Airport),
path = latestDeparturePath((a1)-[:Flight*]->(a2))
WHERE a1.Code = 'BOS' and a2.Code = 'HOU'
6.2 Datasets and setup
This section reports the characteristics of the datasets used for evaluating the
two kinds of algorithms. For continuous paths algorithms, synthetic data were
generated, resembling the social network running example (Figure 2). For con-
secutive paths algorithms, real-world flight data were used.
All experiments were run in the same environment: a Neo4j 3.5.17 server
running on Ubuntu 16.04 (64-bit), with a 12-core CPU and 25 GB of RAM.
6.2.1 Continuous paths algorithms A dataset generator, based on the
model described in Definition 4 and represented in Figure 2, populates the
graph databases for these experiments. To generate data for the social network
graph used to test the continuous and pairwise continuous path algorithms, the
following parameters are considered:
– N = Number of Person nodes.
– F = Maximum number of Friendship relationships per person.
– I = Maximum number of intervals per friendship.
– Number (C) and length (L) of the continuous paths.
First, the generator creates C continuous paths of length L, and then generates
the friendship relationships for the whole graph. Since the latter are randomly
generated, additional continuous paths may arise, so the generator ensures only
a minimum of C continuous paths of length L. Once the continuous paths are
created, the ids of the persons involved in each continuous path are stored, to
be used in the queries as the start and end of the paths of length L. Note that
more than one path could be obtained as a result, as a consequence of the random
generation of friendships.

N        Nodes    Edges    Size
1000     3021     6833     747.95 MB
10000    30021    67676    776.02 MB
100000   300021   677278   1.06 GB
Table 1. Continuous paths experiments: Characteristics of each social network dataset.

Dataset    Airports   Flights   Size
1 week     312        109911    1.92 MB
1 month    312        469968    22.53 MB
3 months   315        1403471   64.52 MB
6 months   322        2889512   131.38 MB
1 year     629        5819079   413.23 MB
Table 2. Consecutive paths experiments: Number of airports, flights and sizes of each
dataset.
Three datasets were generated, with N = 1000, 10000 and 100000, while the
other parameters were fixed, with values F = 5 and I = 2. For each dataset, 3
paths (i.e., C = 3) of each of the following lengths (L) were generated: 4, 6,
8, 10 and 12. Table 1 details the number of nodes, edges and sizes of the
datasets. Indexes were created on the Object, Value and Attribute nodes for the
id property.
The execution of a query for a specific N and L is carried out C times, varying
the ids of the start and end nodes of the path, to account for the different
numbers of paths of length L that may exist in a graph, and for the different
starting and ending nodes.
6.2.2 Consecutive paths algorithms Consecutive path algorithms were
tested using a real-world flight database, the Flight Delays and Cancellations
dataset for US flights in 2015 (see footnote 10), using the original departure
and arrival times of the flights. Five datasets were generated from it,
filtering the flights by different time intervals. The selected periods for the
datasets were the first week, first month, first three months, first half year,
and the entire year. The number of flights and airports is shown in Table 2. The
number of Vout and Vin nodes of the complete transformed graph is reported in
Table 3. The following airports were chosen:
1. ATL - Hartsfield–Jackson Atlanta International Airport, Atlanta, Georgia.
2. CLD - McClellan-Palomar Airport, Carlsbad, California.
3. BOS - General Edward Lawrence Logan International Airport, Boston,
Massachusetts.
4. HOU - William P. Hobby Airport, Houston, Texas.
5. SBN - South Bend Regional Airport, South Bend, Indiana.
6. ISP - Long Island MacArthur Airport, Islip, New York.

Dataset    Vout      Vin       Total
1 week     71455     84216     155661
1 month    308656    366301    674957
3 months   920257    1095713   2015970
6 months   1891583   2254938   4146521
1 year     3828264   4549494   8377758
Table 3. Total number of nodes in each dataset.

10 https://www.kaggle.com/usdot/flight-delays?select=flights.csv
Different routes were selected between the airports above:
1. ATL to CLD (A large airport to a small one)
2. BOS to HOU (A medium size airport to another medium one)
3. ATL to AUS (A large airport to a medium one)
4. SBN to ISP (A small airport to another small one)
Routes between two large airports were not selected because there are usually
direct flights between them, meaning that a path of length 1 normally exists;
therefore, the results would not be representative, since performance would be
overestimated.
The number of incoming and outgoing flights is listed in Table 8 in
Appendix A. Note that, for airport CLD, the number of flights stops growing
after the first 6 months, as the airport closes. This airport was chosen because
it challenges the latest departure path algorithm, which searches first along
the paths with the latest departure time, although all the arrivals occur in the
first half of the year.
6.3 Results
This section reports the results of the experiments presented above. Comparing
raw execution times would not be representative of the algorithms' performance,
since these times depend on a number of factors, for example, the number of
continuous paths of a certain length that the algorithm finds, which may vary
for different pairs of starting and ending nodes. Therefore, the following
average is used to report the results.
\[
T = \frac{1}{n}\sum_{i=1}^{n}\frac{t_i}{c_i}
  = \frac{1}{n}\left(\frac{t_1}{c_1} + \dots + \frac{t_n}{c_n}\right)
\]
In the expression above, n is the number of different pairs of nodes (start
and end of a continuous path) for which the query was run, t the execution time
and c the number of paths found for each pair of nodes.
For example, for C = 3, three continuous paths of length L are generated:
one between node A1 and node A2 and, analogously, one between each of the pairs
of nodes A3−A4 and A5−A6. Example results of query executions are shown next.
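Node pair   Paths found   Elapsed time
A1 → A2     3             12 s
A3 → A4     2             6 s
A5 → A6     9             45 s
The weighted average T is, for this example:
\[
T = \frac{1}{n}\left(\frac{t_1}{c_1} + \frac{t_2}{c_2} + \frac{t_3}{c_3}\right)
  = \frac{1}{3}\left(\frac{12}{3} + \frac{6}{2} + \frac{45}{9}\right) = 4
\]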
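The same computation can be written as a short Python function. This is only an illustration of the formula; the function name `per_path_average` and the list-of-pairs input layout are assumptions made for the example.

```python
def per_path_average(runs):
    """Average of t_i / c_i over all node pairs.

    runs: list of (elapsed_time, paths_found) pairs, one per node pair.
    """
    return sum(t / c for t, c in runs) / len(runs)

# The three example executions: (elapsed time in seconds, paths found).
runs = [(12, 3), (6, 2), (45, 9)]
print(per_path_average(runs))              # 4.0
print(sum(t for t, _ in runs) / len(runs)) # 21.0, the plain average of the times
```

The plain average of the elapsed times (21 s) is dominated by the pair for which nine paths were found, whereas dividing each time by the number of paths found gives comparable per-path figures (4, 3 and 5 s).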
Finally, for the consecutive paths, the usual definition of average is used,
running the algorithms three times for each path and dataset.
Figure 11 displays the execution times for the continuous path and pairwise
continuous path algorithms. The x-axis represents the length of the continuous
paths in the queries. Figure 12 displays the execution times with respect to the
number of nodes visited by the algorithm, for the continuous path algorithm, for
N = 100,000 and L = 12. In this case, each point is obtained by dividing the
execution time by the number of paths found for a pair of Person nodes.
Figures 13 and 14 display the results for the tests addressing the latest
departure, fastest, earliest, and shortest path algorithms. Execution times are
represented on the y-axis, and the number of flights on the x-axis. Tables 4
through 7 show, for each route tested, the average time and the number of paths
in the result, for each time partition of the dataset.
[Figure 11: two plots of time (ms) against path length (nodes), with one series
for each of N = 1000, N = 10000 and N = 100000. Panel (a): continuous path
algorithm. Panel (b): pairwise continuous path algorithm.]
Fig. 11. (a) Execution time vs. Path length for continuous path algorithm; (b)
Execution time vs. Path length for pairwise continuous path algorithm.
[Figure 12: plot of execution time (ms) against visited nodes, for N = 100000
and L = 12.]
Fig. 12. Visited nodes vs. Execution time for continuous path algorithm on paths of L
= 12. The visited nodes are the nodes that are traversed for a continuous path search.
Each point is a division between the execution time and the paths found for a pair of
nodes.

Path        Dataset    Average Time (ms)   N. of Results
ATL → CLD   1 week     267                 1
ATL → CLD   1 month    1318.33             1
ATL → CLD   3 months   4098                1
ATL → CLD   6 months   15622280.67         1
ATL → CLD   1 year     129165589.33        1
BOS → HOU   1 week     231                 1
BOS → HOU   1 month    1224.33             3
BOS → HOU   3 months   3952.33             1
BOS → HOU   6 months   12072               1
BOS → HOU   1 year     33875.33            1
ATL → AUS   1 week     97.33               1
ATL → AUS   1 month    1807                2
ATL → AUS   3 months   7462.33             1
ATL → AUS   6 months   34883               1
ATL → AUS   1 year     118174.67           1
SBN → ISP   1 week     257                 1
SBN → ISP   1 month    1263.67             9
SBN → ISP   3 months   3735.33             3
SBN → ISP   6 months   8829.67             3
SBN → ISP   1 year     18760.33            74
Table 4. Average time and number of results for the latest departure path algorithm.

Path        Dataset    Average Time (ms)   N. of Results
ATL → CLD   1 week     6755                3
ATL → CLD   1 month    140522.33           3
ATL → CLD   3 months   1427239.33          3
ATL → CLD   6 months   8172404             8
ATL → CLD   1 year     29744579            8
BOS → HOU   1 week     1969.33             7
BOS → HOU   1 month    51980.67            31
BOS → HOU   3 months   536338              31
BOS → HOU   6 months   2123572.67          2
BOS → HOU   1 year     8694658.33          11
ATL → AUS   1 week     973                 3
ATL → AUS   1 month    17640.33            27
ATL → AUS   3 months   237272.33           45
ATL → AUS   6 months   671548              21
ATL → AUS   1 year     2938933             4
SBN → ISP   1 week     3925.33             1
SBN → ISP   1 month    72191.33            2
SBN → ISP   3 months   560382              4
SBN → ISP   6 months   3628807             21
SBN → ISP   1 year     13622908.67         1
Table 5. Average time and number of results for the fastest path algorithm.

[Figure 13: two plots of time (ms, logarithmic scale) against the number of
flights in the dataset (×10^6), with one series per route (ATL - CLD, BOS - HOU,
ATL - AUS, SBN - ISP). Panel (a): latest departure path algorithm. Panel (b):
fastest path algorithm.]
Fig. 13. (a) Execution time for each pair of airports for the latest departure path
algorithm; (b) Execution time for each pair of airports for the fastest path algorithm.

6.4 Discussion of Results
A discussion of the results reported in the previous section is presented next.
Continuous and pairwise continuous path experiments are commented first,
followed by a discussion on the results obtained for the four kinds of
consecutive path algorithms.
Path        Dataset    Average Time (ms)   N. of Results
ATL → CLD   1 week     412                 1
ATL → CLD   1 month    1995.33             1
ATL → CLD   3 months   7349                1
ATL → CLD   6 months   18505               1
ATL → CLD   1 year     36813.67            1
BOS → HOU   1 week     360                 1
BOS → HOU   1 month    1783.33             1
BOS → HOU   3 months   5699.67             1
BOS → HOU   6 months   14219.66            1
BOS → HOU   1 year     33812               1
ATL → AUS   1 week     98.33               1
ATL → AUS   1 month    414.33              1
ATL → AUS   3 months   1411.33             1
ATL → AUS   6 months   2758.33             1
ATL → AUS   1 year     6391.67             1
SBN → ISP   1 week     1992.67             9
SBN → ISP   1 month    10507               9
SBN → ISP   3 months   36670.33            9
SBN → ISP   6 months   102361.67           9
SBN → ISP   1 year     238015.67           9
Table 6. Average time and number of results for the earliest path algorithm.

[Figure 14: two plots of time (ms, logarithmic scale) against the number of
flights in the dataset (×10^6), with one series per route (ATL - CLD, BOS - HOU,
ATL - AUS, SBN - ISP). Panel (a): earliest path algorithm. Panel (b): shortest
path algorithm.]
Fig. 14. (a) Execution time for each pair of airports for the earliest path algorithm;
(b) Execution time for each pair of airports for the shortest path algorithm.
6.4.1 Continuous Paths The left-hand side of Figure 11 shows the execution
time for each dataset size and for different continuous path lengths. For
N = 10000 and N = 100000, the execution times increase as the path length
increases, starting with values around 50 ms for L = 4 and growing up to 733 ms
and 3279 ms, respectively, for L = 12. On the other hand, for N = 1000,
execution times remain low and, starting at 30 ms, decrease for longer paths,
without exceeding 80 ms in any case. It can also be seen that, for N = 100000,
execution times grow faster than for N = 10000. Figure 12 shows that the
execution time is linear with respect to the number of nodes visited by the
algorithm. Results for the pairwise continuous paths are depicted on the
right-hand side of Figure 11. Results are similar to the ones obtained for
continuous paths: increasing the length of the path searched implies higher
execution times. For the sake of space, figures are omitted.

Path        Dataset    Average Time (ms)   N. of Results
ATL → CLD   1 week     4031.67             1969
ATL → CLD   1 month    91449.67            39034
ATL → CLD   3 months   1060364.67          342124
ATL → CLD   6 months   4643210             391462
ATL → CLD   1 year     Out of Memory
BOS → HOU   1 week     10.67               20
BOS → HOU   1 month    83.67               89
BOS → HOU   3 months   82                  253
BOS → HOU   6 months   342.67              506
BOS → HOU   1 year     469.33              926
ATL → AUS   1 week     32.33               58
ATL → AUS   1 month    99                  252
ATL → AUS   3 months   273                 775
ATL → AUS   6 months   585.67              1667
ATL → AUS   1 year     1252.67             3154
SBN → ISP   1 week     8066                2783
SBN → ISP   1 month    270057.33           66464
SBN → ISP   3 months   3459141.33          699214
SBN → ISP   6 months   Out of Memory
SBN → ISP   1 year     Out of Memory
Table 7. Average time and number of results for the shortest path algorithm.
6.4.2 Consecutive Paths Figures 13 and 14 display the results for the tests
addressing the latest departure, fastest, earliest, and shortest path
algorithms. The figures show a linear behaviour in most cases. The y-axis is
displayed in logarithmic scale, since the difference between the running times
of the algorithms is very large, depending on the paths. As expected, the
execution time of the algorithms grows as the number of flights grows.
For the latest departure path (Figure 13(a) and Table 4), all the tests show
a linear behaviour except the one from ATL to CLD. This is because all the
arrivals at this airport are in the first half of the year, so it takes a long
time to prune the graph to find a path between those airports. This is why the
time grows abruptly and then continues with a linear behaviour: the algorithm
runs in 4098 ms for the 3-month dataset, and in 15622280.67 ms for the 6-month
one, that is, a growth of about 3800 times. For the other routes, this ratio is
between about 2.4 and 4.7. However, note that this is a very particular case.
Solving this particular case is left for an improved version of the algorithm.
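The growth ratios mentioned above can be recomputed from Table 4 with a few lines of Python. The times are the 3-month and 6-month averages reported in that table; the dictionary layout and the function name `growth_ratios` are assumptions made for this illustration.

```python
# Latest departure path, average time in ms: (3-month dataset, 6-month dataset),
# as reported in Table 4.
latest_departure_ms = {
    "ATL-CLD": (4098, 15622280.67),
    "BOS-HOU": (3952.33, 12072),
    "ATL-AUS": (7462.33, 34883),
    "SBN-ISP": (3735.33, 8829.67),
}

def growth_ratios(times):
    """Ratio between the 6-month and the 3-month execution times, per route."""
    return {route: t6 / t3 for route, (t3, t6) in times.items()}

for route, ratio in growth_ratios(latest_departure_ms).items():
    print(f"{route}: x{ratio:.1f}")
```

The ratio for ATL to CLD is three orders of magnitude larger than for the other three routes, which is the anomaly discussed in the text.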
In the case of the shortest path algorithm (Figure 14(b) and Table 7), the
algorithm ran out of memory for the largest datasets on the routes ATL to CLD
(1 year) and SBN to ISP (6 months and 1 year), due to the large number of result
paths that must be kept in memory. In the case of the two smallest airports,
there are many paths between these airports, and they have a considerable
length. Another particular situation occurs when a path starts at the beginning
of the year and ends at the end of the year. Solving these problems with paths
spanning long time periods is also left for future enhancements.
For all algorithms, with the exception of the cases of shortest and latest
departure paths mentioned above, the execution times are similar; depending
on the algorithm, some path computations delivered better performance than
others, but this does not depend on the type of airports chosen. The fastest
path algorithm (Figure 13(b) and Table 5) is the one with the lowest performance
(except for the particular cases mentioned above). For example, for the largest
dataset, for the paths from BOS to HOU and ATL to AUS, the average execution
times were 8,694,658 ms and 2,938,933 ms, respectively. On the other hand,
for the earliest path (Figure 14(a) and Table 6), these times were 33,812 ms and
6,391.67 ms; for the shortest path, 469.33 ms and 1,252.67 ms; and for the
latest departure path, 33,875 ms and 118,174 ms, respectively. However, as the
sizes of the datasets decrease, the execution times also decrease significantly.
For example, in the case of the fastest path algorithm, for the 1-month dataset,
for the paths from BOS to HOU and ATL to AUS, the average execution times were
51,980 ms and 17,640 ms, respectively.
The intuition is that the results reported may be explained by the nature of
the paths. For the earliest-arrival path, the execution time depends on the time
of the last node in the path; for the latest departure path, it strongly depends
on the time of the first node of the path; and for the shortest path, on the
length of the path. For example, in the latest departure path algorithm, once a
path reaches a node that is part of a possible latest departure path, no better
path can be reached that contains that node, because the time of the first node
cannot change. On the other hand, in the fastest path algorithm, a faster path
could still be found, because the time difference varies depending on both the
first and the last node, and there is more variety because flight times differ
between airports. In the case of the shortest path algorithm, the algorithm
explores the same node many times, increasing the execution time.
7 Conclusions and Future Research Directions
This work introduced a temporal property graph data model, and an associated
high-level temporal query language, denoted T-GQL, which supports two
kinds of temporal path semantics: continuous paths and consecutive paths. As
real-world application examples, those semantics capture the dynamics of social
network evolution, and of travel scheduling, respectively. Algorithms for path
computation for both semantics were devised and implemented. Finally,
experiments were carried out, and the results are reported and discussed in the
paper, showing the plausibility of the approach, and highlighting the main
issues that need to be addressed in future work.
Ongoing and future work includes indexing methods for the time-varying data,
aimed at enhancing the performance of both the continuous and consecutive
path algorithms. That is, current research addresses indexing continuous
and consecutive paths. For this, existing research on indexing both temporal
databases and paths in graphs is being considered.
Although the current version of the T-GQL language has powerful features, it
can, of course, be extended in many ways. Just as an example, the WHEN clause
could be improved so that it can support a path function call (and more than one
such clause could be supported). Further, the MATCH clause could be enhanced
to support more than one path, rather than a single one, as in the current
version.
Acknowledgments. Alejandro Vaisman was partially supported by Project
PICT 2017-1054, from the Argentinian Scientific Agency.
References
1. R. Angles. A Comparison of Current Graph Database Models. In Proceedings of
ICDE Workshops, pages 171–177, Arlington, VA, USA, 2012.
2. R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vrgoc.
Foundations of modern query languages for graph databases. ACM Comput. Surv.,
50(5):68:1–68:40, 2017.
3. Renzo Angles. The property graph database model. In Proceedings of the 12th
Alberto Mendelzon International Workshop on Foundations of Data Management,
Cali, Colombia, May 21-25, 2018, volume 2100 of CEUR Workshop Proceedings.
CEUR-WS.org, 2018.
4. Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM
Comput. Surv., 40(1):1:1–1:39, 2008.
5. Renzo Angles, Harsh Thakkar, and Dominik Tomaszuk. RDF and property graphs
interoperability: Status and issues. In Aidan Hogan and Tova Milo, editors,
Proceedings of the 13th Alberto Mendelzon International Workshop on Foundations
of Data Management, Asunción, Paraguay, June 3-7, 2019, volume 2369 of CEUR
Workshop Proceedings. CEUR-WS.org, 2019.
6. Andrey Balmin, Thanos Papadimitriou, and Yannis Papakonstantinou. Hypothet-
ical queries in an OLAP environment. In VLDB 2000, Proceedings of 26th In-
ternational Conference on Very Large Data Bases, September 10-14, 2000, Cairo,
Egypt, pages 220–231. Morgan Kaufmann, 2000.
7. Alexander Campos, Jorge Mozzino, and Alejandro A. Vaisman. Towards temporal
graph databases. In Reinhard Pichler and Altigran Soares da Silva, editors, Pro-
ceedings of the 10th Alberto Mendelzon International Workshop on Foundations
of Data Management, Panama City, Panama, May 8-10, 2016, volume 1644 of
CEUR Workshop Proceedings. CEUR-WS.org, 2016.
44 A. Debouvier et al.
8. C. Cattuto, A. Panisson, and M. Quaggiotto. Representing time dependent graphs
in Neo4j. https://github.com/SocioPatterns/neo4j-dynagraph/wiki/Representing-
time-dependent-graphs-in-Neo4j, 2013.
9. Ciro Cattuto, Marco Quaggiotto, Andr´e Panisson, and Alex Averbuch. Time-
varying social networks in a graph database: a neo4j use case. In First International
Workshop on Graph Data Management Experiences and Systems, GRADES 2013,
co-loated with SIGMOD/PODS 2013, New York, NY, USA, June 24, 2013, page 11,
2013.
10. Ankur Dave, Alekh Jindal, Li Erran Li, Reynold Xin, Joseph Gonzalez, and Matei
Zaharia. Graphframes: an integrated API for mixing graph and relational queries.
In Proceedings of the Fourth International Workshop on Graph Data Management
Experiences and Systems, Redwood Shores, CA, USA, June 24 - 24, 2016, page 2,
2016.
11. Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1:269–271, 1959.
12. Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lin-
daaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Martin Schuster, Petra
Selmer, and Andr´es Taylor. Formal semantics of the language cypher. CoRR,
abs/1802.09984, 2018.
13. Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lin-
daaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and
Andr´es Taylor. Cypher: An evolving query language for property graphs. In Gau-
tam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of
the 2018 International Conference on Management of Data, SIGMOD Conference
2018, Houston, TX, USA, June 10-15, 2018, pages 1433–1445. ACM, 2018.
14. Matteo Golfarelli and Stefano Rizzi. What-if simulation modeling in business
intelligence. IJDWM, 5(4):24–43, 2009.
15. Alastair Green. Gql - initiating an industry standard property graph query lan-
guage, 2018.
16. W. Han, Y. Miao, K.Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and
E. Chen. Chronos: A Graph Engine por Temporal Graph Analysis. In Eurosys,
pages 1–14, 2014.
17. O. Hartig. Reconciliation of RDF* and property graphs. CoRR, abs/1409.3288,
2014.
18. O. Hartig. Position statement: The rdf* and sparql* approach to annotate state-
ments in rdf and to reconcile rdf and property graphs, 2019.
19. Huahai He and AmbujK. Singh. Query language and access methods for graph
databases. In Managing and Mining Graph Data, volume 40 of Advances in
Database Systems, pages 125–160. Springer US, 2010.
20. Silu Huang, James Cheng, and Huanhuan Wu. Temporal graph traversals: Defini-
tions, algorithms, and applications. CoRR, abs/1401.1919, 2014.
21. Wenyu Huo and Vassilis J. Tsotras. Efficient temporal shortest path queries on
evolving social graphs. In Conference on Scientific and Statistical Database Man-
agement, SSDBM, Aalborg, Denmark, June 30 - July 02, 2014, pages 38:1–38:4,
2014.
22. U. Khurana and A. Deshpande. Efficient snapshot retrieval over historical graph
data. CoRR, arxiv:1207.5777, 2012.
23. U. Khurana and A. Deshpande. HiNGE: Enabling Temporal Analytics at Scale.
In Proceedings of SIGMOD, NY, USA, 2013.
24. V. Kostakos. Temporal graphs. CoRR, arxiv:0807.2357, 2008.
45
25. F. Rizzolo and A. Vaisman. Temporal XML: Modeling, indexing, and query pro-
cessing. VLDB Journal, 1179–1212(5):39–65, 2008.
26. I. Robinson, J. Webber, and Emil Eifr´em. Graph Databases. O’Reilly Media, 2013.
27. Konstantinos Semertzidis and Evaggelia Pitoura. Top-k durable graph pattern
queries on temporal graphs. IEEE Trans. Knowl. Data Eng., 31(1):181–194, 2019.
28. A. Tansel, J. Clifford, and S. Gadia (eds.). Temporal Databases: Theory, Design
and Implementation. Benjamin/Cummings, 1993.
29. Harsh Thakkar, Renzo Angles, Dominik Tomaszuk, and Jens Lehmann. Direct
mappings between RDF and property graph databases. CoRR, abs/1912.02127,
2019.
30. Huanhuan Wu, James Cheng, Silu Huang, Yiping Ke, Yi Lu, and Yanyan Xu. Path
problems in temporal graphs. PVLDB, 7(9):721–732, 2014.
31. Huanhuan Wu, James Cheng, Yiping Ke, Silu Huang, Yuzhen Huang, and Hejun
Wu. Efficient algorithms for temporal path computation. IEEE Trans. Knowl.
Data Eng., 28(11):2927–2942, 2016.
32. Huanhuan Wu, James Cheng, Yi Lu, Yiping Ke, Yuzhen Huang, Da Yan, and Hejun
Wu. Core decomposition in large temporal graphs. In 2015 IEEE International
Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 -
November 1, 2015, pages 649–658, 2015.
33. Huanhuan Wu, Yuzhen Huang, James Cheng, Jinfeng Li, and Yiping Ke. Reacha-
bility and time-based path queries in temporal graphs. In 32nd IEEE International
Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016,
pages 145–156, 2016.
34. Yi Yang, Da Yan, Huanhuan Wu, James Cheng, Shuigeng Zhou, and John C. S. Lui.
Diversified temporal subgraph pattern mining. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San
Francisco, CA, USA, August 13-17, 2016, pages 1965–1974, 2016.
A Appendix A
Characteristics of the data sets
Airport  Dataset    Departing Flights  Arriving Flights
-------  ---------  -----------------  ----------------
ATL      1 week                  6707              6678
         1 month                29512             29492
         3 months               89632             89633
         6 months              186135            186180
         1 year                346836            346904
CLD      1 week                    44                44
         1 month                  204               204
         3 months                 601               601
         6 months                 641               640
         1 year                   641               640
BOS      1 week                  1943              1953
         1 month                 8837              8841
         3 months               27188             27204
         6 months               57973             57996
         1 year                107847            107851
HOU      1 week                  1105              1106
         1 month                 4650              4651
         3 months               13628             13628
         6 months               27972             27972
         1 year                 52042             52041
AUS      1 week                   796               797
         1 month                 3376              3372
         3 months               10182             10186
         6 months               21941             21950
         1 year                 42067             42078
SBN      1 week                    79                80
         1 month                  384               386
         3 months                1246              1248
         6 months                2452              2455
         1 year                  4454              4452
ISP      1 week                    88                89
         1 month                  377               378
         3 months                1162              1163
         6 months                2462              2463
         1 year                  4392              4392
Table 8. Number of incoming and outgoing flights for each airport.