Efficient Snapshot Retrieval over Historical Graph Data
Udayan Khurana
University of Maryland, College Park
udayan@cs.umd.edu
Amol Deshpande
University of Maryland, College Park
amol@cs.umd.edu
ABSTRACT
We address the problem of managing historical data for large evolv-
ing information networks like social networks or citation networks,
with the goal to enable temporal and evolutionary queries and anal-
ysis. We present the design and architecture of a distributed graph
database system that stores the entire history of a network and pro-
vides support for efficient retrieval of multiple graphs from arbi-
trary time points in the past, in addition to maintaining the current
state for ongoing updates. Our system exposes a general program-
matic API to process and analyze the retrieved snapshots. We in-
troduce DeltaGraph, a novel, extensible, highly tunable, and dis-
tributed hierarchical index structure that enables compactly record-
ing the historical information, and that supports efficient retrieval
of historical graph snapshots for single-site or parallel processing.
Along with the original graph data, DeltaGraph can also maintain
and index auxiliary information; this functionality can be used to
extend the structure to efficiently execute queries like subgraph pat-
tern matching over historical data. We develop analytical models
for both the storage space needed and the snapshot retrieval times
to aid in choosing the right parameters for a specific scenario. In
addition, we present strategies for materializing portions of the
historical graph state in memory to further speed up the retrieval
process. Secondly, we present an in-memory graph data structure
called GraphPool that can maintain hundreds of historical graph
instances in main memory in a non-redundant manner. We present
a comprehensive experimental evaluation that illustrates the effec-
tiveness of our proposed techniques at managing historical graph
information.
1. INTRODUCTION
In recent years, we have witnessed an increasing abundance of
observational data describing various types of information networks,
including social networks, biological networks, citation networks,
financial transaction networks, communication networks, to name a
few. There is much work on analyzing such networks to understand
various social and natural phenomena like: “how the entities in a
network interact”, “how information spreads”, “what are the most
important (central) entities”, and “what are the key building blocks
of a network”. With the increasing availability of the digital trace of
such networks over time, the topic of network analysis has naturally
extended its scope to temporal analysis of networks, which has the
potential to lend much better insights into various phenomena, es-
pecially those relating to the temporal or evolutionary aspects of
the network. For example, we may want to know: “which ana-
lytical model best captures network evolution”, “how information
spreads over time”, “who are the people with the steepest increase
in centrality measures over a period of time”, “what is the aver-
age monthly density of the network since 1997”, “how the clusters
in the network evolve over time”, etc. Historical queries like “who
had the highest PageRank centrality in a citation network in 1960”,
“which year amongst 2001 and 2004 had the smallest network di-
ameter”, “how many new triangles have been formed in the net-
work over the last year”, also involve the temporal aspect of the
network. More generally a network analyst may want to process
the historical trace of a network in different, usually unpredictable,
ways to gain insights into various phenomena. There is also interest
in visualizations of such networks over time [2].
To support a broad range of network analysis tasks, we require
a graph1 data management system at the backend capable of low-
cost storage and efficient retrieval of the historical network infor-
mation, in addition to maintaining the current state of the network
for updates and other queries, temporal or otherwise. However, the
existing solutions for graph data management lack adequate tech-
niques for temporal annotation, or for storage and retrieval of large
scale historical changes on the graph. In this paper, we present the
design of a graph data management system that we are building to
provide support for executing temporal analysis queries and histor-
ical queries over large-scale evolving information networks.
Our focus in this paper is on supporting snapshot queries where
the goal is to retrieve in memory one or more historical snapshots
of the information network as of specified time points. The typi-
cally unpredictable and non-declarative nature of network analysis
makes this perhaps the most important type of query that needs
to be supported by such a system. Furthermore, the snapshot re-
trieval times must be sufficiently low so that the analysis can be
performed in an interactive manner (this is especially vital for visu-
alization tools). There is a large body of work in temporal relational
databases that has attempted to address some of these challenges
for relational data (we discuss the related work in more detail in
the next section). However our primary focus on efficiently re-
trieving possibly 100’s of snapshots in memory, while maintaining
the current state of the database for ongoing updates and queries,
required us to rethink several key design decisions and data struc-
tures in such a system. We assume there is enough memory to hold
the retrieved snapshots in memory (we discuss below how we ex-
ploit overlap in the retrieved snapshots to minimize the memory
requirements); we allow the snapshots to be retrieved in a parti-
tioned fashion across a set of machines in parallel to handle very
large scale networks. This design decision was motivated by both
the current hardware trends and the fact that, most network analysis
tasks tend to access the underlying network in unpredictable ways,
leading to unacceptably high penalties if the data does not fit in
memory. Most current large-scale graph analysis systems, includ-
ing Pregel [18], Giraph2, Trinity [29], Cassovary (Twitter graph
library)3, Pegasus [13], load the entire graph into memory prior to
1We use the terms graph and network interchangeably in this paper.
2http://giraph.apache.org
3https://github.com/twitter/cassovary
[Figure 1 graphics: timepoints ti, tj, tk (left); plot of Rank (according to PageRank) vs. Year, 2000-2004 (right). Caption follows.]
Figure 1: Dynamic network analysis (e.g., understanding how
“communities” evolve in a social network, how centrality scores
change in co-authorship networks, etc.) can lend important in-
sights into social, cultural, and natural phenomena. The right
plot was constructed using our system over the DBLP network,
and shows the evolution of the nodes ranked in top 25 in 2004.
execution.
The cornerstone of our system is a novel hierarchical index struc-
ture, called DeltaGraph, over the historical trace of a network. A
DeltaGraph is a rooted, directed graph whose lowest level corre-
sponds to the snapshots of the network over time (that are not ex-
plicitly stored), and the interior nodes correspond to graphs con-
structed by combining the lower level graphs (these are typically
not valid graphs as of any specific time point). The information
stored with each edge, called edge deltas, is sufficient to construct
the graph corresponding to the target node from the graph corre-
sponding to the source node, and thus a specific snapshot can be
created by traversing any path from the root to the snapshot. While
conceptually simple, DeltaGraph is a very powerful, extensible,
general, and tunable index structure that enables trading off the
different resources and user requirements as per a specific applica-
tion’s need. By appropriately tuning the DeltaGraph construction
parameters, we can trade decreased snapshot retrieval times for in-
creased disk storage requirements. One key parameter of the Delt-
aGraph enables us to control the distribution of average snapshot
retrieval times over history. Portions of the DeltaGraph can be pre-
fetched and materialized, allowing us to trade increased memory
utilization for reduced query times. Many of these decisions can
be made at run-time, enabling us to adaptively respond to chang-
ing resource characteristics or user requirements. One immediate
consequence is also that we can optimally use the “current graph”
that is in memory at all times to reduce the snapshot retrieval times.
DeltaGraph is highly extensible, providing a user the opportunity
to define additional indexes to be stored on edge deltas in order to
perform specific operations more efficiently. Finally, DeltaGraph
utilizes several other optimizations, including a column-oriented
storage to minimize the amount of data that needs to be fetched
to answer a query, and multi-query optimization to simultaneously
retrieve many snapshots.
DeltaGraph naturally enables distributed storage and processing
to scale to very large graphs. The edge deltas can be stored in a
distributed fashion through use of horizontal partitioning, and the
historical snapshots can be loaded in parallel onto a set of machines
in a partitioned fashion; in general, the two partitionings need not
be aligned, but for computational efficiency, we currently require
that they be aligned. Horizontal partitioning also results in lower
snapshot retrieval latencies since the different deltas needed for re-
construction can be fetched in parallel.
The second key component of our system is an in-memory data
structure called GraphPool. A typical network evolution query
may require analyzing 100’s of snapshots from the history of a
graph. Maintaining these snapshots in memory independently of
each other would likely be infeasible. The GraphPool data structure
exploits the commonalities in the snapshots that are currently in
memory, by overlaying them on a single graph data structure (typi-
cally a union of all the snapshots in memory). GraphPool also em-
ploys several optimizations to minimize the amount of work needed
to incorporate a new snapshot and to clean up when a snapshot is
purged after the analysis has completed. We have implemented the
GraphPool as a completely in-memory structure because of the in-
efficiency of general graph algorithms on a disk-resident/buffered-
memory storage scheme. However, the design of our system itself
enforces no such restriction on GraphPool. We address the general
problem of scalability through horizontal partitioning, as explained
later.
We have built a prototype implementation of our system in Java,
using the Kyoto Cabinet4 disk-based key-value store as the back-
end engine to store the DeltaGraph; in the distributed case, we run
one instance on each machine. Our design decision to use a key-
value store at the back-end was motivated by the flexibility, the
fast retrieval times, and the scalability afforded by such systems;
since we only require a simple get/put interface from the storage
engine, we can easily plug in other cloud-based, distributed key-
value stores like HBase5 or Cassandra [15]. Our comprehensive
experimental evaluation shows that our system can retrieve histori-
cal snapshots containing up to millions of nodes and edges in sev-
eral 100’s of milliseconds or less, often an order of magnitude faster
than prior techniques like interval trees, and that the execution time
penalties of our in-memory data structure are minimal.
Finally, we note that our proposed techniques are general and
can be used for efficient snapshot retrieval in temporal relational
databases as well. In fact, both the DeltaGraph and the GraphPool
data structures treat the network as a collection of objects and do
not exploit any properties of the graphical structure of the data.
Outline: We begin with a discussion of the prior work (Section 2).
We then discuss the key components of the system, the data model,
and present the high level system architecture (Section 3). Then, we
describe the DeltaGraph structure in detail (Section 4), and develop
analytical models for the storage space and snapshot access times
(Section 5). We then briefly discuss GraphPool (Section 6). Finally,
we present the results of our experimental evaluation (Section 7).
2. RELATED WORK
There has been an increasing interest in dynamic network anal-
ysis over the last decade, fueled by the increasing availability of
large volumes of temporally annotated network data. Many works
have focused on designing analytical models that capture how a
network evolves, with a primary focus on social networks and the
Web (see, e.g., [1, 17, 14]). There is also much work on under-
standing how communities evolve, identifying key individuals, and
locating hidden groups in dynamic networks. Berger-Wolf et al. [5,
35], Tang et al. [33] and Greene et al. [11] address the problem
of community evolution in dynamic networks. McCulloh and Car-
ley [20] present techniques for social change detection. Asur et
al. [4] present a framework for characterizing the complex behav-
ioral patterns of individuals and communities over time. In a recent
4http://fallabs.com/kyotocabinet
5http://hbase.apache.org
work, Ahn et al. [2] present an exhaustive taxonomy of temporal vi-
sualization tasks. Biologists are interested in discovering historical
events leading to a known state of a biological network (e.g., [22]).
Ren et al. [25] analyze evolution of shortest paths between a pair
of vertices over a set of snapshots from the history. Our goal in
this work is to build a graph data management system that can effi-
ciently and scalably support these types of dynamic network anal-
ysis tasks over large volumes of data in real-time.
There is a vast body of literature on temporal relational databases,
starting with the early work in the 80’s on developing temporal data
models and temporal query languages. We won’t attempt to present
an exhaustive survey of that work, but instead refer the reader to sev-
eral surveys and books on this topic [7, 31, 23, 34, 9, 32, 27]. The
most basic concepts that a relational temporal database is based
upon are valid time and transaction time, considered orthogonal
to each other. Valid time denotes the time period during which a
fact is true with respect to the real world. Transaction time is the
time when a fact is stored in the database. A valid-time tempo-
ral database permits correction of a previously incorrectly stored
fact [31], unlike transaction-time databases where an inquiry into
the past may yield the previously known, perhaps incorrect version
of a fact. Under this nomenclature, our data management system is
based on valid time – for us the time the information was entered
in the database is not critical, but our focus is on the time period
when the information is true.
From a querying perspective, both valid-time and transaction-
time databases can be treated as simply collections of intervals [27],
however indexing structures that assume transaction times can of-
ten be simpler since they don’t need to support arbitrary inserts or
deletes into the index. Salzberg and Tsotras [27] present a compre-
hensive survey of indexing structures for temporal databases. They
also present a classification of different queries that one may ask
over a temporal database. Under their notation, our focus in this
work is on the valid timeslice query, where the goal is to retrieve
all the entities and their attribute values that are valid as of a spe-
cific time point. We discuss the related work on snapshot retrieval
queries in more detail in Section 4.1.
There has been a resurgence of interest in general-purpose graph
data management systems in both academia and industry. Several
commercial and open-source graph management systems are be-
ing actively developed (e.g., Neo4j6, GBase7, Pregel [18], Giraph,
Trinity [29], Cassovary, Pegasus [13], GPS [26]). There is much
ongoing work on efficient techniques for answering various types
of queries over graphs and on building indexing structures for them.
However, we are not aware of any graph data management system
that focuses on optimizing snapshot retrieval queries over historical
traces, and on supporting rich temporal analysis of large networks.
There is also prior work on temporal RDF data and temporal
XML Data. Motik [21] presents a logic-based approach to rep-
resenting valid time in RDF and OWL. Several works (e.g., [24,
36]) have considered the problems of subgraph pattern matching or
SPARQL query evaluation over temporally annotated RDF data.
There is also much work on version management in XML data
stores. Most scientific datasets are semistructured in nature and can
be effectively represented in XML format [8]. Lam and Wong [16]
use complete deltas, which can be traversed in either direction of
time for efficient retrieval. Other systems store the current version
as a snapshot and the historical versions as deltas from the cur-
rent version [19]. For such a system, the deltas only need to be
unidirectional. Ghandeharizadeh et al. [10] provide a formalism
6http://www.neo4j.org
7http://www.graphbase.net
on deltas, which includes a delta arithmetic. All these approaches
assume unique node identifiers to merge deltas with deltas or snap-
shots. Buneman et al. [8] propose merging all the versions of the
database into one single hierarchical data structure for efficient re-
trieval. In a recent work, Seering et al. [28] presented a disk based
versioning system using efficient delta encoding to minimize space
consumption and retrieval time in array-based systems. However,
none of that prior work focuses on snapshot retrieval in general
graph databases, or proposes techniques that can flexibly exploit
the memory-resident information.
3. SYSTEM OVERVIEW
We begin with briefly describing our graph data model and cate-
gorizing the types of changes that it may capture. We then discuss
different types of snapshot retrieval queries that we support in our
system, followed by the key components of the system architecture.
3.1 Graph Data Model
The most basic model of a graph over a period of time is as a
collection of graph snapshots, one corresponding to each time in-
stance (we assume discrete time). Each such graph snapshot con-
tains a set of nodes and a set of edges. The nodes and edges are
assigned unique ids at the time of their creation, which are not re-
assigned after deletion of the components (a deletion followed by a
re-insertion results in assignment of a new id). A node or an edge
may be associated with a list of attribute-value pairs; the list of at-
tribute names is not fixed a priori and new attributes may be added
at any time. Additionally an edge contains the information about
whether it is a directed edge or an undirected edge.
We define an event as the record of an atomic activity in the net-
work. An event could pertain to either the creation or deletion of
an edge or node, or change in an attribute value of a node or an
edge. Alternatively, an event can express the occurrence of a tran-
sient edge or node that is valid only for that time instance instead
of an interval (e.g., a “message” from a node to another node). Be-
ing atomic refers to the fact that the activity can not be logically
broken down further into smaller events. Hence, an event always
corresponds to a single timepoint. So, the valid time interval of an
edge, [ts, te], is expressed by two different events, edge addition
and deletion events at ts and te respectively. The exact contents of
an event depend on the event type; below we show examples of a
new edge event (NE), and an update node attribute event (UNA).
(a) {NE, N:23, N:4590, directed:no, 11/29/03 10:10}
(b) {UNA, N:23, ‘job’, old:‘..’, new:‘..’, 11/29/07 17:00}
We treat events as bidirectional, i.e., they could be applied to a
database snapshot in either direction of time. For example, say that
at times t_{k-1} and t_k, the graph snapshots are G_{k-1} and G_k respec-
tively. If E is the set of all events at time t_k, we have that:
G_k = G_{k-1} + E,   G_{k-1} = G_k − E
where the + and − operators denote application of the events in E
in the forward and the backward direction.
in the direction of evolving time, i.e., going ahead in time. A list of
chronologically organized events is called an eventlist.
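To make the event model concrete, the following is a minimal Java sketch of events and of applying an eventlist in either direction of time; the class and method names (Event, GraphSnapshot, apply) are illustrative rather than the actual system API.

import java.util.List;

// Illustrative event types; transient events (e.g., messages) carry no validity interval.
enum EventType { NEW_NODE, DELETE_NODE, NEW_EDGE, DELETE_EDGE, UPDATE_NODE_ATTR, UPDATE_EDGE_ATTR }

class Event {
    EventType type;
    long timestamp;
    long nodeId1, nodeId2;          // nodeId2 is unused for node events
    String attrName, oldValue, newValue;

    // Applying an event in the backward direction simply means applying its inverse.
    Event inverse() {
        Event e = new Event();
        e.timestamp = timestamp;
        switch (type) {
            case NEW_EDGE:    e.type = EventType.DELETE_EDGE; break;
            case DELETE_EDGE: e.type = EventType.NEW_EDGE;    break;
            case NEW_NODE:    e.type = EventType.DELETE_NODE; break;
            case DELETE_NODE: e.type = EventType.NEW_NODE;    break;
            default:          e.type = type;                  // attribute update: swap old/new below
        }
        e.nodeId1 = nodeId1; e.nodeId2 = nodeId2;
        e.attrName = attrName; e.oldValue = newValue; e.newValue = oldValue;
        return e;
    }
}

class GraphSnapshot {
    void apply(Event e) { /* mutate the in-memory graph according to e */ }

    // G_k = G_{k-1} + E : apply the events forward, in chronological order.
    void applyForward(List<Event> eventlist) {
        for (Event e : eventlist) apply(e);
    }

    // G_{k-1} = G_k − E : apply the inverses in reverse chronological order.
    void applyBackward(List<Event> eventlist) {
        for (int i = eventlist.size() - 1; i >= 0; i--) apply(eventlist.get(i).inverse());
    }
}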
3.2 System Overview
Figure 2 shows a high level overview of our system and its key
components. At a high level, there are multiple ways that a user or
an application may interact with a historical graph database. Given
the wide variety of network analysis or visualization tasks that are
commonly executed against an information network, we expect a
large fraction of these interactions will be through a programmatic
[Figure 2 diagram: an Analyst and Social Network Analysis Software interact with the System through the QueryManager (translates user queries into graph retrieval and executes algorithms on graphs), which works with the GraphManager (manages the GraphPool, overlaying historical graphs and cleanup; maintains an Active Graph Pool Table of {Query, Time, Bit, Graph} entries) and the HistoryManager (manages the DeltaGraph: query planning and disk reads/writes against a Key-Value Store).]
Figure 2: Architecture of our system: our focus in this work is
on the components below the dashed line.
API where the user or the application programmer writes her own
code to operate on the graph (as shown in the figure). Such interac-
tions result in what we call snapshot queries being executed against
the database system. Executing such queries is the primary focus
of this paper, and we further discuss these types of queries below.
In ongoing work, we are also working on developing a high-level
declarative query language (similar to TSQL [32]) and query pro-
cessing techniques to execute such queries against our database. As
a concrete example, an analyst who may have designed a new net-
work evolution model and wants to see how it fits the observed data,
may want to retrieve a set of historical snapshots and process them
using the programmatic API. On the other hand, a declarative query
language may better fit the needs of a user interested in searching
for a temporal pattern (e.g., find nodes that had the fastest growth
in the number of neighbors since joining the network).
Next, we briefly discuss snapshot queries and the key compo-
nents of the system.
3.2.1 Snapshot Queries
We differentiate between a singlepoint snapshot query and a mul-
tipoint snapshot query. An example of the first query is: “Retrieve
the graph as of January 2, 1995”. On the other hand, a multipoint
snapshot query requires us to simultaneously retrieve multiple his-
torical snapshots (e.g., “Retrieve the graphs as of every Sunday be-
tween 1994 and 2004”). We also support more complex snapshot
queries where a TimeExpression or a time interval is specified in-
stead. Any snapshot query can specify whether it requires only the
structure of the graph, or a specified subset of the node or edge
attributes, or all attributes.
Specifically, the following is a list of some of the retrieval func-
tions that we support in our programmatic API.
GetHistGraph(Time t, String attr_options): In the basic single-
point graph retrieval call, the first parameter indicates the time;
the second parameter indicates the attribute information to be
fetched as a string formed by concatenating sub-options listed
in Table 1. For example, to specify that all node attributes except
salary, and the edge attribute name, should be fetched, we would
use: attr_options = “+node:all-node:salary+edge:name”.

Table 1: Options for node attribute retrieval. Similar options
exist for edge attribute retrieval.
  Option                 Explanation
  -node:all (default)    None of the node attributes
  +node:all              All node attributes
  +node:attr1            Node attribute named “attr1”; overrides “-node:all” for that attribute
  -node:attr1            Node attribute named “attr1”; overrides “+node:all” for that attribute
GetHistGraphs(List<Time> t_list, String attr_options), where
t_list specifies a list of time points.
GetHistGraph(TimeExpression tex, String attr_options): This is
used to retrieve a hypothetical graph using a multinomial Boolean
expression over time points. For example, the expression (t1 ∧
¬t2) specifies the components of the graph that were valid at
time t1 but not at time t2. The TimeExpression data struc-
ture consists of a list of k time points, {t1, t2, . . . , tk}, and a
Boolean expression over them.
GetHistGraphInterval(Time ts, Time te, String attr_options):
This is used to retrieve a graph over all the elements that were
added during the time interval [ts, te). It also fetches the tran-
sient events, not fetched (by definition) by the above calls.
The (Java) code snippet below shows an example program that re-
trieves several graphs, and operates upon them.
/* Loading the index */
GraphManager gm = new GraphManager(...);
gm.loadDeltaGraphIndex(...);
...
/* Retrieve the historical graph structure along with node names as of
Jan 2, 1985 */
HistGraph h1 = gm.getHistGraph("1/2/1985", "+node:name");
...
/* Traversing the graph */
List<HistNode> nodes = h1.getNodes();
List<HistNode> neighborList = nodes.get(0).getNeighbors();
HistEdge ed = h1.getEdgeObj(nodes.get(0), neighborList.get(0));
...
/* Retrieve the historical graph structure alone on Jan 2, 1986 and Jan
2, 1987 */
listOfDates.add("1/2/1986");
listOfDates.add("1/2/1987");
List<HistGraph> graphs = gm.getHistGraphs(listOfDates, "");
...
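A TimeExpression retrieval, continuing the example above, might look as follows; the TimeExpression constructor shown is hypothetical, since the text only specifies that the structure holds a list of time points and a Boolean expression over them (java.util.Arrays is assumed to be imported).

/* Components valid on Jan 2, 1986 but not on Jan 2, 1987, i.e., (t1 ∧ ¬t2);
   the constructor syntax below is illustrative only */
TimeExpression tex = new TimeExpression(
    Arrays.asList("1/2/1986", "1/2/1987"), /* t1, t2 */
    "t1 AND NOT t2");
HistGraph diff = gm.getHistGraph(tex, "");
...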
Eventually, our goal is to support Blueprints, a collection of inter-
faces analogous to JDBC but for graph data (we currently support
a subset). Blueprints is a generic graph Java API that already binds
to various graph database backends (e.g., Neo4j), and many graph
processing and programming frameworks are built on top of it (e.g.,
Gremlin, a graph traversal language8; Furnace, a graph algorithms
package9; etc.). By supporting the Blueprints API, we immediately
enable use of many of these already existing toolkits.
3.2.2 Key Components
There are two key data structure components of our system.
1. GraphPool is an in-memory data structure that can store multi-
ple graphs together in a compact way by overlaying the graphs
8http://github.com/tinkerpop/gremlin/wiki
9http://github.com/tinkerpop/furnace/wiki
on top of each other. At any time, the GraphPool contains: (1)
the current graph that reflects the current state of the network,
(2) the historical snapshots, retrieved from the past using the
commands above and possibly modified by an application pro-
gram, and (3) materialized graphs, which are graphs that corre-
spond to interior or leaf nodes in the DeltaGraph, but may not cor-
respond to any valid graph snapshot (Section 4.5). GraphPool
exploits redundancy amongst the different graph snapshots that
need to be retrieved, and considerably reduces the memory re-
quirements for historical queries. We discuss GraphPool in de-
tail in Section 6.
2. DeltaGraph is an index structure that stores the historical net-
work data using a hierarchical index structure over deltas and
leaf-level eventlists (called leaf-eventlists). To execute a snap-
shot retrieval query, a set of appropriate deltas and leaf-eventlists
is fetched and the resulting graph snapshot is overlaid on the
existing set of graphs in the GraphPool. The structure of the
DeltaGraph itself, called DeltaGraph skeleton, is maintained
as a weighted graph in memory (it contains statistics about the
deltas and eventlists, but not the actual data). The skeleton is
used during query planning to choose the optimal set of deltas
and eventlists for a given query. We describe DeltaGraph in
detail in the next section.
The data structures are managed and maintained by several sys-
tem components. HistoryManager deals with the construction of
the DeltaGraph, plans how to execute a singlepoint or a multipoint
snapshot query, and reads the required deltas and eventlists from the
disk. GraphManager is responsible for managing the GraphPool
data structure, including the overlaying of deltas and eventlists, bit
assignment, and post-query clean up. Finally, the QueryManager
manages the interface with the user or the application program, and
extracts a snapshot query to be executed against the DeltaGraph.
One of its functions is to translate any explicit references (e.g. user-
id) from the query to the corresponding internal-id and vice-versa
for the final result, using a lookup table. As discussed earlier, such
a component is usually highly application-specific, and we do not
discuss it further in this paper.
In a distributed deployment, DeltaGraph and GraphPool are both
partitioned across a set of machines by partitioning the node ID
space, and assigning each partition to a separate machine (Sec-
tion 4.6). The partitioning used for storage can be different from
that used for retrieval and processing; however, for minimizing
wasted network communication, it would be ideal for the two par-
titionings to be aligned so that multiple DeltaGraph partitions may
correspond to a single GraphPool partition, but not vice versa. Snap-
shot retrieval on each machine is independent of the others, and
requires no network communication among those. Once the snap-
shots are loaded into the GraphPool, any distributed programming
framework can be used on top; we have implemented an iterative
vertex-based message-passing system analogous to Pregel [18].
For clarity, we assume a single-site deployment (i.e., no horizon-
tal partitioning) in most of the description that follows.
4. DELTAGRAPH: PHYSICAL STORAGE OF
HISTORICAL GRAPH DATA
We begin with discussing previously proposed techniques for
supporting snapshot queries, and why they do not meet our needs.
We then present details of our proposed DeltaGraph data structure.
4.1 Prior Techniques and Limitations
An optimal solution to answering snapshot retrieval queries is
the external interval tree, presented by Arge and Vitter [3]. Their
[Figure 3 diagrams: (a) a DeltaGraph with leaves S1, S2, S3, S4 connected by leaf-eventlists E1, E2, E3, interior nodes S5 = f(S1, S2) and S6 = f(S3, S4), root S7 = f(S5, S6), and super-root S8; edges carry deltas such as ∆(S1, S5), ∆(S5, S7), ∆(S7, S8). (b) the same leaves under two hierarchies built with different differential functions: S5, S6, S7 (Root1) using f1, and S9, S10, S11 (Root2) using f2, sharing a common super-root S8.]
Figure 3: DeltaGraphs with 4 leaves, leaf-eventlist size L, arity
2. ∆(Si, Sj) denotes the delta needed to construct Si from Sj.
proposed index structure uses optimal space on disk, and supports
updates in optimal (logarithmic) time. Segment trees [6] can also
be used to solve this problem, but may store some intervals in a
duplicated manner and hence use more space. Tsotras and Kange-
laris [37] present snapshot index, an I/O optimal solution to the
problem for transaction-time databases. Salzberg and Tsotras [27]
also discuss two extreme approaches to supporting snapshot re-
trieval queries, called Copy and Log approaches. In the Copy
approach, a snapshot of the database is stored at each transaction
state, the primary benefit being fast retrieval times; however the
space requirements make this approach infeasible in practice. The
other extreme approach is the Log approach, where only and all
the changes are recorded to the database, annotated by time. While
this approach is space-optimal and supports O(1)-time updates (for
transaction-time databases), answering a query may require scan-
ning the entire list of changes and takes prohibitive amount of time.
A mix of those two approaches, called Copy+Log, where a subset
of the snapshots are explicitly stored, is often a better idea.
We found these (and other prior) approaches to be insufficient
and inflexible for our needs for several reasons. First, they do
not efficiently support multipoint queries, which we expect to be very
commonly used in evolutionary analysis and which need to be optimized
by avoiding duplicate reads and repeated processing of the events.
Second, to cater to the needs of a variety of different applications,
we need the index structure to be highly tunable, and to allow trad-
ing off different resources and user requirements (including mem-
ory, disk usage, and query latencies). Ideally we would also like
to control the distribution of average snapshot retrieval times over
the history, i.e., we should be able to reduce the retrieval times for
more recent snapshots at the expense of increasing it for the older
snapshots (while keeping the utilization of the other resources the
same), or vice-versa. For achieving low latencies, the index struc-
ture should support flexible pre-fetching of portions of the index
into memory and should avoid processing any events that are not
needed by the query (e.g., if only the network structure is needed,
then we should not have to process any events pertaining to the
node or edge attributes). Finally, we would like the index structure
to be able to support different persistent storage options, ranging
from a hard disk to the cloud; most of the previously proposed in-
dex structures are optimized primarily for disks.
4.2 DeltaGraph Overview
Our proposed index data structure, DeltaGraph, is a directed
graphical structure that is largely hierarchical, with the lowest level
of the structure corresponding to equi-spaced historical snapshots
of the network (equal spacing is not a requirement, but simplifies
analysis). Figure 3(a) shows a simple DeltaGraph, where the nodes
S1, . . . , S4 correspond to four historical snapshots of the graph,
spaced L events apart. We call these nodes leaves, even though
there are bidirectional edges between these nodes as shown in the
figure. The interior nodes of the DeltaGraph correspond to graphs
that are constructed from their children by applying what we call a
differential function, denoted f(). For an interior node Sp with
children Sc1, . . . , Sck,10 we have that Sp = f(Sc1, . . . , Sck). The
simplest differential function is perhaps the Intersection function.
We discuss other differential functions in the next section.
The graphs Sp are not explicitly stored in the DeltaGraph. Rather
we only store the delta information with the edges. Specifically, the
directed edge from Sp to Sci is associated with a delta ∆(Sci, Sp)
which allows construction of Sci from Sp. It contains the elements
that should be deleted from Sp (i.e., Sp − Sci) and those that should
be added to Sp (i.e., Sci − Sp). The bidirectional edges among
the leaves also store similar deltas; here the deltas are simply the
eventlists (denoted E1, E2, E3 in Figure 3), called leaf-eventlists.
For a leaf-eventlist E, we denote by [Estart, Eend) the time inter-
val that it corresponds to. For convenience, we add a special root
node, called super-root, at the top of the DeltaGraph that is asso-
ciated with a null graph (S8 in Figure 3); we refer to the children
of the super-root as roots.
A DeltaGraph can simultaneously have multiple hierarchies that
use different differential functions (Figure 3(b)); this can be used to
improve query latencies at the expense of higher space requirements.
The deltas and the leaf-eventlists are given unique ids in the Delt-
aGraph structure, and are stored in a columnar fashion, by sepa-
rating out the structure information from the attribute information.
For simplicity, we assume here a separation of a delta into three
components: (1) δstruct, (2) δnodeattr, and (3) δedgeattr. For
a leaf-eventlist E, we have an additional component, Etransient,
where the transient events are stored.
Finally, the deltas and the eventlists are stored in a persistent, dis-
tributed key-value store, the key being ⟨partition_id, delta_id, c⟩,
where c ∈ {δstruct, δnodeattr, . . . , Etransient} specifies which of
the components is being fetched or stored, and partition_id speci-
10We abuse the notation somewhat to let Sp denote both the interior
node and the graph corresponding to it.
[Figure 4 diagrams: (a) a singlepoint query at t1, showing a plan that reaches virtual node St1 from the super-root S8 via S7, S5, and S1, with edge costs such as c(∆(S7, S8)), c(∆(S5, S7)), c(∆(S1, S5)), c(E1), and c(S1, t1); (b) a multipoint query {t1, t2, t3}, showing a Steiner tree connecting S8 to the virtual nodes St1, St2, St3.]
Figure 4: Example plans for singlepoint and multipoint re-
trieval on the DeltaGraph shown in Figure 3(a).
fies the partition. Each delta or eventlist is typically partitioned and
stored in a distributed manner. Based on the node id of the con-
cerned node(s), each event, edge, node and attribute is designated
a partition, using a hash function hp, such that partition_id =
hp(node_id). In a setup with k distributed storage units, all the
deltas and eventlists are likely to have k partitions each.
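As an illustration, the following is a minimal sketch of how such keys might be formed and partitioned on top of a generic get/put key-value interface; the names and the modulo-based hash are assumptions, not the system's actual code.

// Illustrative delta/eventlist component tags (structure, node attributes, edge attributes, transient events).
enum DeltaComponent { STRUCT, NODEATTR, EDGEATTR, TRANSIENT }

class DeltaKeys {
    private final int numPartitions;                  // number of distributed storage units
    DeltaKeys(int numPartitions) { this.numPartitions = numPartitions; }

    // partition_id = h_p(node_id); any uniform hash works, modulo is used here for brevity.
    int partitionOf(long nodeId) {
        return (int) Math.floorMod(nodeId, (long) numPartitions);
    }

    // Key <partition_id, delta_id, component> flattened into a string for the key-value store.
    String key(int partitionId, long deltaId, DeltaComponent c) {
        return partitionId + ":" + deltaId + ":" + c.name();
    }
}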
4.3 Singlepoint Snapshot Queries
Given a singlepoint snapshot query at time t1, there are many
ways to answer it from the DeltaGraph. Let E denote the leaf-
eventlist such that t1 ∈ [Estart, Eend) (found through a binary
search at the leaf level). Any (directed) path from the super-root
to the two leaves adjacent to Erepresents a valid solution to the
query. Hence we can find the optimal solution by finding the path
with the lowest weight, where the weight of an edge captures the
cost of reading the associated delta (or the required subset of it),
and applying it to the graph constructed so far. We approximate this
cost by using the size of the delta retrieved as the weight. Note that,
each edge is associated with three or four weights, corresponding
to different attr_options. In the distributed case, we have a set of
weights for each partition. We also add a new virtual node (node
St1 in Figure 4(a)), and add edges to it from the adjacent leaves as
shown in the figure. The weights associated with these two edges
are set by estimating the portion of the leaf-eventlist E that must
be processed to construct St1 from those leaves.
We use the standard Dijkstra’s shortest path algorithm to find the
optimal solution for a specific singlepoint query, using the appro-
priate weights. Although this algorithm requires us to traverse the
entire DeltaGraph skeleton for every query, it is needed to handle
the continuously changing DeltaGraph skeleton, especially in re-
sponse to memory materialization (discussed below). Second, the
weights associated with the edges are different for different queries
and the weights are also highly skewed, so the shortest paths can
be quite different for the same timepoint for different attr options.
Further, the sizes of the DeltaGraph skeletons (i.e., the number of
nodes and edges) are usually small, even for very large histori-
cal traces, and the running time of the shortest path algorithm is
dwarfed by the cost of retrieving the deltas from the persistent stor-
age. In ongoing work, we are working on developing an algorithm
based on incrementally maintaining single source shortest paths to
handle very large DeltaGraph skeletons.
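For illustration, the plan computation can be sketched as a plain Dijkstra traversal of the skeleton, with edge weights chosen to match the requested attributes and partitions; the classes below are simplified assumptions, not our actual data structures.

import java.util.*;

class SkeletonNode {
    long id;
    // outgoing skeleton edges: child node -> estimated cost of fetching and applying its delta
    Map<SkeletonNode, Double> outEdges = new HashMap<>();
}

class SinglepointPlanner {
    private static class Entry {
        final SkeletonNode node; final double cost;
        Entry(SkeletonNode node, double cost) { this.node = node; this.cost = cost; }
    }

    // Dijkstra over the DeltaGraph skeleton starting from the super-root (cost 0).
    // Returns the minimum estimated retrieval cost of every skeleton node; recording
    // predecessors (omitted for brevity) yields the actual deltas and eventlists to fetch.
    static Map<SkeletonNode, Double> shortestCosts(SkeletonNode superRoot) {
        Map<SkeletonNode, Double> dist = new HashMap<>();
        PriorityQueue<Entry> pq = new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.cost));
        dist.put(superRoot, 0.0);
        pq.add(new Entry(superRoot, 0.0));
        while (!pq.isEmpty()) {
            Entry cur = pq.poll();
            if (cur.cost > dist.getOrDefault(cur.node, Double.MAX_VALUE)) continue;  // stale queue entry
            for (Map.Entry<SkeletonNode, Double> edge : cur.node.outEdges.entrySet()) {
                double alt = cur.cost + edge.getValue();
                if (alt < dist.getOrDefault(edge.getKey(), Double.MAX_VALUE)) {
                    dist.put(edge.getKey(), alt);
                    pq.add(new Entry(edge.getKey(), alt));
                }
            }
        }
        return dist;
    }
}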
4.4 Multipoint Snapshot Queries
Similarly to singlepoint snapshot queries, a multipoint snapshot
query can be reduced to finding a Steiner tree in a weighted directed
graph. We illustrate this through an example. Consider a multipoint
query over three timepoints t1, t2, t3 over the DeltaGraph shown in
Figure 3(a). We first identify the leaf-level eventlists that contain
the three time points, and add virtual nodes St1, St2, St3, shown in
Figure 4(b) using shaded nodes.
struct all three snapshots is then given by the lowest-weight Steiner
tree that connects the super-root and the three virtual nodes (us-
ing appropriate weights depending on the attributes that need to
be fetched). A possible Steiner tree is depicted in the figure using
thicker edges. As we can see, the optimal solution to the multipoint
query does not use the optimal solutions for each of the constituent
singlepoint queries. Finding the lowest weight Steiner tree is un-
fortunately NP-Hard (and much harder for directed graphs vs undi-
rected graphs), and we instead use the standard 2-approximation for
undirected Steiner trees for that purpose. We first construct a com-
plete undirected graph over the set of nodes comprising the root and
the virtual nodes, with the weight of an edge between two nodes set
to be the weight of the shortest path between them in the skeleton.
We then compute the minimum spanning tree over this graph, and
“unfold” it to get a Steiner tree over the original skeleton. This al-
gorithm does not work for general directed graphs, however we can
show that, because of the special structure of a DeltaGraph, it not
only results in valid Steiner trees, but retains the 2-approximation
guarantee as well.
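A compact sketch of this approximation follows, assuming the pairwise shortest-path weights between the terminals (the super-root plus the virtual nodes) have already been computed over the skeleton, e.g., with the Dijkstra routine sketched earlier; unfolding the MST edges back into skeleton paths is omitted, and all names are illustrative.

import java.util.Arrays;

class MultipointPlanner {
    // Steiner-tree 2-approximation via Prim's MST over the metric closure of the terminals.
    // Terminal 0 is the super-root; terminals 1..n-1 are the virtual query nodes. d[i][j] is
    // the shortest-path weight between terminals i and j in the DeltaGraph skeleton.
    // Returns MST parent pointers; each MST edge (parent[v], v) is then "unfolded" into the
    // corresponding skeleton path to determine the deltas and eventlists to fetch.
    static int[] approxSteinerTree(double[][] d) {
        int n = d.length;
        boolean[] inTree = new boolean[n];
        double[] best = new double[n];
        int[] parent = new int[n];
        Arrays.fill(best, Double.MAX_VALUE);
        Arrays.fill(parent, -1);
        best[0] = 0.0;                               // grow the tree from the super-root
        for (int step = 0; step < n; step++) {
            int u = -1;                              // cheapest terminal not yet in the tree
            for (int i = 0; i < n; i++)
                if (!inTree[i] && (u == -1 || best[i] < best[u])) u = i;
            inTree[u] = true;
            for (int v = 0; v < n; v++)              // relax edges of the complete terminal graph
                if (!inTree[v] && d[u][v] < best[v]) { best[v] = d[u][v]; parent[v] = u; }
        }
        return parent;
    }
}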
Aside from multipoint snapshot queries, this technique is also
used for queries asking for a graph valid for a composite Time-
Expression, which we currently execute by fetching the required
snapshots into memory and then operating upon them to find the
components that satisfy the TimeExpression.
4.5 Memory Materialization
For improving query latencies, some nodes in the DeltaGraph
are typically pre-fetched and materialized in memory. In particu-
lar, the highest levels of the DeltaGraph should be materialized, and
further, the “rightmost” leaf (that corresponds to the current graph)
should also be considered as materialized. The task of materializ-
ing one or more DeltaGraph nodes is equivalent to running a single-
point or a multipoint snapshot retrieval query, and we can use the
algorithms discussed above for that purpose. After a node is materi-
alized, we modify the in-memory DeltaGraph skeleton, by adding a
directed edge with weight 0 from the super-root to that node. Any
further snapshot retrieval queries will automatically benefit from
the materialization.
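A minimal sketch of this step, reusing the illustrative SkeletonNode class from the singlepoint-query sketch; the retrieval itself is elided.

class Materializer {
    // Materializing a DeltaGraph node: retrieve it like any snapshot (it gets overlaid in the
    // GraphPool), then record a zero-weight skeleton edge from the super-root so that all
    // subsequent query plans can start from the materialized graph at no extra cost.
    static void materialize(SkeletonNode superRoot, SkeletonNode target) {
        // 1. run the usual singlepoint/multipoint retrieval for 'target' (omitted here)
        // 2. add the shortcut edge to the in-memory skeleton
        superRoot.outEdges.put(target, 0.0);
    }
}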
The option of memory materialization enables fine-grained run-
time control over the query latencies and the memory consumption,
without the need to reconstruct the DeltaGraph. For instance, if
we know that a specific analysis task may need to access snapshots
from a specific period, then we can materialize the lowest common
ancestor of the snapshots from that period to reduce the query la-
tencies. One extreme case is what we call total materialization,
where all the leaves are materialized in memory. This reduces to
the Copy+Log approach with the difference that the snapshots are
stored in memory in an overlaid fashion (in the GraphPool). For
mostly-growing networks (that see few deletions), such material-
ization can be done cheaply resulting in very low query latencies.
4.6 DeltaGraph Construction
Besides the graph itself, represented as a list of all events in a
chronological order, E, the DeltaGraph construction algorithm ac-
cepts four other parameters: (1) L, the size of a leaf-level eventlist;
(2) k, the arity of the graph; (3) f(), the differential function that
computes a combined delta from a given set of deltas; and (4) a
partitioning of the node ID space. The DeltaGraph is constructed
in a bottom-up fashion, similar to how a bulkloaded B+-Tree is con-
structed. We scan E from the beginning, creating the leaf snapshots
and corresponding eventlists (containing L events each). When k
of the snapshots are created, a parent interior node is constructed
from those snapshots. Then the deltas corresponding to the edges
are created, those snapshots are deleted, and we continue scanning
the eventlist.
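The bulk-loading pass can be sketched as follows, reusing the illustrative Event and GraphSnapshot classes from the Section 3.1 sketch; the storage- and function-specific pieces are left abstract, and this is a simplification rather than the system's actual construction code.

import java.util.*;

abstract class DeltaGraphBuilder {
    // Single bottom-up pass over the chronological event trace E: cut a leaf snapshot every
    // L events, and whenever k snapshots accumulate at a level, emit the deltas to their new
    // parent and push the parent one level up (similar to bulk-loading a B+-Tree).
    void build(Iterator<Event> trace, int L, int k) {
        List<List<GraphSnapshot>> pending = new ArrayList<>();   // pending.get(i): unparented nodes at level i
        pending.add(new ArrayList<>());
        GraphSnapshot current = new GraphSnapshot();
        List<Event> leafEvents = new ArrayList<>();
        while (trace.hasNext()) {
            Event e = trace.next();
            current.apply(e);
            leafEvents.add(e);
            if (leafEvents.size() == L) {                        // cut a leaf and its leaf-eventlist
                writeLeafEventlist(leafEvents);
                pending.get(0).add(copyOf(current));
                leafEvents = new ArrayList<>();
                collapse(pending, 0, k);
            }
        }
    }

    private void collapse(List<List<GraphSnapshot>> pending, int level, int k) {
        List<GraphSnapshot> siblings = pending.get(level);
        if (siblings.size() < k) return;
        GraphSnapshot parent = differential(siblings);           // apply f() over the k children
        writeDeltas(parent, siblings);                           // persist ∆(child, parent) for each child
        siblings.clear();                                        // children no longer needed in memory
        if (pending.size() == level + 1) pending.add(new ArrayList<>());
        pending.get(level + 1).add(parent);
        collapse(pending, level + 1, k);                         // may cascade to higher levels
    }

    // Storage- and function-specific pieces, left abstract in this sketch:
    protected abstract GraphSnapshot copyOf(GraphSnapshot g);
    protected abstract GraphSnapshot differential(List<GraphSnapshot> children);   // the function f()
    protected abstract void writeDeltas(GraphSnapshot parent, List<GraphSnapshot> children);
    protected abstract void writeLeafEventlist(List<Event> events);
}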
The entire DeltaGraph can thus be constructed in a single pass
over E, assuming sufficient memory is available. At any point dur-
ing the construction, we may have up to k − 1 snapshots for each
level of the DeltaGraph constructed so far. For higher values of
k, this can lead to very high memory requirements. However, we
use the GraphPool data structure to maintain these snapshots in an
overlaid fashion to decrease the total memory consumption. We
were able to scale to reasonably large graphs using this technique.
Further scalability is achieved by making multiple passes over E,
processing one partition (Section 4.2) in each pass.
4.7 Extensibility
To efficiently support specific types of queries or tasks, it is bene-
ficial to maintain and index auxiliary information in the DeltaGraph
and use it during query execution. We extend the DeltaGraph func-
tionality through user-defined modules and functions for this pur-
pose. In essence, the user can supply functions that compute aux-
iliary information for each snapshot, that will be automatically in-
dexed along with the original graph data. The user may also supply
a different differential function to be used for this auxiliary infor-
mation. The basic retrieval functionality (i.e., retrieve snapshots
as of a specified time points) is thus naturally extended to such
auxiliary information, which can also be loaded into memory and
operated upon. In addition, the user may also supply functions that
operate on the auxiliary information deltas during retrieval, that can
be used to directly answer specific types of queries.
The extensibility framework involves the AuxiliarySnapshot and
AuxiliaryEvent structures which are similar to the graph snapshot
and event structures, respectively. An AuxiliarySnapshot consists
of a hashtable of string key-value pairs, whereas an AuxiliaryEvent
consists of the event timestamp, a flag indicating the addition, dele-
tion or the change of a key-value pair, and finally, a key-value pair
itself. Given the very general nature of the auxiliary structures, it
is possible for the consumer (programs using this API) to define a
wide variety of graph indexing semantics whose historical indexing
will be done automatically by the HistoryManager. For a histori-
cal graph, any number of auxiliary indexes may be used, each one
by extending the AuxIndex abstract class, defining the following:
a) method CreateAuxEvent that generates an AuxiliaryEvent corre-
sponding to a plain Event, based upon the current Graph and the
latest Auxiliary Snapshot, b) method CreateAuxSnapshot, to cre-
ate a leaf-level AuxiliarySnapshot based upon the previous Auxil-
iarySnapshot and the AuxiliaryEventList in between the two, and
c) a method AuxDF, a differential function that computes the parent
AuxiliarySnapshot, given a list of k AuxiliarySnapshots and corre-
sponding k − 1 AuxEventlists.
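The extension point might look roughly as follows in Java; since the text does not spell out exact signatures, the parameter types below (and the use of the paper's HistGraph and the illustrative Event class from the Section 3.1 sketch) are assumptions.

import java.util.*;

// An AuxiliarySnapshot is essentially a hashtable of string key-value pairs.
class AuxiliarySnapshot {
    Map<String, String> entries = new HashMap<>();
}

// An AuxiliaryEvent records a timestamped addition, deletion, or change of a key-value pair.
class AuxiliaryEvent {
    enum Kind { ADD, DELETE, CHANGE }
    long timestamp;
    Kind kind;
    String key, value;
}

// Sketch of the AuxIndex extension point; signatures are assumed, not prescribed.
abstract class AuxIndex {
    // a) translate a plain graph Event into an AuxiliaryEvent, given the current graph
    //    and the latest AuxiliarySnapshot
    abstract AuxiliaryEvent CreateAuxEvent(Event e, HistGraph currentGraph, AuxiliarySnapshot latest);

    // b) build a leaf-level AuxiliarySnapshot from the previous one and the intervening events
    abstract AuxiliarySnapshot CreateAuxSnapshot(AuxiliarySnapshot previous, List<AuxiliaryEvent> eventlist);

    // c) the differential function for auxiliary data: combine k child snapshots (and the
    //    k-1 auxiliary eventlists between them) into the parent AuxiliarySnapshot
    abstract AuxiliarySnapshot AuxDF(List<AuxiliarySnapshot> children, List<List<AuxiliaryEvent>> eventlists);
}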
Any number of queries may be defined on an auxiliary index
by extending any of the three Auxiliary Query abstract classes,
namely, AuxHistQueryPoint, AuxHistQueryInterval or AuxHistQuery,
depending on the temporal nature of the query: point, range or
across the entire time span, respectively. An implementation of any
of these classes involves attaching a pointer to the AuxiliaryIndex
which the queries are supposed to operate upon (in addition to the
plain DeltaGraph index itself), and the particular query method,
AuxQuery. While doing so, the programmer typically makes
use of methods like GetAuxSnapshot provided by the Extensi-
bility API.
We illustrate this extensibility through an example of a subgraph
pattern matching index. Techniques for subgraph pattern matching
are very well studied in the literature (see, e.g., [12]). Say we want to
support finding all instances of a node-labeled query graph (called
pattern graph) in a node-labeled data graph. One simple way to
efficiently support such queries is to index all paths of say length 4
in the data graph [30]. This pattern index here takes the form of a
key-value data structure, where a key is a quartet of labels, and the
value is the set of all paths in the data graph over 4 nodes that match
it. To find all matching instances for a pattern, we then decompose
the pattern into paths of length 4 (there must be at least one such
path in the pattern), use the index to find the sets of paths that match
those decomposed paths, and do an appropriate join to construct the
entire match.
We can extend the basic DeltaGraph structure to support such
an index as follows. The auxiliary information maintained is pre-
cisely the pattern index at each snapshot, which is naturally stored
in a compact fashion in the DeltaGraph (by exploiting the com-
monalities over time). It involves implementation of the AuxIn-
dex and AuxHistQueryInterval classes accordingly. For example,
CreateAuxEvent creates an AuxEvent, defining the addition or
deletion of a path of four nodes, by finding the effect of a plain
Event (in terms of paths) in the context of the current graph. In-
stead of using the standard differential function, we use one that
achieves the following effect: a specific path containing 4 nodes,
say ⟨n1, n2, n3, n4⟩, is present in the pattern index associated with
an interior node if and only if, it is present in all the snapshots below
that interior node. This means that, if the path is associated with
the root, it is present throughout the history of the network. Such
auxiliary information can now be directly used to answer subgraph
pattern matching query to find all matching instances over the his-
tory of the graph. We evaluated our implementation on Dataset 1
(details in Section 7), and assigned labels to each node by ran-
domly picking one from a list of ten labels. We built the index as
described above (by indexing all paths of length 4). We were able
to run a subgraph pattern query in 148 seconds to find all occur-
rences of a given pattern query, returning a total of 14109 matches
over the entire history of the network.
This extensibility framework enables us to quickly design and
deploy reasonable strategies for answering many different types of
queries. We note that, for specific queries, it may be possible to de-
sign more efficient strategies. In ongoing work, we are investigat-
ing the issues in answering specific types of queries over historical
graphs, including subgraph pattern matching, reachability, etc.
5. DELTAGRAPH ANALYSIS
Next we develop analytical models for the storage requirements,
memory consumption, and query latencies for a DeltaGraph.
5.1 Model of Graph Dynamics
Let G0 denote the initial graph as of time 0, and let G_{|E|} denote
the graph after |E| events. To develop the analytical models, we
make some simplifying assumptions, the most critical being that
we assume a constant rate of inserts or deletes. Specifically, we
assume that a δ fraction of the events result in an addition of an
element to a graph (i.e., inserts), and a ρ fraction of the events result
in removal of an existing element from the graph (deletes). An
update is captured as a delete followed by an insert. Thus, we have
that |G_{|E|}| = |G0| + |E|δ − |E|ρ. We have that δ + ρ ≤ 1,
but not necessarily = 1, because of transient events that don't affect
the graph size. Typically we have that δ > ρ. If ρ = 0, we call
the graph a growing-only graph.
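As a quick numerical illustration (with made-up numbers |G0| = 10^6 elements, |E| = 10^7 events, δ = 0.2, and ρ = 0.05):

\[
|G_{|E|}| = |G_0| + |E|\,\delta - |E|\,\rho = 10^6 + (0.2)(10^7) - (0.05)(10^7) = 2.5 \times 10^6 .
\]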
Note that, the above model does not require that the graph change
at a constant rate over time. In fact, the above model (and the Delta-
Graph structure) don't explicitly reason about time but rather only
about the events. To reason about graph dynamics over time, we
need a model that captures event density, i.e., the number of events that
take place over a period of time. Let g(t) denote the total number
of events that take place from time 0 to time t. For most real-world
networks, we expect g(t) to be a super-linear function of t, indicat-
ing that the rate of change over time itself increases over time.

Table 2: Differential Functions
  Name            Description
  Intersection    f(a, b, c, . . .) = a ∩ b ∩ c ∩ . . .
  Union           f(a, b, c, . . .) = a ∪ b ∪ c ∪ . . .
  Skewed          f(a, b) = a + r·(b − a), 0 ≤ r ≤ 1
  Right Skewed    f(a, b) = a∩b + r·(b − a∩b), 0 ≤ r ≤ 1
  Left Skewed     f(a, b) = a∩b + r·(a − a∩b), 0 ≤ r ≤ 1
  Mixed           f(a, b, c, . . .) = a + r1·(δab + δbc + . . .) − r2·(ρab + ρbc + . . .), 0 ≤ r2 ≤ r1 ≤ 1
  Balanced        f(a, b, c, . . .) = a + (1/2)·(δab + δbc + . . .) − (1/2)·(ρab + ρbc + . . .)
  Empty           f(a, b, c, . . .) = ∅
5.2 Differential Functions
Recall that a differential function specifies how the snapshot cor-
responding to an interior node should be constructed from snap-
shots corresponding to its children. The simplest differential func-
tion is intersection. However, for most networks, intersection does
not lead to desirable behavior. For a growing-only graph, intersec-
tion results in a left-skewed DeltaGraph, where the delta sizes are
lower on the part corresponding to the older snapshots. In fact, the
root is exactly G0for a strictly growing-only graph.
Table 2 shows several other differential functions with better and
tunable behavior. Let p be an interior node with children a and b.
Let ∆(a, p) and ∆(b, p) denote the corresponding deltas. Further,
let b = a + δab − ρab.
Skewed: For the two extreme cases, r = 0 and r = 1, we have
that f(a, b) = a and f(a, b) = b respectively. By using an ap-
propriate value of r, we can control the sizes of the two deltas.
For example, for r = 0.5, we get p = a + (1/2)δab. Here (1/2)δab
means that we randomly choose half of the events that com-
prise δab (by using a hash function that maps the events to 0 or
1). So |∆(a, p)| = (1/2)|δab|, and |∆(b, p)| = (1/2)|δab| + |ρab|.
Balanced: This differential function, a special case of Mixed, en-
sures that the delta sizes are balanced across a and b, i.e.,
|∆(a, p)| = |∆(b, p)| = (1/2)|δab| + (1/2)|ρab|. Note that here
we make an assumption that a + (1/2)δab − (1/2)ρab is a valid oper-
ation. A problem may occur because an event in ρab may be
selected for removal, but may not exist in a + (1/2)δab. We can en-
sure that this does not happen by using the same hash function
for choosing both (1/2)δab and (1/2)ρab (a small sketch of this
hash-based selection appears after this list).
Empty: This special case makes the DeltaGraph approach identi-
cal to the Copy+Log approach.
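The sketch below illustrates one way the hash-based selection for the Skewed and Balanced functions could be realized, reusing the illustrative Event class from the Section 3.1 sketch; hashing on the affected element's ids (rather than the whole event record) is our assumption for keeping the insert and delete choices consistent, and the hash itself is deliberately simplistic.

import java.util.ArrayList;
import java.util.List;

class BalancedSelection {
    // Deterministically keep a fraction r of a delta's events (r = 0.5 yields (1/2)δ_ab or (1/2)ρ_ab).
    // Using the same hash for the insert events δ_ab and the delete events ρ_ab ensures that a
    // delete is kept only if the element it removes was also kept on the insert side.
    static List<Event> sample(List<Event> delta, double r) {
        List<Event> kept = new ArrayList<>();
        for (Event e : delta) {
            // hash on the element ids so that an insert and its later delete hash identically
            double h = Math.floorMod(e.nodeId1 * 1_000_003L + e.nodeId2, 1_000_000L) / 1_000_000.0;
            if (h < r) kept.add(e);
        }
        return kept;
    }
}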
The other functions shown in Table 2 can be used to expose more
subtle trade-offs, but our experience with these functions suggests
that, in practice, Intersection, Union, and Mixed functions are likely
to be sufficient for most situations.
5.3 Space and Time Estimation Models
Next, we develop analytical models for various quantities of in-
terest in the DeltaGraph, including the space required to store it,
the distribution of the delta sizes across levels, and the snapshot
retrieval times. We focus on the Balanced and Intersection differ-
ential functions, and omit detailed derivations for lack of space.
We make several simplifying assumptions in the analysis be-
low. As discussed above, we assume constant rates of inserts and
deletes. Let L denote the leaf-eventlist size, and let E denote the
complete eventlist corresponding to the historical trace. Thus, we
have N = |E|/L + 1 leaf nodes. We denote by k the arity of the
graph, and assume that N is a power of k (resulting in a complete
k-ary tree). We number the DeltaGraph levels from the bottom,
starting with 1 (i.e., the bottommost level is called the first level).
Balanced Function: Although it appears somewhat complex, the
Balanced differential function is the easiest to analyze. Consider
an interior node p with k children, c1, . . . , ck. If p is at level 2 (i.e.,
if the ci's are leaves), then Sp = Sc1 + (1/2)δc1c2 − (1/2)ρc1c2 + (1/2)δc2c3 − · · · .
It is easy to show that:
|∆(p, ci)| = (1/2)(|δc1c2| + |ρc1c2| + · · ·), ∀i
           = (1/2)(k − 1)(δ + ρ)L, ∀i
The number of edges between the nodes at the first and second
levels is N, thus the total space required by the deltas at this level
is: N · (1/2)(k − 1)(δ + ρ)L = (1/2)(k − 1)(δ + ρ)|E|.
If p is an interior node at level 3, then the distance between c1
and c2 in terms of the number of events is exactly k(δ + ρ)L.
This is because c2 contains: (1) the insert and delete events from
c1's children that c1 does not contain, (2) the (δ + ρ)L events that
occur between c1's last child and c2's first child, and (3) a further
((k − 1)/2)(δ + ρ)L events from its own children. Using a similar
reasoning to above, we can see that:
|∆(p, ci)| = (1/2)(k − 1)k(δ + ρ)L, ∀i
Surprisingly, because the number of edges at this level is N/k, the
total space occupied by the deltas at this level is the same as that at the
first level, and this is true for the higher levels as well.
Thus, the total amount of space required to store all the deltas
(excluding the one from the empty super-root node to the root) is:
(logkN1)
2(k1)(δ+ρ)|E|
The size of the snapshot corresponding to the root itself can be seen
to be: |G0|+1
2(δρ)|E|(independent of k). Although this
may seem high, we note that the size of the current graph (G|E|)
is: |G0|+ (δρ)|E|, which is larger than the size of the root.
Further, there is a significant overlap between the two, especially if
|G0|is large, making it relatively cheap to materialize the root.
Finally, using the symmetry, we can show that the total weight of
the shortest path between the root and any leaf is: 1
2(δ+ρ)|E|,
resulting in balanced query latencies for the snapshots (for spe-
cific timepoints corresponding to the same leaf-eventlist, there are
small variations because of different portions of the leaf-eventlist
that need to be processed).
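As a quick sanity check of these expressions, the sketch below plugs illustrative (hypothetical) values of |E|, L, k, δ and ρ into the formulas; it is only a calculator for the model above, not part of the system.

// A small calculator (illustrative only) for the analytical estimates derived
// above for the Balanced differential function: total delta space, root
// snapshot size, and root-to-leaf path weight, all measured in events.
public class BalancedEstimates {

    // total space of all deltas, excluding the super-root-to-root delta:
    // (1/2) * (log_k N - 1) * (k - 1) * (delta + rho) * |E|
    static double totalDeltaSpace(double E, int k, double delta, double rho, double L) {
        double N = E / L + 1;                       // number of leaf eventlists
        double levels = Math.log(N) / Math.log(k);  // log_k N
        return 0.5 * (levels - 1) * (k - 1) * (delta + rho) * E;
    }

    // |root| = |G0| + (1/2) * (delta - rho) * |E|, independent of the arity k
    static double rootSize(double G0, double E, double delta, double rho) {
        return G0 + 0.5 * (delta - rho) * E;
    }

    // weight of the shortest root-to-leaf path: (1/2) * (delta + rho) * |E|
    static double rootToLeafPathWeight(double E, double delta, double rho) {
        return 0.5 * (delta + rho) * E;
    }

    public static void main(String[] args) {
        // hypothetical parameters, loosely inspired by Dataset 2 and the settings used later
        double E = 2_000_000, G0 = 1_000_000, L = 30_000, delta = 0.5, rho = 0.5;
        int k = 4;
        System.out.printf("total delta space ~ %.0f events%n", totalDeltaSpace(E, k, delta, rho, L));
        System.out.printf("root size         ~ %.0f events%n", rootSize(G0, E, delta, rho));
        System.out.printf("path weight       ~ %.0f events%n", rootToLeafPathWeight(E, delta, rho));
    }
}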
Intersection: On the other hand, the Intersection function is much
trickier to analyze. In fact, just calculating the size of the intersec-
tion for a sequence of snapshots is non-trivial in the general case.
As above, consider a graph containing |E| events. The root of the DeltaGraph contains all events that were not deleted from G0 during that event trace. We state the following analytical formulas for the size of the root for some special cases, without full derivations.
ρ = 0: For a growing-only graph, the root snapshot is exactly G0.
δ = ρ: In this case, the size of the graph remains constant (i.e., G_{|E|} = G0). We can show that: |root| = |G0| · e^(−|E|δ / |G0|).
δ = 2ρ: In this case, |root| = |G0|² / (|G0| + ρ|E|).
The last two formulas both confirm our intuition that, as the total
number of events increases, the size of the root goes to zero. Sim-
ilar expressions can be derived for the sizes of any specific interior
node or the deltas, by plugging in appropriate values of |E| and |G0|. We omit the resulting expressions for the total size of the index
for the latter two cases.
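For illustration, the two closed-form estimates above can be evaluated directly; the sketch below (with made-up values of |G0|, |E|, δ and ρ) simply shows the root shrinking as |E| grows.

// Illustrative evaluation (sketch only) of the closed-form root-size
// estimates stated above for the Intersection differential function.
public class IntersectionRootSize {

    // delta = rho: |root| = |G0| * exp(-|E| * delta / |G0|)
    static double rootSizeConstantGraph(double G0, double E, double delta) {
        return G0 * Math.exp(-E * delta / G0);
    }

    // delta = 2 * rho: |root| = |G0|^2 / (|G0| + rho * |E|)
    static double rootSizeTwoToOne(double G0, double E, double rho) {
        return G0 * G0 / (G0 + rho * E);
    }

    public static void main(String[] args) {
        double G0 = 1_000_000;                      // made-up initial graph size
        for (double E : new double[]{1e6, 1e7, 1e8}) {
            System.out.printf("|E| = %.0e: const-size root = %.0f, 2:1 root = %.0f%n",
                    E, rootSizeConstantGraph(G0, E, 0.5), rootSizeTwoToOne(G0, E, 0.25));
        }
        // both estimates approach zero as |E| grows, matching the intuition above
    }
}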
The Intersection function does have a highly desirable property: the total weight of the shortest path between the super-root and any leaf is exactly the size of that leaf. Since an interior node contains a subset of the events in each of its children, we only need to fetch the remaining events to construct the child. However, this means that the query latencies are typically skewed, with the older snapshots requiring less time to construct than the newer snapshots (which are typically larger).
5.4 Discussion
We briefly discuss the impact of different construction parame-
ters and suggest strategies for choosing the right parameters. We
then present a qualitative comparison with interval trees,
segment trees, and the Copy+Log approach.
Effect of different construction parameters: The parameters in-
volved in the construction of the DeltaGraph give it high flexibility,
and must be chosen carefully. The optimal choice of the param-
eters is highly dependent on the application scenario and require-
ments. The effect of arity is easy to quantify in most cases: higher
arity results in lower query access times, but usually much higher
disk space utilization (even for the Balanced function, the query ac-
cess time becomes dependent on k for a more realistic cost model where using a higher number of queries to fetch the same amount of information takes more time). Parameters such as r (for the Skewed function) and r1, r2 (for the Mixed function) can be used to control the
query retrieval times over the span of the eventlist. For instance, if
we expect a larger number of queries to be accessing newer snap-
shots, then we should choose higher values for these parameters.
The choice of differential function itself is quite critical. In-
tersection typically leads to lower disk space utilization, but also
highly skewed query latencies that cannot be tuned except through
memory materialization. Most other differential functions lead to
higher disk utilization but provide better control over the query la-
tencies. Thus, if disk utilization is of paramount importance, then Intersection would be the preferred option; otherwise, the Mixed function (with the values of r1 and r2 set according to the expected query workload) would be the recommended option.
Fine-tuning the values of these parameters also requires knowl-
edge of g(t), the event density over time. The analytical models
that we have developed reason about the retrieval times for the leaf
snapshots, but these must be weighted using g(t) to reason about retrieval times over time. For example, the Balanced function does not lead to uniform query latencies over time for graphs that show super-linear growth. Instead, we must choose r1, r2 > 0.5 to guarantee uniform query latencies over time in that case.
Qualitative comparison with other approaches: The Copy+Log approach can be seen as a special case of DeltaGraph with the Empty differential function (and arity = N). Compared to interval trees, DeltaGraph will almost always need more space, but its space consumption is usually lower than that of segment trees. Assume that δ + ρ = 1 (this is the worst case for DeltaGraph). Then, for the Balanced function with arity k = 2, the disk space required is O(|E| log N). Since the number of intervals is at least |E|/2, the space requirements for interval trees and segment trees are O(|E|) and O(|E| log |E|) respectively. For growing-only graphs and the Intersection function, we see similar behavior.
In most other scenarios, we expect the total space requirement for DeltaGraph to be somewhere in between O(|E|) and O(|E| log N), and lower if δ + ρ ≪ 1 (which is often the case for social networks).
Regarding query latencies, for the Intersection function without any materialization, the amount of information retrieved for answering a snapshot query is exactly the size of the snapshot. Both interval trees and segment trees behave similarly. On the other hand, if the root or some of the higher levels of the DeltaGraph are materialized, then the query latencies could be significantly lower than what we can achieve with either of those approaches. For the Balanced function, if the root is materialized, then the average query latencies are similar for the three approaches. However, for the Balanced function, the retrieval times do not depend on the size of the retrieved snapshot, unlike interval and segment trees, leading to more predictable and desirable behavior. Again, with materialization, the query latencies can be brought down even further.
Figure 5: (a) Graphs at times tcurrent, t1, and t2; (b) GraphPool consisting of overlaid graphs; (c) GraphID-Bit Mapping Table, for example:
Bit    GraphID    Graph          Dep
2,3    34         Hist. Graph    -
4      4          Mat. Graph     -
5      41         Mat. Graph     -
6,7    35         Hist. Graph    4
6. GRAPHPOOL
The in-memory graphs are stored in the in-memory GraphPool
in an overlapping manner. In this section, we briefly describe the
key ideas behind this data structure.
Description: GraphPool maintains a single graph that is the union
of all the active graphs including: (1) the current graph, (2) his-
torical snapshots, and (3) materialized graphs (Figure 5). Each
component (node or edge), and each possible value of each of its attributes, is associated with a bitmap string (called BM henceforth), used to decide which of the active graphs contain that component or attribute. A GraphID-Bit mapping table is used to maintain the mapping of bits to different graphs. Figure 5(c) shows an example of such a mapping. Each historical graph that has been fetched is assigned two consecutive Bits, {2i, 2i+1}, i ≥ 1. Materialized graphs, on the other hand, are only assigned one Bit.
Bits 0 and 1 are reserved for the current graph membership.
Specifically, Bit 0 tells us whether the element belongs to the cur-
rent graph or not. Bit 1, on the other hand, is used for elements that
may have been recently deleted, but are not part of the DeltaGraph
index yet. A Bit associated with a materialized graph is interpreted
in a straightforward manner.
Using a single bit for a historical graph misses out on a signif-
icant optimization opportunity. Even if a historical graph differs
from the current graph or one of the materialized graphs in only a
few elements, we would still have to set the corresponding bit ap-
propriately for all the elements in the graph. We can use the bit
pair, {2i, 2i+1}, to eliminate this step. We mark the historical graph as being dependent on a materialized graph (or the current graph) in such a case. For example, in Figure 5(c), historical snapshot 35 is dependent on materialized graph 4. If Bit 2i is true, then the membership of an element in the historical graph is identical to its membership in the materialized graph (i.e., if present in one, then present in the other). On the other hand, if Bit 2i is false, then Bit 2i+1 tells us whether the element is contained in the historical graph or not (independent of the materialized graph).
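The following sketch shows one way the bit-pair semantics described above can be checked for a single element; it is our own simplification rather than the system's code, and the bit assignments mirror the example of Figure 5(c).

import java.util.BitSet;

// A minimal sketch (not the actual GraphPool implementation) of interpreting
// the bit pair {2i, 2i+1} of a historical snapshot that may be marked as
// dependent on a materialized graph.
public class BitmapMembership {

    // Returns true if the element whose bitmap is 'bm' belongs to the
    // historical graph assigned the bit pair {2i, 2i+1}. 'materializedBit' is
    // the bit of the materialized graph it depends on, or -1 if independent.
    static boolean inHistoricalGraph(BitSet bm, int i, int materializedBit) {
        if (materializedBit >= 0 && bm.get(2 * i)) {
            // Bit 2i set: membership mirrors the materialized graph
            return bm.get(materializedBit);
        }
        // otherwise Bit 2i+1 records membership directly
        return bm.get(2 * i + 1);
    }

    public static void main(String[] args) {
        BitSet bm = new BitSet();
        bm.set(4);   // the element is in the materialized graph assigned Bit 4
        bm.set(6);   // Bit 2i for historical snapshot 35 (i = 3): "same as Bit-4 graph"
        System.out.println(inHistoricalGraph(bm, 3, 4));   // true, via the dependency
        bm.clear(6);
        bm.set(7);   // explicit membership, independent of the materialized graph
        System.out.println(inHistoricalGraph(bm, 3, 4));   // true, via Bit 2i+1
    }
}

Marking a snapshot as dependent thus lets the pool avoid touching the bitmaps of the many elements that the snapshot shares with the materialized graph.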
When a graph is pulled into memory, either in response to a query or for materialization, it is overlaid onto the current in-
memory graph, edge by edge and node by node. The number of
graphs that can be overlaid simultaneously depends primarily on
the amount of memory required to contain the union of all the
graphs. The bitmap size is dynamically adjusted to accommo-
date more graphs if needed, and overall does not occupy significant
space. The determination of whether to store a graph as being de-
pendent on a materialized graph is made at query time. During
the query plan construction, we count the total number of events
that need to be applied to the materialized graph, and if it is small
relative to the size of the graph, then the fetched graph is marked
as being dependent on the materialized graph.
Updates to the Current graph: As the current graph is being up-
dated, the DeltaGraph index is continuously updated. All the new
events are recorded in a recent eventlist. When the eventlist reaches
sufficient size (i.e., L), the eventlist is inserted into the index and
the index is updated by adding appropriate edges and deltas. We
omit further details because of lack of space.
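A minimal sketch of this update path, under our own simplifying assumptions, is shown below; the DeltaGraphIndex interface is hypothetical and merely stands in for the step that creates a new leaf and the corresponding deltas.

import java.util.ArrayList;
import java.util.List;

// A sketch (not the system's API) of buffering ongoing updates: new events
// accumulate in a recent eventlist and are flushed into the index once L
// events have been collected.
public class RecentEventBuffer {

    interface DeltaGraphIndex {                      // hypothetical index interface
        void appendLeafEventlist(List<String> events);
    }

    private final int leafEventlistSize;             // the parameter L
    private final DeltaGraphIndex index;
    private final List<String> recent = new ArrayList<>();

    RecentEventBuffer(int leafEventlistSize, DeltaGraphIndex index) {
        this.leafEventlistSize = leafEventlistSize;
        this.index = index;
    }

    // record one event (e.g., "addEdge u v t"); flush when the buffer reaches L
    void record(String event) {
        recent.add(event);
        if (recent.size() >= leafEventlistSize) {
            index.appendLeafEventlist(new ArrayList<>(recent));
            recent.clear();
        }
    }

    public static void main(String[] args) {
        RecentEventBuffer buffer = new RecentEventBuffer(3,
                events -> System.out.println("flush new leaf: " + events));
        for (int t = 1; t <= 7; t++) buffer.record("addEdge u" + t + " v" + t + " " + t);
    }
}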
Clean-up of a graph from memory: When a historical graph is no longer needed, it must be cleaned up. Cleaning up a graph is logically the reverse of fetching it into memory. The naive way would be to go through all the elements in the graph, reset the appropriate bit(s), and delete an element if none of its bits remain set. However, the cost of doing this can be quite high. We instead perform clean-up in a lazy fashion, periodically scanning the GraphPool in the absence of query load to reset the bits and to check whether any elements should be deleted. In addition, if the system is running low on memory and there are one or more unneeded graphs, the Cleaner thread is invoked and is not interrupted until the desired amount of memory is freed.
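The sketch below illustrates the lazy clean-up idea; the element identifiers, pool layout, and scheduling are our own assumptions and not the system's implementation.

import java.util.BitSet;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A sketch of lazy clean-up: bits of graphs that are no longer needed are
// cleared during a periodic scan, and elements whose bitmaps become empty are
// dropped from the pool (our own simplification).
public class LazyCleaner {

    private final Map<String, BitSet> pool = new ConcurrentHashMap<>(); // element id -> BM
    private final BitSet bitsToClear = new BitSet();                    // unneeded graphs

    void markGraphUnneeded(int bit) { bitsToClear.set(bit); }

    // invoked periodically in the absence of query load, or under memory pressure
    void sweep() {
        for (Iterator<Map.Entry<String, BitSet>> it = pool.entrySet().iterator(); it.hasNext(); ) {
            BitSet bm = it.next().getValue();
            bm.andNot(bitsToClear);            // reset the bits of unneeded graphs
            if (bm.isEmpty()) it.remove();     // element belongs to no active graph
        }
        bitsToClear.clear();
    }

    public static void main(String[] args) {
        LazyCleaner cleaner = new LazyCleaner();
        BitSet bm = new BitSet();
        bm.set(4);                             // in a materialized graph (Bit 4)
        bm.set(7);                             // in a historical snapshot (Bit 7)
        cleaner.pool.put("node:17", bm);
        cleaner.markGraphUnneeded(7);          // that snapshot is no longer needed
        cleaner.sweep();                       // Bit 7 cleared; element kept via Bit 4
        System.out.println(cleaner.pool);      // {node:17={4}}
    }
}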
7. EMPIRICAL EVALUATION
In this section, we present the results of a comprehensive experimental evaluation of our prototype system, implemented in Java using the Kyoto Cabinet
key-value store as the underlying persistent storage. The system
provides a programmatic API including the API discussed in Sec-
tion 3.2.1; in addition, we have implemented a Pregel-like itera-
tive framework for distributed processing, and the subgraph pattern
matching index presented in Section 4.7.
Datasets: We used three datasets in our experimental study.
(1) Dataset 1 is a growing-only co-authorship network extracted from the DBLP dataset, with 2M edges in all. The network starts empty and grows over a period of seven decades. The nodes (authors) and edges (co-author relationships) are added to the network, and no nodes or edges are dropped. At the end, the total number of unique nodes present in the graph is around 330,000, and the number of edges with unique end points is 1.04M. Each node was assigned 10 attribute key-value pairs, generated at random.
(2) Dataset 2 is a randomly generated historical trace with Dataset 1 as the starting snapshot, followed by 2M events in which 1M edges are added and 1M edges are deleted over time.
(3) Dataset 3 is a randomly generated historical trace with a starting snapshot containing 10 million (10M) edges and 3M nodes (from a patent citation network), followed by 100M events: 50M edge additions and 50M edge deletions.
Figure 6: Comparing DeltaGraph and Copy+Log. Int and Bal denote the Intersection and Balanced functions respectively. (a) Performance: Dataset 1 (Copy+Log vs DG(Int)); (b) Performance: Dataset 2 (Copy+Log vs DG(Int) vs DG(Int, Root Mat)).
Experimental Setup: We created a partitioned index for Dataset 3
and deployed a parallel framework for PageRank computation us-
ing 7 machines, each with a single Amazon EC2 core and approxi-
mately 1.4GB of available memory. Each DeltaGraph partition was
approximately 2.2GB. Note that the index is stored in a compressed
fashion (using built-in compression in Kyoto Cabinet). On average,
it took us 23.8 seconds to calculate PageRank for a specific graph
snapshot, including the snapshot retrieval time. This experiment
illustrates the effectiveness of our framework at scalably handling
large historical graphs.
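For concreteness, a driver for this experiment could look like the sketch below; the SnapshotStore interface is invented for illustration (it is not the API of Section 3.2.1), and the iteration count is arbitrary.

import java.util.Map;

// A hypothetical experiment driver (sketch only): fetch a snapshot for a given
// time point, run PageRank on it, and report the elapsed time including the
// snapshot retrieval, as in the numbers reported above.
public class SnapshotPageRankDriver {

    interface Graph { }                                  // placeholder graph handle
    interface SnapshotStore {                            // invented, stand-in API
        Graph getSnapshot(long timestamp);
        Map<Long, Double> pageRank(Graph g, int iterations);
    }

    static long timePageRankMillis(SnapshotStore store, long timestamp) {
        long start = System.nanoTime();
        Graph g = store.getSnapshot(timestamp);          // retrieval time is included
        store.pageRank(g, 20);                           // arbitrary iteration count
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // a trivial stand-in store, only so that the sketch runs end to end
        SnapshotStore dummy = new SnapshotStore() {
            public Graph getSnapshot(long t) { return new Graph() { }; }
            public Map<Long, Double> pageRank(Graph g, int it) { return Map.of(); }
        };
        System.out.println(timePageRankMillis(dummy, 1_230_000_000L) + " ms");
    }
}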
For the rest of the experimental study, we report results for Datasets 1 and 2: the techniques we compare against are centralized, and the cost of constructing the index over Dataset 3 makes it hard to run experiments that evaluate the effect of the construction parameters.
Unless otherwise specified, the experiments were run on a single
Amazon EC2 core (with 1.4GB memory).
Comparison with other storage approaches: We begin by comparing our approach with in-memory interval trees and the Copy+Log approach. Both of those were integrated into our system such that
any of the approaches could be used to fetch the historical snap-
shots into the GraphPool, and we report the time taken to do so.
Figure 6 shows the results of our comparison between Copy+Log
and DeltaGraph approaches for time taken to execute 25 uniformly
spaced queries on Datasets 1 and 2. The leaf-eventlist sizes were
chosen so that the disk storage space consumed by both the ap-
proaches was about the same. For similar disk space constraints
(450MB and 950MB for Datasets 1 and 2, respectively), the DeltaGraph could afford a smaller value of L and hence a higher number of snapshots than the Copy+Log approach. As we can see, the best
DeltaGraph variation outperformed the Copy+Log approach by a
factor of at least 4, and orders of magnitude in several cases.
Figure 7 shows the comparison between an in-memory interval
tree and two DeltaGraph variations: (1) with low materialization,
(2) with all leaf nodes materialized. We compared these configura-
tions for the time taken to execute 25 queries on Dataset 2, using k = 4 and L = 30000.
Figure 7: Performance of different DeltaGraph configurations vs. Interval Tree. (a) Performance: Dataset 2; (b) Index Memory: Dataset 2 (Interval Tree, DG with root's grandchildren materialized, DG with total materialization).
We can see that both the DeltaGraph approaches
outperform the interval tree approach, while using significantly less
memory than the interval tree (even with total materialization). The
largely disk-resident DeltaGraph with root’s grandchildren mate-
rialized is more than four times as fast as the regular approach, whereas the total materialization approach, a fairer comparison, is much faster.
We also evaluated a naive approach similar to the Log technique,
with raw events being read from input files directly (not shown in
graphs). The average retrieval times were worse than DeltaGraph
by factors of 20 and 23 for Datasets 1 and 2 respectively.
GraphPool memory consumption: Figure 8(a) shows the total (cu-
mulative) memory consumption of the GraphPool when a sequence
of 100 singlepoint snapshot retrieval queries, uniformly spaced over
the life span of the network, is executed against Datasets 1 and 2.
By exploiting the overlap between these snapshots, the GraphPool
is able to maintain a large number of snapshots in memory. For
Dataset 2, if the 100 graphs were to be stored disjointly, the total
memory consumed would be 50GB, whereas the GraphPool only
requires about 600MB. The plot for Dataset 1 is almost constant because, for this dataset, any historical snapshot is a subset of the current graph. The minor increase toward the end is due to the increase in the bitmap size, required to accommodate new queries.
Multicore Parallelism: Figure 8(b) shows the advantage of concurrent query processing on a multi-core processor using a partitioned DeltaGraph approach, where we retrieve the graph in parallel using multiple threads. We observe near-linear speedups, further validating our parallel design.
Multipoint queries: Figure 8(c) shows the time taken to retrieve
multiple graphs using our multipoint query retrieval algorithm, and
multiple invocations of the single query retrieval algorithm on Dataset
1. The x-axis represents the number of snapshot queries, which
were chosen to be 1 month apart. As we can see, the advantages of
multipoint query retrieval are quite significant because of the high
overlap between the retrieved snapshots.
Advantages of columnar storage: Figure 8(d) shows the perfor-
mance benefits of our columnar storage approach for Dataset 1. As
we can see, if we are only interested in the network structure, our
approach can improve query latencies by more than a factor of 3.
Bitmap penalty: We measured the penalty of using the bitmap filtering procedure in GraphPool by running a PageRank computation with and without the use of bitmaps. We observed that the execution time increases from 1890ms to 2014ms, an increase of less than 7%.
Figure 8: (a) Cumulative GraphPool memory consumption; (b) Multi-core parallelism (Dataset 2); (c) Multipoint query execution vs multiple singlepoint queries; (d) Retrieval with and without attributes (Dataset 2).
Effect of DeltaGraph construction parameters: We measured the average query times and storage space consumption for different values of the arity (k) and leaf-eventlist sizes (L) for Dataset 1.
Figure 9(a) shows that with an increase in the arity of the Delta-
Graph, the average query time decreases rapidly in the beginning,
but quickly flattens. On the other hand, the space requirement generally increases, with small exceptions when the height of the tree does not change with increasing arity. We omit a detailed discussion of the exceptions as their impact is minimal. Again,
referring back to the discussion in Section 5.4, the results corrobo-
rate our claim that higher arity leads to smaller DeltaGraph heights
and hence smaller query times, but results in higher space require-
ments. The effect of the leaf-eventlist size is also as expected. As
the leaf-eventlist size increases, the total space consumption de-
creases (since there are fewer leaves), but the query times also in-
crease dramatically.
Materialization: Figure 10 shows the benefits of materialization
for a DeltaGraph and the associated cost in terms of memory, for
Dataset 2 with arity = 4 and using the Intersection differential func-
tion. We compared four different situations: (a) no materialization,
(b) root materialized, (c) both children of the root materialized, and
(d) all four grandchildren of the root materialized. The results are
as expected – we can significantly reduce the query latencies (up to
a factor of 8) at the expense of higher memory consumption.
Differential functions: In Section 5.4, we discussed how the choice
of an appropriate differential function can help achieve desired dis-
tributions of retrieval times for a given network. Figure 11(a) com-
pares the behavior of Intersection and Balanced functions with and
without root materialization on Dataset 1. On the growing-only
graph, using the Intersection function results in skewed query times,
with the larger (newer) snapshots taking longer to load. The bal-
anced function, on the other hand, provides a more uniform access
pattern, although the average time taken is higher.
Figure 9: Effect of varying arity and leaf-eventlist sizes. (a) Varying Arity (Dataset 1); (b) Varying EventList Size (Dataset 1).
Figure 10: Effect of materialization (None, Root, Root's children, Root's grandchildren). (a) Average Query Time; (b) Materialization Memory Requirement.
By materializ-
ing the root node, the average becomes comparable to that of Intersection, yielding a uniform access pattern. A few different configurations of the Mixed function are shown in Figure 11(b), with r1 = 0.5, r2 = 0.5 denoting the Balanced function. As we can see, by choosing an appropriate differential function, we can exercise fine-grained control over the query retrieval times, thus validating one of our main goals with the DeltaGraph approach.
Figure 11: (a) Comparison of differential functions, Intersection vs Balanced, with and without root materialization (Dataset 1); (b) Comparison of different Mixed function configurations (Dataset 1), with r1 = r2 set to 0.1, 0.5, and 0.9.
8. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an approach for managing historical
data for large information networks, and for executing snapshot re-
trieval queries on them. We presented DeltaGraph, a distributed hi-
erarchical structure to compactly store the historical trace of a net-
work, and GraphPool, a compact way to maintain and operate upon
multiple graphs in memory. Our experimental evaluation shows that DeltaGraph is superior to the existing alternatives. We showed both analytically and empirically that the flexi-
bility of DeltaGraph helps control the distribution of query access
times through appropriate parameter choices at construction time,
and memory materialization at runtime. Our experimental evalua-
tion demonstrated the impact of many of our optimizations, includ-
ing multi-query optimization and columnar storage. Our work so far has also opened up many opportunities for further work, including developing improved DeltaGraph construction algorithms and techniques for processing different types of temporal queries over the historical trace, which we plan to pursue in the future.
9. REFERENCES
[1] C. Aggarwal and H. Wang. Graph data management and
mining: A survey of algorithms and applications. In
Managing and Mining Graph Data, pages 13–68. 2010.
[2] J. Ahn, C. Plaisant, and B. Shneiderman. A task taxonomy of
network evolution analysis. HCIL Tech. Reports, April 2011.
[3] L. Arge and J. Vitter. Optimal dynamic interval management
in external memory. In FOCS, 1996.
[4] S. Asur, S. Parthasarathy, and D. Ucar. An event-based
framework for characterizing the evolutionary behavior of
interaction graphs. ACM TKDD, 2009.
[5] T. Berger-Wolf and J. Saia. A framework for analysis of
dynamic social networks. In KDD, 2006.
[6] G. Blankenagel and R. Guting. External segment trees.
Algorithmica, 12(6):498–532, 1994.
[7] A. Bolour, T. L. Anderson, L. J. Dekeyser, and H. K. T.
Wong. The role of time in information processing: a survey.
SIGMOD Rec., 1982.
[8] P. Buneman, S. Khanna, K. Tajima, and W. Tan. Archiving
scientific data. ACM TODS, 29(1):2–42, 2004.
[9] C. J. Date, H. Darwen, and N. A. Lorentzos. Temporal data and the relational model. Elsevier, 2002.
[10] S. Ghandeharizadeh, R. Hull, and D. Jacobs. Heraclitus:
elevating deltas to be first-class citizens in a database
programming language. ACM TODS, 21(3), 1996.
[11] D. Greene, D. Doyle, and P. Cunningham. Tracking the
evolution of communities in dynamic social networks. In
ASONAM, 2010.
[12] H. He and A. Singh. Graphs-at-a-time: query language and
access methods for graph databases. In SIGMOD, 2008.
[13] U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A
peta-scale graph mining system implementation and
observations. In ICDM, 2009.
[14] R. Kumar, J. Novak, and A. Tomkins. Structure and
evolution of online social networks. In KDD, 2006.
[15] A. Lakshman and P. Malik. Cassandra: a decentralized
structured storage system. SIGOPS Oper. Syst. Rev., 2010.
[16] N. Lam and R. Wong. A fast index for XML document
version management. In APWeb, 2003.
[17] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution:
Densification and shrinking diameters. ACM TKDD, 2007.
[18] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski. Pregel: a system for
large-scale graph processing. In PODC, 2009.
[19] A. Marian, S. Abiteboul, G. Cobena, and L. Mignet.
Change-centric management of versions in an XML
warehouse. In VLDB, 2001.
[20] I. McCulloh and K. Carley. Social network change detection.
Center for the Computational Analysis, 2008.
[21] B. Motik. Representing and querying validity time in RDF
and OWL: A logic-based approach. In ISWC, 2010.
[22] S. Navlakha and C. Kingsford. Network archaeology:
Uncovering ancient networks from present-day interactions.
PLoS Comput Biol, 7(4), 2011.
[23] G. Ozsoyoglu and R. T. Snodgrass. Temporal and real-time databases: a survey. IEEE TKDE, 7(4):513–532, 1995.
[24] A. Pugliese, O. Udrea, and V. Subrahmanian. Scaling rdf
with time. In WWW, 2008.
[25] C. Ren, E. Lo, B. Kao, X. Zhu, and R. Cheng. On querying
historical evolving graph sequences. In VLDB, 2011.
[26] S. Salihoglu and J. Widom. GPS: A graph processing system. Technical Report 1039, Stanford University, 2012.
[27] B. Salzberg and V. Tsotras. Comparison of access methods
for time-evolving data. ACM Comput. Surv., 31(2), 1999.
[28] A. Seering, P. Cudre-Mauroux, S. Madden, and
M. Stonebraker. Efficient versioning for scientific array
databases. In ICDE, 2012.
[29] B. Shao, H. Wang, and Y. Li. The trinity graph engine.
Technical Report MSR-TR-2012-30, Microsoft Research,
2012.
[30] D. Shasha, J. Wang, and R. Giugno. Algorithmics and
applications of tree and graph searching. In PODS, 2002.
[31] R. Snodgrass and I. Ahn. A taxonomy of time in databases.
In SIGMOD, pages 236–246, 1985.
[32] R. T. Snodgrass, editor. The TSQL2 Temporal Query Language. Kluwer, 1995.
[33] L. Tang, H. Liu, J. Zhang, and Z. Nazeri. Community
evolution in dynamic multi-mode networks. In KDD, 2008.
[34] A. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and
R. Snodgrass (editors). Temporal Databases: Theory,
Design, and Implementation. 1993.
[35] C. Tantipathananandh, T. Berger-Wolf, and D. Kempe. A
framework for community identification in dynamic social
networks. In KDD, 2007.
[36] J. Tappolet and A. Bernstein. Applied temporal RDF:
Efficient temporal querying of RDF data with SPARQL. In
ESWC, pages 308–322, 2009.
[37] V. Tsotras and N. Kangelaris. The snapshot index: an I/O-
optimal access method for timeslice queries. Inf. Syst., 1995.
... In response to this issue, [21] proposes a strategy to determine when snapshots should be materialized based on the distribution of historical queries. [22] introduces an in-memory data structure and a hierarchical index structure to retrieve efficiently snapshots of an evolving graph. [23] proposes a framework to construct a small number of representative graphs based on similarity. ...
... Research axis Work Purpose Setting Data redundancy reduction [21] Snapshot storage and retrieval, Distribution of historical queries Centralized [22] Snapshot storage and retrieval Distributed [23] Snapshot storage and retrieval Centralized [24] Temporal graph storage and algorithms, Interval-centric model Distributed [25] Temporal property graph, Interval-centric model, Temporal path queries Distributed [26] Historical graph storage and analysis, Node-centric model Distributed Implementation environment [27] Modelling, storing and querying time-varying graphs, Neo4j Centralized [28] Temporal graph data management system, ACID transactions, Neo4j Centralized these optimization techniques for snapshots always accept some data redundancy. To completely avoid data redundancy, some works recommend to use a data model completely in break with snapshots. ...
Article
Graph data management systems are designed for managing highly interconnected data. However, most of the existing work on the topic does not take into account the temporal dimension of such data, even though they may change over time: new interconnections, new internal characteristics of data (etc.). For decision makers, these data changes provide additional insights to explain the underlying behaviour of a business domain. The objective of this paper is to propose a complete solution to manage temporal interconnected data. To do so, we propose a new conceptual model of temporal graphs. It has the advantage of being generic as it captures the different kinds of changes that may occur in interconnected data. We define a set of translation rules to convert our conceptual model into the logical property graph. Based on the translation rules, we implement several temporal graphs according to benchmark and real-world datasets in the Neo4j data store. These implementations allow us to carry out a comprehensive study of the feasibility and usability (through business analyses), the efficiency (saving up to 99% query execution times comparing to classic approaches) and the scalability of our solution.
... If an arc changes in the next frame, then they create a new adjacency list for that vertex's arc and add a pointer to the new frame. The DeltaGraph [15] is also a distributed index, but it groups different snapshots in a hierarchical structure based on common arcs. EveLog [11] is a compressed adjacency log structure based on the log of events strategy. ...
Article
Full-text available
Time-evolving web and social network graphs are modeled as a set of pages/individuals (nodes) and their arcs (links/relationships) that change over time. Due to their popularity, they have become increasingly massive in terms of their number of nodes, arcs, and lifetimes. However, these graphs are extremely sparse throughout their lifetimes. For example, it is estimated that Facebook has over a billion vertices, yet at any point in time, it has far less than 0.001% of all possible relationships. The space required to store these large sparse graphs may not fit in most main memories using underlying representations such as a series of adjacency matrices or adjacency lists. We propose building a compressed data structure that has a compressed binary tree corresponding to each row of each adjacency matrix of the time-evolving graph. We do not explicitly construct the adjacency matrix, and our algorithms take the time-evolving arc list representation as input for its construction. Our compressed structure allows for directed and undirected graphs, faster arc and neighborhood queries, as well as the ability for arcs and frames to be added and removed directly from the compressed structure (streaming operations). We use publicly available network data sets such as Flickr, Yahoo!, and Wikipedia in our experiments and show that our new technique performs as well or better than our benchmarks on all datasets in terms of compression size and other vital metrics.
... Prior research efforts also dealt with the challenge of efficiently querying historical graph data [25], [26], [27]. These earlier approaches do consider compact representations that exploit the redundancy exhibited by temporal graphs. ...
Conference Paper
Full-text available
Contemporary data-systems empowering the daily human activity are routinely represented with graphs. During the last decade, the volume growth of such systems has been unprecedented. This hinders the timely analysis of the formed networks due to existing physical memory limitations and significant I/O overheads. Graph compression techniques have managed to reduce memory requirements and allow for representing such networks using a few bits-per-edge. Respective approaches offer succinct mappings for social, biological, and information networks while allowing for the efficient access of sought graph elements. Despite their success, such methods mostly focus on static graphs, and predominantly offer access to either a snapshot or an aggregated view of a network. In reality however, networks change over time and, in many instances, we are interested in capturing and studying this evolution. In this paper we propose a framework for compressing emerging temporal graphs based on a dual-representation which articulates both network structure and corresponding temporal information. We empirically establish properties exhibited by community-networks regarding their time aspect(s) and harness these features in our proposed representation. Our experimental evaluation demonstrates that our approach for compressing temporal graphs readily outperforms competing techniques, attaining compression ratios that are on average around 60% of the space required by state-of-the-art techniques. Moreover, our memory-efficient representation yields more than 70% faster graph compression and orders of magnitude quicker retrieval of graphs' elements, especially when it comes to large-scale networks. Finally, our framework is the first effort we are aware of, that considers actual time instead of time steps. This helps us attain better control for the size of our representation and reap further memory savings.
... Solutions for graph stream analytics, e.g., [25,58,67,71,82] may also be seen as related to InTempo, since they are also concerned with the efficient storage and querying of graphs that are constantly updated. These solutions model history as a sequence of change-based snapshots stored either entirely or partly on disk. ...
Article
Full-text available
Modern software systems are intricate and operate in highly dynamic environments for which few assumptions can be made at design-time. This setting has sparked an interest in solutions that use a runtime model which reflects the system state and operational context to monitor and adapt the system in reaction to changes during its runtime. Few solutions focus on the evolution of the model over time, i.e., its history, although history is required for monitoring temporal behaviors and may enable more informed decision-making. One reason is that handling the history of a runtime model poses an important technical challenge, as it requires tracing a part of the model over multiple model snapshots in a timely manner. Additionally, the runtime setting calls for memory-efficient measures to store and check these snapshots. Following the common practice of representing a runtime model as a typed attributed graph, we introduce a language which supports the formulation of temporal graph queries, i.e., queries on the ordering and timing in which structural changes in the history of a runtime model occurred. We present a querying scheme for the execution of temporal graph queries over history-aware runtime models. Features such as temporal logic operators in queries, the incremental execution, the option to discard history that is no longer relevant to queries, and the in-memory storage of the model, distinguish our scheme from relevant solutions. By incorporating temporal operators, temporal graph queries can be used for runtime monitoring of temporal logic formulas. Building on this capability, we present an implementation of the scheme that is evaluated for runtime querying, monitoring, and adaptation scenarios from two application domains.
... Multiple temporal graph models have been proposed in this category with most of them using a versioning method. For example, Khurana et al. [26] proposed a way to efficiently query historical data. They focus on querying the state of a network at a particular point in time (snapshot). ...
Article
In recent decades, there has been a significant increase in the use of smart devices and sensors that led to high-volume temporal data generation. Temporal modeling and querying of this huge data have been essential for effective querying and retrieval. However, custom temporal models have the problem of generalizability, whereas the extended temporal models require users to adapt to new querying languages. In this thesis, we propose a method to improve the modeling and retrieval of temporal data using an existing graph database system (i.e., Neo4j) without extending with additional operators. Our work focuses on temporal data represented as intervals (event with a start and end time). We propose a novel way of storing temporal interval as cartesian points where the start time and the end time are stored as the x and y axis of the cartesian coordinate. We present how queries based on Allen’s interval relationships can be represented using our model on a cartesian coordinate system by visualizing these queries. Temporal queries based on Allen’s temporal intervals are then used to validate our model and compare with the traditional way of storing temporal intervals (i.e., as attributes of nodes). Our experimental results on a soccer graph database with around 4000 games show that the spatial representation of temporal interval can provide significant performance (up to 3.5 times speedup) gains compared to a traditional model.
Article
Dynamic graph traversals (DGTs) are widely used in many important application domains nowadays, especially in this big-data era that urgently demands high-performance graph processing and analysis. Unlike static graph traversals, DGTs in real-world application scenarios require not only fast traversal acceleration itself, but also, more importantly, a runtime strategy that can effectively accommodate the ever-evolving nature of graph structure updates followed by a diverse range of graph traversal algorithms. Because of these special features, state-of-the-art designs on conventional compute-centric architectures (e.g., CPU and GPU) struggle to provide sufficient acceleration for DGT processing due to the dominating irregular memory access patterns in graph traversal algorithms and inefficient platform-specific update mechanisms. In this paper, we explore the algorithmic features and runtime requirements of real-world DGTs and identify their unique opportunities for acceleration on the recent Micron Automata Processor (AP), an in-situ memory-centric pattern-matching architecture. These features include the natural mapping from traversal algorithms' path exploration pattern to classic non-deterministic finite automata processing, AP's architectural and compilation support for DGTs' evolving traversal operations, and its inherent hardware fitness. However, despite these benefits, enabling highly efficient DGT execution on AP is non-trivial and faces several major challenges. To tackle them, we propose DynamAP, the first AP framework design that enables fast processing for general DGTs. DynamAP is oblivious to periodic traversal algorithm changes and can address the significant overhead caused by frequent graph updates and AP recompilation through our novel hybrid macro designs and their associated efficient updating strategies. We evaluate DynamAP against current DGT designs on CPU, GPU and AP with a range of widely adopted DGT algorithms and real-world graphs. For a single update request, DynamAP achieves an average speedup of 21.3x (up to 39.2x) over the state-of-the-art implementation on a host-AP architecture, and average speedups of 9.2x (up to 14.7x) and 1.7x (up to 2.8x) over two highly optimized DGT design frameworks on a 64GB Intel(R) Xeon CPU and a 32GB Nvidia Tesla V100 GPU, respectively. DynamAP also maintains high performance and resource utilization for high graph update ratios, and can significantly benefit natural graphs that present a high average vertex degree.
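The mapping from path exploration to non-deterministic finite automaton (NFA) processing that the paper builds on can be illustrated purely in software (hypothetical names, no AP-specific details): advancing a BFS frontier by one hop is exactly advancing an NFA's set of active states by one transition.

```python
# Illustration only: a BFS frontier advanced like an NFA's active state set.
# The adjacency list plays the role of the transition table; consuming one
# "symbol" corresponds to expanding the frontier by one hop.

def advance(frontier, adj):
    """One NFA-style step: follow every outgoing transition of every active state."""
    return {w for v in frontier for w in adj.get(v, ())}

def reachable_within(adj, sources, hops):
    seen, frontier = set(sources), set(sources)
    for _ in range(hops):
        frontier = advance(frontier, adj) - seen
        seen |= frontier
    return seen

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(reachable_within(adj, {"a"}, 2))   # {'a', 'b', 'c', 'd'}
```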
Article
There is a growing need to perform real-time analytics on dynamic graphs in order to deliver the values of big data to users. An important problem from such applications is continuously identifying and monitoring critical patterns when fine-grained updates at a high velocity occur on the graphs. A lot of efforts have been made to develop practical solutions for these problems. Despite the efforts, existing algorithms showed limited running time and scalability in dealing with large and/or many graphs. In this paper, we study the problem of continuous multi-query optimization for subgraph matching over dynamic graph data. (1) We propose annotated query graph, which is obtained by merging the multi-queries into one. (2) Based on the annotated query, we employ a concise auxiliary data structure to represent partial solutions in a compact form. (3) In addition, we propose an efficient maintenance strategy to detect the affected queries for each update and report corresponding matches in one pass. (4) Extensive experiments over real-life and synthetic datasets verify the effectiveness and efficiency of our approach and confirm a two orders of magnitude improvement of the proposed solution.
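As a heavily simplified illustration of incremental subgraph matching over a dynamic graph (one fixed pattern, insertions only; the system described above handles general annotated query graphs, deletions, and multi-query sharing), the following hypothetical Python sketch reports the new triangle matches produced by each edge insertion.

```python
from collections import defaultdict

# Minimal incremental matching of one pattern (triangles) on an undirected
# dynamic graph: on each edge insertion, only matches that involve the new
# edge are computed and reported.

class TriangleMonitor:
    def __init__(self):
        self.adj = defaultdict(set)
        self.matches = set()

    def insert_edge(self, u, v):
        # Any new triangle must use (u, v), so it is completed by a
        # common neighbor of u and v.
        new = {tuple(sorted((u, v, w))) for w in self.adj[u] & self.adj[v]}
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.matches |= new
        return new   # matches reported for this single update

m = TriangleMonitor()
for e in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    print(e, "->", m.insert_edge(*e))
# ('a', 'c') -> {('a', 'b', 'c')}; the other insertions report no new match.
```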
Article
Full-text available
Visualization has proven to be a useful tool for understanding network structures. Yet the dynamic nature of social media networks requires powerful visualization techniques that go beyond static network diagrams. To provide strong temporal network visualization tools, designers need to understand what tasks the users have to accomplish. This paper describes a taxonomy of temporal network visualization tasks. We identify the 1) entities, 2) properties, and 3) temporal features, which were extracted by surveying 53 existing temporal network visualization systems. By building and examining the task taxonomy, we report which tasks are well covered by existing systems and make suggestions for designing future visualization tools. The feedback from 12 network analysts helped refine the taxonomy.
Article
Full-text available
In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version" of an array. This makes it possible to perform comparisons of versions produced at different times or by different algorithms, and to create complex chains and trees of versions. We present algorithms to efficiently encode these versions, minimizing storage space while still providing efficient access to the data. Additionally, we present an optimal algorithm that, given a long sequence of versions, determines which versions to encode in terms of each other (using delta compression) to minimize total storage space or query execution cost. We compare the performance of these algorithms on real world data sets from the National Oceanic and Atmospheric Administration (NOAA), Open Street Maps, and several other sources. We show that our algorithms provide better performance than existing version control systems not optimized for array data, both in terms of storage size and access time, and that our delta-compression algorithms are able to substantially reduce the total storage space when versions exist with a high degree of similarity.
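One simplified way to see the version-encoding choice (not the cited paper's exact algorithm, and assuming symmetric delta sizes and a pure storage objective) is as a minimum spanning tree problem: connect a virtual root to every version with weight equal to its full size, connect version pairs with weight equal to their delta size, and the MST then says which versions to materialize in full and which to delta-encode. A hypothetical Python sketch:

```python
import heapq

# Build a graph with a virtual root "R": R--v has weight = full size of v,
# and v--w has weight = size of the delta between v and w (assumed symmetric).
# A minimum spanning tree of this graph minimizes total storage.

def storage_plan(full_sizes, delta_sizes):
    """full_sizes: {version: size}; delta_sizes: {(v1, v2): size}."""
    graph = {v: {} for v in full_sizes}
    graph["R"] = {}
    for v, s in full_sizes.items():
        graph["R"][v] = s
        graph[v]["R"] = s
    for (a, b), s in delta_sizes.items():
        graph[a][b] = s
        graph[b][a] = s
    # Prim's algorithm starting from the virtual root.
    plan, seen = [], {"R"}
    heap = [(w, "R", v) for v, w in graph["R"].items()]
    heapq.heapify(heap)
    while heap:
        w, parent, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        plan.append((v, "full" if parent == "R" else f"delta from {parent}", w))
        for nxt, w2 in graph[v].items():
            if nxt not in seen:
                heapq.heappush(heap, (w2, v, nxt))
    return plan

print(storage_plan({"v1": 100, "v2": 100, "v3": 100},
                   {("v1", "v2"): 5, ("v2", "v3"): 7}))
# [('v1', 'full', 100), ('v2', 'delta from v1', 5), ('v3', 'delta from v2', 7)]
```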
Article
The need for supporting time-varying information in databases has been recognized for quite some time. Many authors have proposed numerous schemes to satisfy this need by incorporating one or two time attributes in the database. Unfortunately, there has been confusion concerning the terminology and definition of these time attributes. This paper proposes a new taxonomy of three times for use in databases, one that is more cleanly defined, that may be conceptualized in a pictorial fashion, and that defines several kinds of databases differentiated by their ability to represent temporal information. The paper argues that future database management systems should support all three times to fully capture time-varying behavior.
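For readers unfamiliar with the distinction, the taxonomy separates, among other things, the time during which a fact holds in the modeled world (valid time) from the time at which it is recorded in the database (transaction time), with user-defined time kept as ordinary attribute values. A hypothetical Python record keeping the first two side by side might look as follows; the field names are illustrative, not from the cited paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout that keeps two notions of time per fact.
# valid_* : when the fact was true in the real world
# tx_*    : when the fact was entered into / superseded in the database

@dataclass
class TemporalFact:
    subject: str
    value: str
    valid_from: str
    valid_to: Optional[str]   # None = still valid in the real world
    tx_from: str
    tx_to: Optional[str]      # None = part of the current database state

# Alice moved to Berlin in June 2019; the database only learned of it in 2020.
fact = TemporalFact("alice.city", "Berlin",
                    valid_from="2019-06-01", valid_to=None,
                    tx_from="2020-01-15", tx_to=None)
print(fact)
```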
Conference Paper
GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. This paper serves the dual role of describing the GPS system, and presenting techniques and experimental results for graph partitioning in distributed graph-processing systems like GPS. GPS is similar to Google's proprietary Pregel system, with three new features: (1) an extended API to make global computations more easily expressed and more efficient; (2) a dynamic repartitioning scheme that reassigns vertices to different workers during the computation, based on messaging patterns; and (3) an optimization that distributes adjacency lists of high-degree vertices across all compute nodes to improve performance. In addition to presenting the implementation of GPS and its novel features, we also present experimental results on the performance effects of both static and dynamic graph partitioning schemes, and we describe the compilation of a high-level domain-specific programming language to GPS, enabling easy expression of complex algorithms.
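GPS follows the vertex-centric, superstep-based programming style of Pregel. The sketch below is a generic, single-process Python rendering of that style (it is not GPS's actual Java API): in each superstep every vertex sums its incoming messages, updates its value, and sends its share to its out-neighbors, shown here for PageRank.

```python
# Generic "think like a vertex" PageRank in the Pregel/GPS style.
# Single-process illustration only; dangling vertices are ignored for brevity.

def pagerank(adj, supersteps=20, d=0.85):
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in adj}
        # compute() phase: each vertex sends rank/out_degree to its neighbors.
        for v, out in adj.items():
            if out:
                share = rank[v] / len(out)
                for w in out:
                    inbox[w] += share
        # update phase: combine received messages into the new rank values.
        rank = {v: (1 - d) / n + d * inbox[v] for v in adj}
    return rank

adj = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
print(pagerank(adj))
```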
Article
We introduce Trinity, a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs. Trinity leverages graph access patterns in online and offline computation to optimize the use of main memory and communication in order to deliver the best performance. With Trinity, we can perform efficient graph analytics on web-scale, billion-node graphs using dozens of commodity machines, while existing platforms such as MapReduce and Pregel require hundreds of machines. In this paper, we analyze several typical and important graph applications, including search in a social network, calculating PageRank on a web graph, and subgraph matching on web-scale graphs without using an index, to demonstrate the strength of Trinity.
Article
When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses are sketches, i.e., those based on linear projections of the data. These are applicable in many models including various parallel, stream, and compressed sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements. In this paper we consider properties of graphs including the size of the cuts, the distances between nodes, and the prevalence of dense sub-graphs. Our main result is a sketch-based sparsifier construction: we show that Õ(nε⁻²) random linear projections of a graph on n nodes suffice to (1 + ε)-approximate all cut values. Similarly, we show that O(ε⁻²) linear projections suffice for (additively) approximating the fraction of induced sub-graphs that match a given pattern such as a small clique. Finally, for distance estimation we present sketch-based spanner constructions. In this last result the sketches are adaptive, i.e., the linear projections are performed in a small number of batches where each projection may be chosen dependent on the outcome of earlier sketches. All of the above results immediately give rise to data stream algorithms that also apply to dynamic graph streams where edges are both inserted and deleted. The non-adaptive sketches, such as those for sparsification and subgraphs, give us single-pass algorithms for distributed data streams with insertions and deletions. The adaptive sketches can be used to analyze MapReduce algorithms that use a small number of rounds.
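The property that makes sketches usable on dynamic graph streams is linearity: a deletion is just an update with weight −1. The hypothetical Python sketch below illustrates only that property, on the edge-multiplicity vector; the cut-sparsifier and spanner sketches described above are substantially more involved.

```python
# A tiny linear sketch of the edge-multiplicity vector. Because the sketch is
# a linear projection, deletions are handled by updating with weight -1.
# hash() is stable within a single run, which is all this demo needs.

class EdgeCountSketch:
    def __init__(self, width=64):
        self.width = width
        self.table = [0.0] * width

    def _bucket_and_sign(self, edge):
        h = hash((edge, "bucket")) % self.width
        s = 1 if hash((edge, "sign")) % 2 == 0 else -1
        return h, s

    def update(self, u, v, weight=1):
        edge = tuple(sorted((u, v)))
        h, s = self._bucket_and_sign(edge)
        self.table[h] += s * weight     # linear in the underlying vector

    def estimate(self, u, v):
        edge = tuple(sorted((u, v)))
        h, s = self._bucket_and_sign(edge)
        return s * self.table[h]

sk = EdgeCountSketch()
sk.update("a", "b")           # insertion
sk.update("a", "b")           # insertion
sk.update("a", "b", -1)       # deletion
print(sk.estimate("a", "b"))  # 1.0 (exact here, since no other edge collides)
```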
Article
Changes in observed social networks may signal an underlying change within an organization, and may even predict significant events or behaviors. The breakdown of a team’s effectiveness, the emergence of informal leaders, or the preparation of an attack by a clandestine network may all be associated with changes in the patterns of interactions between group members. The ability to systematically, statistically, effectively and efficiently detect these changes has the potential to enable the anticipation of change, provide early warning of change, and enable faster response to change. By applying statistical process control techniques to social networks we can detect changes in these networks. Herein we describe this methodology and then illustrate it using three data sets. The first deals with the email communications among graduate students. The second is the perceived connections among members of al Qaeda based on open source data. The results indicate that this approach is able to detect change even with the high levels of uncertainty inherent in these data.
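A toy version of the idea (hypothetical code, using a plain Shewhart-style limit rather than the statistical process control charts applied in the cited work): summarize each observation period by a single network statistic, here graph density, and flag periods that deviate from the baseline by more than k standard deviations.

```python
from statistics import mean, stdev

# Change detection over a sequence of network snapshots: per-period graph
# density monitored against control limits estimated from the first
# `baseline` periods.

def density(edges, n_nodes):
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))

def flag_changes(snapshots, n_nodes, baseline=4, k=3.0):
    d = [density(e, n_nodes) for e in snapshots]
    mu, sigma = mean(d[:baseline]), stdev(d[:baseline])
    return [i for i, x in enumerate(d[baseline:], start=baseline)
            if abs(x - mu) > k * sigma]

# Four quiet weeks of communication, then a burst of new interactions.
weeks = [{(1, 2), (2, 3)}, {(1, 2), (2, 3), (3, 4)}, {(1, 2), (1, 4)},
         {(2, 3), (1, 4), (2, 4)},
         {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}]
print(flag_changes(weeks, n_nodes=4))   # [4]
```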
How do real graphs evolve over time? What are normal growth patterns in social, technological, and information networks? Many studies have discovered patterns in static graphs, identifying properties in a single snapshot of a large network or in a very small number of snapshots; these include heavy tails for in- and out-degree distributions, communities, small-world phenomena, and others. However, given the lack of information about network evolution over long periods, it has been hard to convert these findings into statements about trends over time. Here we study a wide range of real graphs, and we observe some surprising phenomena. First, most of these graphs densify over time, with the number of edges growing superlinearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes (like O(log n) or O(log log n)). Existing graph generation models do not exhibit these types of behavior even at a qualitative level. We provide a new graph generator, based on a forest fire spreading process, that has a simple, intuitive justification, requires very few parameters (like the flammability of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the present study. We also notice that the forest fire model exhibits a sharp transition between sparse graphs and graphs that are densifying. Graphs with decreasing distance between the nodes are generated around this transition point. Last, we analyze the connection between the temporal evolution of the degree distribution and densification of a graph. We find that the two are fundamentally related. We also observe that real networks exhibit this type of relation between densification and the degree distribution.
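For illustration, here is a simplified, undirected Python sketch of a forest-fire-style generator (forward burning only, with a geometrically distributed burn count; the model in the paper is directed and also burns in-links with a separate backward ratio).

```python
import random

# Simplified, undirected forest-fire generator: each new node picks an
# ambassador, recursively "burns" some of its neighbors, and links to
# every burned node.

def forest_fire_graph(n, p=0.37, seed=1):
    rng = random.Random(seed)
    adj = {0: set()}
    for v in range(1, n):
        adj[v] = set()
        ambassador = rng.randrange(v)
        burned, frontier = {ambassador}, [ambassador]
        while frontier:
            w = frontier.pop()
            # Burn a geometrically distributed number of w's unburned neighbors.
            x = 0
            while rng.random() < p:
                x += 1
            fresh = [u for u in adj[w] if u not in burned]
            rng.shuffle(fresh)
            for u in fresh[:x]:
                burned.add(u)
                frontier.append(u)
        for u in burned:          # v links to every burned node
            adj[v].add(u)
            adj[u].add(v)
    return adj

g = forest_fire_graph(200)
print(sum(len(nbrs) for nbrs in g.values()) // 2, "edges on 200 nodes")
```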