Scaling Betweenness Centrality in Dynamic Graphs

Alok Tripathy, Oded Green¹
¹ Georgia Institute of Technology
Abstract: The Betweenness Centrality of a vertex is an
important metric used for determining how “central” a vertex
is in a graph based on the number of shortest paths going
through that vertex. Computing the betweenness centrality of a
graph is computationally expensive, O(V·(V+E)). This has led
to the development of several important optimizations including
approximation, parallelization, and dealing with dynamic
updates. Dynamic graph algorithms are extremely favorable as
the amount of work that they require is orders of magnitude
smaller than their static graph counterparts. Recently, several
such dynamic graph algorithms for betweenness centrality have
been introduced. Many of these new dynamic graph algorithms
tend to have decent parallel scalability when the betweenness
centrality metric is computed in an exact manner. However, for
the cases where the approximate solution is used, the scalability
drops because of bad load-balancing due to the reduction
in the amount of work. In this paper, we show a dynamic
graph betweenness centrality algorithm that has good parallel
scalability for both exact and approximate computations. We
show several new optimizations made to the data structures,
the load balancing technique, and the parallel granularity that
make the algorithm 1.6X to 4X faster than
one of the fastest previous implementations. Moreover, our new
algorithm scales to larger thread counts than before.
I. INTRODUCTION
Betweenness Centrality (BC) is a popular graph analytic
for assigning the relative importance of vertices. It can be
used to find important players in a social network, vital
transport hubs in road networks, and key animals in a biological
network. The BC of a vertex is the fraction of shortest paths
going through the vertex over all shortest paths in the graph.
Computing betweenness centrality can be computationally
expensive, and as such, it is desirable to avoid recomputing
this metric for each change in a dynamic graph. A dynamic
graph is one that supports edge and vertex insertions and
deletions, and a dynamic graph algorithm is one that properly
accounts for changes in the graph. A naive approach is to
rerun the static graph algorithm for each change, but this
can be computationally expensive and perform numerous
unnecessary operations. In the last decade, several different
BC algorithms have been introduced, including the first three
algorithms that came out at roughly the same time [12], [18],
[15]. While each of these algorithms, and many of the later
algorithms, took a distinct approach to dealing with graph
updates, they can be distinguished in several different ways:
1) are they scalable with respect to the graph size? 2) can
they be parallelized? and 3) can they be approximated? In
the related work section, we discuss this in more detail.
One of the key challenges with dynamic graph algorithms
is that they require orders of magnitude less work than their
static graph counterparts. This is a double-edged sword: on
the one hand far less needs to be done, but on the other
hand the parallelization becomes more challenging. This is
especially true for approximation algorithms as these
reduce the amount of work by an additional order
of magnitude. Applying all these optimizations to a single
algorithm can lead to significant load-balancing problems.
This is illustrated in Figure 1, which depicts a speedup
plot for dynamic graph BC taken from [13] (using edge
insertions). As can be seen, the algorithm stops scaling
by the time 15 threads are used. This is especially
undesirable as the number of threads in a system is likely
only to increase in the upcoming years. Increasing scalability
of dynamic graph algorithms ensures that fewer kernels are
needed to fully utilize advanced systems.
FIG. 1: Parallel scalability of the dynamic BC algorithm
taken from [12], [13] on the as-skitter network.
In this work, we optimize the dynamic betweenness
centrality algorithm first introduced in [12] and later extended
in [13]; we refer to this algorithm as Green’s algorithm.
Green’s algorithm is fully dynamic (supports both edge
insertions and deletions), can be parallelized with relative
ease, and can be approximated. Green’s algorithm originally
focused on coarse-grained parallelism, i.e. each BFS tree is
assigned a single thread. This is one of the factors that led to
load balancing problems as a single BFS tree could become
an execution bottleneck. We resolve this problem in this
paper. On top of providing a load-balanced mechanism, we
also introduce a parallel hybrid implementation for Green’s
algorithm that selects between fine-grain and coarse-grain
parallelism.
Overall, our main contributions are:
• We propose a novel data structure for improving the
performance of Green’s streaming betweenness centrality
algorithm. This data structure increases parallel scalability
(allowing for fine-grain parallelism) while also reducing
certain overheads. The overhead reduction is crucial for
the workload estimation. This optimization alone improves
performance by a factor of 2X in addition to the scalability
improvements.
• We propose a new method for estimating the amount of
work generated due to an edge update. Using this workload
estimation, we develop several different load-balancing
techniques that can be used for different types of parallel
techniques. This approach improves performance and scalability
by 1.7X to 2.6X over [12].
• Unlike Green’s algorithm, which used only coarse-grain
parallelism, we show dynamic betweenness centrality at
several different granularities. Our new coarse-grain approach is
faster by 70% to 260%. Our new hybrid coarse-grain and
fine-grain approach to dynamic betweenness centrality improves
scalability by 10% over the coarse-grain approach.
II. RELATED WORK
A. Betweenness Centrality
The BC of a vertex is the fraction of shortest paths in
the graph that go through this vertex over the total number
of shortest paths in the graph. Formally, this is defined as
follows: $\sigma_{st}$ is the number of shortest paths from a vertex $s$
to a vertex $t$, and $\sigma_{st}(v)$ is the number of shortest paths from
$s$ to $t$ through a vertex $v$. For a vertex $v$, the BC $C_B(v)$ is
$$C_B(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{1}$$
Brandes [6] showed an efficient algorithm for computing
the BC values in O(V·(V+E)) time, and that greatly
reduced the amount of work for computing it. Specifically,
Brandes’s algorithm has the best known time complexity
for static graph algorithms and has been used for several
dynamic graph algorithms as the basis of the computation.
In Brandes’s algorithm, a BFS traversal is executed from
each vertex in the graph, followed by a second phase called
the dependency accumulation phase. Readers are referred to
[6] for additional detail.
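The two phases can be illustrated with a minimal Python sketch for unweighted graphs. This follows the standard formulation of Brandes's algorithm rather than any code from this paper; the function name `brandes_bc` and the adjacency-dictionary input format are assumptions made for the example. Because every ordered pair (s, t) is counted, an undirected path a-b-c gives the middle vertex a score of 2.

```python
from collections import deque

def brandes_bc(adj):
    """Exact BC for an unweighted graph given as {vertex: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, recording distances, shortest-path
        # counts (sigma), and predecessors on shortest paths.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            x = queue.popleft()
            order.append(x)
            for w in adj[x]:
                if w not in dist:
                    dist[w] = dist[x] + 1
                    queue.append(w)
                if dist[w] == dist[x] + 1:
                    sigma[w] += sigma[x]
                    preds[w].append(x)
        # Phase 2: dependency accumulation, visiting vertices in
        # reverse BFS order (deepest first).
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for p in preds[w]:
                delta[p] += sigma[p] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The outer loop over roots is what makes the cost O(V·(V+E)): each root contributes one BFS and one accumulation pass.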
Approximation techniques, such as [1] and [9], take different
approaches to selecting a subset of roots that will be
used to estimate the BC values. However, the differences
in the selection processes are not important for the dynamic
graph algorithm as the roots are determined in a pre-processing
phase. What is important to note is that the approximation
techniques reduce the time complexity from O(V·(V+E))
to O(K·(V+E)), where K is the number of roots used
in the approximation. For K << |V|, the approximation is
orders of magnitude faster than the exact solution.
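As a rough worked example of this reduction, take the K = 256 roots used in the experiments of this paper and the as-skitter network from Table 1, with about 1.6M vertices. The ratio below compares asymptotic operation counts only and ignores constant factors, so it is an indication of scale rather than a measured speedup:

$$\frac{|V| \cdot (|V| + |E|)}{K \cdot (|V| + |E|)} = \frac{|V|}{K} \approx \frac{1.6 \times 10^{6}}{256} = 6250$$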
B. All-Pair Shortest Path
Computing exact BC can be done by simply running an
All-Pair Shortest Path (APSP) algorithm. Algorithms such
as Floyd-Warshall and Johnson’s algorithm can be used but
have significant overheads: 1) Floyd-Warshall runs in O(V^3)
time and requires O(V^2) storage, and 2) Johnson’s algorithm
reduces the time complexity to O(V^2 log V + VE) through
the use of Fibonacci heaps, which in turn reduce scalability.
Brandes’s algorithm improves on both these algorithms.
A variety of dynamic graph APSP algorithms have been
developed as well. While some target exact solutions, e.g.
[7], [16], others approximate values, e.g. [16] as well as
[23]. Unfortunately, most of these algorithms are highly
theoretical and struggle to perform efficiently on large graphs
in part due to the storage requirements and lack of parallel
scalability. Similar constraints are also seen in dynamic graph
BC algorithms, especially in those that are built on top of
dynamic graph APSP.
C. Dynamic Betweenness Centrality
Because of the aforementioned issues, there has been a
plethora of work done in the area of dynamic BC algorithms.
The first three algorithms for dynamic BC are those of Lee et al. [18],
Kas et al. [15], and Green et al. [12]. The algorithm of Green, McColl, and
Bader will henceforth be referred to as “Green’s algorithm”.
Green’s algorithm has been extended to the GPU [21] and
the map-reduce framework [17].
The algorithms in this literature vary in several different
ways, such as parallelization, scalability, and approximation.
Lee et al.’s [18] QUBE algorithm targets exact BC and works
well only for sparse graphs where |E| ≈ |V|. Jamour et
al. [14] show an algorithm with constraints similar to those of [18],
though parallelization has been added. Approximation is
not supported, and their comparison with Green’s algorithm
shows that they are over 100X slower even though their
algorithm can only work for networks of extreme sparsity.
Green’s algorithm scales to graphs with tens of millions of
vertices as well as much denser graphs.
Kas et al. [15] outperform QUBE while supporting
weighted networks, though parallelization and deletions are
not discussed. Bergamini et al. [5] study several sequential
and approximate implementations. While their algorithms
can approximate the betweenness centrality, it seems that
parallelization is not possible for all phases of the computation.
Nasre et al. [22] provide an additional exact betweenness
centrality algorithm but do not provide experimental results
or details on how to support deletions.
To the best of our knowledge, Green’s algorithm [12] is the
only one in the literature that supports exact and approximate
computation of Betweenness Centrality, supports edge inser-
tions, edge deletions, and parallelization. For these reasons,
we choose to optimize Green’s algorithm. In this work we
will show how to scale Green’s algorithm to larger thread
counts.
D. Dynamic Graph Data Structures
In order to implement algorithms for dynamic graphs
in an efficient manner, a dynamic graph data structure is
necessary. There are several such structures in the literature.
One popular CPU-based data structure is STINGER [8].
cuSTINGER [11] shows a dynamic CSR data structure on
the GPU. EvoGraph [24] and AIMS [26] are additional GPU-
based dynamic graphs. EvoGraph [24] uses two separate data
structures for managing updates made to the graph, requires
restarts every so often, and reduces locality as edges might
appear in both data structures. AIMS [26] uses the entire
GPU memory just for the data structure and thus cannot be
used for any analytics.
III. OPTIMIZATIONS AND WORKLOAD ESTIMATION
In this section we present several optimizations we have
made to Green’s algorithm [12], [13] to improve its per-
formance and scalability. We start off by briefly discussing
Green’s algorithm in additional detail.
A. Green’s Algorithm
Green’s dynamic BC algorithm can compute both exact
and approximate BC values. It also works for both edge
insertions and deletions. For exact BC computation in
dynamic graphs, it extends Brandes’s algorithm. In the exact
scenario, Green’s algorithm maintains |V| BFS trees, such
as the ones represented in Figure 2. Each tree is rooted at a
different vertex in the graph. When inserting an edge (u, v)
into the graph, the edge is inserted into each of the trees.
Taking advantage of the fact that these trees have already
been computed, Green’s algorithm can update shortest paths
by going through only a partial set of vertices and edges in the
graph. Thus, the dynamic graph algorithm can end up being
thousands of times faster than the static graph algorithm. The
algorithm works in two phases.
First, it performs a BFS down a subtree within each tree.
The edge (u, v) is either inserted into or deleted from the graph;
assume d[u] < d[v] without loss of generality, where d[x] is the
depth of x in the tree. A BFS is performed on v’s subtree since
inserting (u, v) creates no new shorter paths for the vertices
between v and the root. In every
iteration, only vertices below v are enqueued and accessed.
This is illustrated by the yellow sub-tree (triangle) in Figure
2. A nice artifact of the algorithm is that in the event d[u] =
d[v], no work is required to update the shortest paths or the
BC scores.
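The first phase can be sketched for a single BFS tree as follows. This is a simplified Python illustration and not the authors' implementation: it updates only the shortest-path counts (called `sigma` here), and it covers only the two simplest insertion cases, d[u] = d[v] and d[v] = d[u] + 1. The case in which vertices change level is deliberately left out. The function names and the dictionary-based graph format are assumptions made for the example, and the graph is assumed to be connected.

```python
from collections import deque

def bfs_counts(adj, root):
    """Static BFS: depth d and shortest-path count sigma from root."""
    d = {root: 0}
    sigma = {root: 1}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in d:
                d[w] = d[x] + 1
                sigma[w] = 0
                queue.append(w)
            if d[w] == d[x] + 1:
                sigma[w] += sigma[x]
    return d, sigma

def insert_edge_update_sigma(adj, d, sigma, u, v):
    """Insert (u, v) and update sigma in place for one tree.

    Returns the list of vertices touched by the partial BFS.
    """
    adj[u].append(v)
    adj[v].append(u)
    if d[u] > d[v]:
        u, v = v, u  # ensure d[u] <= d[v]
    if d[u] == d[v]:
        return []  # same level: no shortest path changes
    if d[v] - d[u] != 1:
        raise NotImplementedError("level changes are not covered by this sketch")
    # v gains sigma[u] new shortest paths; push the increase down v's subtree.
    extra = {v: sigma[u]}
    queue = deque([v])
    touched = []
    while queue:
        x = queue.popleft()
        touched.append(x)
        for w in adj[x]:
            if d[w] == d[x] + 1:  # w is one level below x
                if w not in extra:
                    extra[w] = 0
                    queue.append(w)
                extra[w] += extra[x]
    for x in touched:
        sigma[x] += extra[x]
    return touched
```

Only vertices below v are ever enqueued, so the cost depends on where the edge lands in the tree rather than on the size of the graph.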
In the second phase, the dependency accumulation (aka
reverse-BFS) phase is executed as described by Brandes.
In Brandes’s algorithm, the dependency accumulation goes
through all the vertices accessed in the BFS phase¹. Green’s
algorithm, however, will go through all the vertices found
in the first BFS phase as well as enqueue vertices on the
way up that have an update to the ratio of shortest paths
going through them. This is also depicted in Tree 1 of Figure
2. Suppose z is a vertex found in the BFS traversal and y
is a neighbor of z that is not found in the BFS. While y
does not have new shortest paths, z does have additional
shortest paths to the root, and, thus, the ratio of shortest
paths through y needs to be updated. This will be true for
all vertices that have y on a shortest path to the root. As
such, the reverse-BFS is computationally more demanding
than the BFS phase.
Lastly, in the approximate scenario where K << |V|, K
roots are maintained as opposed to |V|.
¹ Specifically, in Brandes’s algorithm, all the vertices that have a path
from the root are accessed.
FIG. 2: Two different BFS Trees for an edge update. (u, v)
can be inserted or deleted.
FIG. 3: Multi-Level Queue for Tree 2 (a vertices array holding
a, b, c, d, e, f and a levels array indexed 0 to 6).
B. Data Structure Optimization
In [12], a multi-level queue is used for storing the vertices
in each level of the tree. Specifically, in the BFS phase, there
is a queue for each level in the tree, and each vertex is added
to the queue corresponding to its level. This is especially
important for some cases where vertices change level in the
tree in comparison to their previous level. For such cases, the
number of vertices in a specific level is not known a priori,
and thus these queues cannot be statically preallocated. When
implemented with a straightforward linked list for each level,
as was done in [12], the multi-level queue proves to be a
performance bottleneck (in part due to the memory allocation
of new nodes). Additionally, traversing these linked lists
limits scalability and removes locality.
Our first contribution is to show a new multi-level queue
data structure. This data structure can be preallocated, re-
moving several memory allocations, and can improve locality
during the reverse-BFS when it accesses the vertices reached
by the BFS phase.
This data structure exploits the sequential nature of BFS.
When enqueueing vertices in the BFS phase into the multi-
level queue, vertices from depth i will be inserted first, then
vertices from depth i+1, and so on. Thus, rather than having
a naive linked list for each level in the multi-level queue, we
can implement the multi-level queue in a CSR-like fashion
with two arrays. One array stores all the nodes inserted
contiguously, and another stores offsets into this array for
each level.
As an example, Figure 3 depicts the multi-level queue for
Tree 2 in Figure 2. The data structure has two arrays: the
vertices array and the levels array. Each entry in the levels
array has two pointers, one to the first vertex with some
depth and one to the last vertex with that depth. If no vertices of
depth i exist, these pointers are null. Inserting a vertex v into
this new multi-level queue implementation is straightforward:
v is appended to the vertices array and the pointers for its
level are updated. This new implementation does not require
memory allocation while enqueueing and, thanks to its locality,
scales well. Empirically, this makes the edge insertion and
deletion process significantly more efficient.
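The CSR-like layout described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the class name `MultiLevelQueue`, its method names, and the use of `None` for the null pointers are all hypothetical, and the sketch assumes each vertex is enqueued at most once per update so that a capacity of |V| suffices.

```python
class MultiLevelQueue:
    """Preallocated multi-level queue: one contiguous vertices array plus
    first/last indices per level (hypothetical sketch)."""

    def __init__(self, num_vertices, num_levels):
        self.vertices = [None] * num_vertices  # all enqueued vertices, contiguous
        self.first = [None] * num_levels       # index of first vertex per level
        self.last = [None] * num_levels        # index of last vertex per level
        self.size = 0
        self.current_depth = 0

    def enqueue(self, v, depth):
        # BFS inserts depth i before depth i+1, so each level is contiguous.
        if depth < self.current_depth:
            raise ValueError("vertices must arrive in non-decreasing depth order")
        self.current_depth = depth
        i = self.size
        self.vertices[i] = v
        if self.first[depth] is None:
            self.first[depth] = i
        self.last[depth] = i
        self.size += 1

    def level(self, depth):
        """Vertices stored for one level (empty if the pointers are null)."""
        if self.first[depth] is None:
            return []
        return self.vertices[self.first[depth]:self.last[depth] + 1]

    def clear(self):
        """Reset for the next update without reallocating the arrays."""
        for i in range(len(self.first)):
            self.first[i] = None
            self.last[i] = None
        self.size = 0
        self.current_depth = 0
```

Because no node is allocated on `enqueue`, the same arrays can be reused across edge updates by calling `clear`, and the reverse-BFS can walk a level as a contiguous slice.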
In the reverse-BFS, we only require two queues: one for
the current depth, and one for the next depth. Recall that
in the reverse-BFS, new vertices can be found that were not
accessed in the BFS phase. This is the key difference between
static graph BC and dynamic graph BC.
In some cases, the difference between the number of
vertices and edges traversed from one root to another can be
orders of magnitude (especially as some roots might not have
any work at all; see [12] for additional details). However, the
amount of work per root is not known in advance, which leads
us to our next contribution. Adding to this complexity is the
fact that the number of roots can be fairly small, K << |V|.
When K ∼ |V| this problem disappears as the number of
roots per thread is fairly large.
C. Workload Estimation for Each BFS Tree
Our second contribution is load balancing Green’s algo-
rithm, which had limited scalability with each root computed
by a single thread [10]. This is referred to as coarse-grain
parallelism. Given the reduced amount of work computed in
the dynamic graph algorithm, the fine-grain parallelism² can
prove to be prohibitive due to various types of overheads
(synchronization, load-balancing, and thread management).
Since an edge can be inserted anywhere within a BFS tree
and each BFS tree is different, the amount of work per tree
will be different. This is illustrated in Figure 2, where the
same edge (u, v) is placed in two different BFS trees. Both
the BFS and the reverse-BFS are impacted by the location
of the edge insertion, and the execution time is primarily
dependent on the number of edges traversed in both these
phases. As such, we attempt to estimate the number of edges
accessed in the dynamic graph algorithm. For each root and
its tree, we estimate the number of edges below and above
each vertex using the arrays EB and EA, respectively. These
values are computed as part of the static graph algorithm and
are updated throughout the update process. How these values
are computed is discussed below. Given an edge update
(u, v) such that d[u] < d[v], we estimate the work for a given
root as follows:
$$\mathrm{Work}[v] = 2 \cdot EB[v] + EA[v] \tag{1}$$
The intuition for this estimate is as follows. The BFS
accesses all edges below v, so it accesses at least EB[v]
edges. The reverse-BFS will also access all edges below
v since it traverses v’s subtree upwards towards the root.
On top of this, the reverse-BFS will continue iterating into
frontiers above v. In Tree 1 within Figure 2, for instance,
the reverse-BFS will reach v and then continue to enqueue
a, w, b, and so on towards the root. Thus, the reverse-BFS
will access all EA[v] edges above v as well. We can
then estimate the number of edges that will be accessed as
2 · EB[v] + EA[v]. Note that this is not the exact
amount of work done, as the reverse-BFS can access edges
outside of v’s subtree as well. However, we show that this
approximation for the work performs well experimentally.
² Where multiple threads compute betweenness centrality for a given root.
Approximating EA and EB: To use this work estimation
scheme, however, we must provide a way to estimate
EA[v] and EB[v]. We use a simple recursive expression
to approximate values for each. These recursive equations
can be updated in the BFS and reverse-BFS computations.
Informally, EA[v] is the sum of the EA[u] values for all
neighbors u above vertex v, plus one per neighbor above vertex
v to account for edges incident to v. Mathematically, this is
$$EA[v] = \sum_{u \in \{N(v) \,:\, d[u] < d[v]\}} (EA[u] + 1) \tag{2}$$
Likewise, the estimate for EB[v] is
$$EB[v] = \sum_{u \in \{N(v) \,:\, d[u] > d[v]\}} (EB[u] + 1) \tag{3}$$
In the left tree in Fig. 2, for example, EA[v] = (EA[a] +
1) + (EA[w] + 1) + (EA[b] + 1) and EB[v] = (EB[c] +
1) + (EB[d] + 1). By applying the work estimate of Eq. (1) to
each of the trees (and their respective roots) we can estimate
the work for each tree. Using these estimates, we can achieve
better load balancing.
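The recursions for EA and EB and the resulting work estimate can be sketched in Python for one root. This is a static, from-scratch computation written for illustration; the paper maintains these values incrementally during updates, which is not shown here. The function names and the adjacency-dictionary format are assumptions, and the graph is assumed to be connected.

```python
from collections import deque

def estimate_edges(adj, root):
    """Return (d, EA, EB) for the BFS tree rooted at root."""
    d = {root: 0}
    order = []
    queue = deque([root])
    while queue:
        x = queue.popleft()
        order.append(x)
        for w in adj[x]:
            if w not in d:
                d[w] = d[x] + 1
                queue.append(w)
    # EA: process shallow vertices first so EA[u] is ready for every
    # neighbor u above x.
    ea = {}
    for x in order:
        ea[x] = sum(ea[u] + 1 for u in adj[x] if d[u] < d[x])
    # EB: process deep vertices first so EB[u] is ready for every
    # neighbor u below x.
    eb = {}
    for x in reversed(order):
        eb[x] = sum(eb[u] + 1 for u in adj[x] if d[u] > d[x])
    return d, ea, eb

def estimated_work(d, ea, eb, u, v):
    """Work estimate 2*EB[v] + EA[v] for an update of edge (u, v)."""
    if d[u] == d[v]:
        return 0  # same level: no work is required for this root
    if d[u] > d[v]:
        u, v = v, u
    return 2 * eb[v] + ea[v]
```

Note that the recursion counts an edge once for every path that reaches it, so an edge shared by several branches can be counted more than once. This is one reason the values are estimates rather than exact edge counts.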
D. Parallelization Schemes
Given the workload estimation, we are now able to im-
prove the scalability of the algorithm by ordering the execu-
tion of the roots based on the heaviest root first. Recall that
coarse-grain parallelism, one root per thread, is used in the
implementation of Green’s algorithm. To avoid bottlenecks,
such as those in Green’s algorithm where a single “heavy”
root becomes an execution bottleneck on one thread, we
suggest using a fine-grain implementation to alleviate the
workload imbalance.
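The benefit of running the heaviest root first can be illustrated with a small scheduling simulation in Python. The sketch assumes a dynamic schedule in which each root is handed to whichever thread becomes free first, which is an assumption of this example rather than a detail given in the paper; `makespan` is a hypothetical helper, and the work values are made-up units used only to show the effect.

```python
import heapq

def makespan(work, num_threads, heaviest_first):
    """Finish time when roots are assigned, in order, to the least-loaded thread."""
    order = sorted(work, reverse=True) if heaviest_first else list(work)
    loads = [0] * num_threads
    heapq.heapify(loads)
    for w in order:
        lightest = heapq.heappop(loads)  # the thread that frees up first
        heapq.heappush(loads, lightest + w)
    return max(loads)

# One heavy root among light ones (illustrative units, not measured data).
work = [1, 1, 1, 1, 4]
in_given_order = makespan(work, 2, heaviest_first=False)
reordered = makespan(work, 2, heaviest_first=True)
```

When the heavy root arrives last, one thread processes it while the other sits idle; starting it first lets the light roots fill the remaining thread. Ordering alone cannot help once a single root exceeds the average load per thread, which is the case the fine-grain implementation targets.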
There are numerous tradeoffs between fine-grain and
coarse-grain parallelism: fine-grain parallelism requires using
atomic instructions, refined load-balancing, and synchroniza-
tion. In contrast, the coarse-grain approach uses significantly
more memory.
Hybrid Implementation: As a first step, our goal is to
identify roots that can potentially become bottlenecks. This
can be done using the workload estimation discussed above.
For these roots, we will use multiple threads as discussed
in [2], [20]. For the “light” roots we use the coarse-grain
algorithm³.
Then, we establish a simple scheme to determine when
to run the coarse-grain implementation and when to run
the fine-grain implementation of the algorithm. Overall, the
hybrid-grained implementation works as follows: 1) compute
the work estimate for each tree, 2) use the fine-grain
implementation on the trees with the largest amount of work, using
a predetermined threshold value, and 3) run the coarse-grained
algorithm for the remaining trees.
³ This removes numerous parallel overheads such as thread launching and
synchronizations. This tends to be a large overhead for dynamic graph BC
as there is little work involved.
a) Threshold Selection: As a suitable threshold can vary quite
a bit between graphs and roots, we express the threshold as a
percentage of the total expected amount of work. Algorithm
1 depicts a high-level overview of the hybrid-grained
implementation.

Algorithm 1 Hybrid implementation with (u, v) inserted, d[u] < d[v]
    work_estimates ← []
    for t ∈ trees do
        work_estimates[t] ← 2 · EB[t[v]] + EA[t[v]]
    tmax ← max{work_estimates}
    T ← sum{work_estimates}
    if tmax/T > threshold then
        fine_grained(tmax)
    parallel for all remaining trees t do
        coarse_grained(t)
    end parallel for

Table 1: List of Networks Used (Networks for Experiments)
Graph               Type       |V|     |E|
in-2004 [3]         Internet   1.3M    13.5M
as-skitter [19]     Internet   1.6M    11.0M
com-youtube [19]    Social     1.1M    3.0M
belgium [3]         Road       1.4M    1.5M
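The selection step of the hybrid scheme can be sketched in Python. The sketch follows the high-level description in Algorithm 1: the tree with the largest estimate is run fine-grained if its share of the total estimated work exceeds the threshold, and the remaining trees are run coarse-grained. Returning the coarse-grained trees ordered heaviest first reflects the root ordering discussed earlier in this section. The function name `plan_hybrid` and the dictionary input are assumptions, and the actual fine-grained and coarse-grained kernels are not shown.

```python
def plan_hybrid(work_estimates, threshold):
    """Split trees into (fine_grained, coarse_grained) lists.

    work_estimates: {tree_id: estimated work, e.g. 2*EB[v] + EA[v]}
    threshold: fraction of the total estimated work, e.g. 0.5 for 50%
    """
    total = sum(work_estimates.values())
    fine = []
    if total > 0:
        t_max = max(work_estimates, key=work_estimates.get)
        if work_estimates[t_max] / total > threshold:
            fine.append(t_max)
    # Remaining trees run one per thread, heaviest first.
    coarse = sorted(
        (t for t in work_estimates if t not in fine),
        key=work_estimates.get,
        reverse=True,
    )
    return fine, coarse
```

With estimates of 15, 80, and 5 for three trees and a 50% threshold, the second tree holds 80% of the estimated work and is selected for fine-grained execution; with a 90% threshold no tree qualifies and all three run coarse-grained.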
IV. EXPERIMENTAL SETUP
The following experiments were run on an Intel Knights
Landing node on TACC’s Stampede2 cluster. A single node
was used for each experiment. Each node has a Xeon Phi
7250 processor with a 1.4GHz frequency, 68 physical cores,
and 4-way SMT. The system also had a 34MB LLC, 16GB
of MCDRAM, and 96GB of DDR4 RAM.
Datasets: We used four undirected and unweighted
real-world graphs to evaluate our algorithms. These graphs
were taken from the DIMACS 10 Graph Challenge set [3]
and from the Stanford SNAP database [19]. These graphs
are outlined in Table 1.
Experiment Details: To evaluate our algorithms, we
run an identical set of experiments for each graph using
a different set of optimizations. In a preprocessing phase,
50 edges from the original graph are randomly selected and
removed from the original graph. We then proceed to use
a parallel and approximate version of static betweenness
centrality to initialize all the data structures. Each of the
removed edges is then reinserted into the graph, and the BC
scores are modified. We also time the executions. We repeat
the same experiments while varying the thread count over 1,
2, 4, 8, 16, 32, and 64, and average the time taken for each
thread count over all 50 edges. We stop at 64 threads as our
system does not have 128 physical cores, and SMT would
hurt scalability. For the approximation, we have chosen
K = 256 roots uniformly at random using the technique
suggested in [1]⁴. We report results for insertions, though we
see similar results for edge deletions. The self-speedups in
Figure 4 are measured in comparison to a single thread
execution. The speedups in Figure 5 are measured in
comparison to [12].
⁴ Though the approximation method of [9] would work just as well and
would simply require changing the static BC algorithm.
Algorithms tested: We run this experiment on four
different algorithms. The first is the implementation used
in [12]. The second algorithm (Optimization 1) is the im-
plementation in [12] using our modified multi-level queue
data structure. The third algorithm (Optimization 2) is our
second algorithm with the load balancing mechanism used
to order roots, and our last algorithm is the hybrid im-
plementation. This is run with 1%, 3%, 5%, 10%, 30%,
and 50% thresholds. Since the 50% threshold generally
did the best, we plot the hybrid implementation with a
50% threshold as Optimization 3. The implementation with
Optimization 2 uses Optimization 1, and the implementation
for Optimization 3 uses Optimization 1 and Optimization 2.
V. PERFORMANCE ANALYSIS
Figure 4 depicts the speedups (ordinate) of the various
algorithms as a function of the number of threads used
(abscissa). Note, each algorithm is self-benchmarked in com-
parison to its own single thread execution. In contrast, Figure
5 depicts the speedups of each algorithm using [12] as the
baseline and the execution time is compared with an equal
number of threads.
First, consider the speedup as a function of the number of threads.
Recall that the original implementation of Green’s algorithm
suffers from scalability issues with large thread counts. As a
first step, we substitute the original implementation’s multi-level
queue data structure with the one presented in this paper
(Optimization 1). This improves the parallel speedup by at
least 1.6X on all graphs. For
example, the scalability for in-2004 goes up from 15X on
64 KNL cores to almost 28X.
Fig. 5 shows that our new algorithms (with their respective
optimizations) reduce the execution time by a large factor
over the previous algorithm [12], even when the same
number of threads is used. This difference can lead to an
execution time 4X smaller than before.
For the belgium road network, the scalability increases
from 20X to 50X and the execution time drops
by 3.5X. Thus, we can see that the new algorithms scale
better and do less work (leading to reduced execution times).
Lastly, notice that the latter two optimizations have different
execution times, though the difference between them is not
large. For in-2004, the load balancing technique did worse
than the data structure implementation. This could be due
to bad reordering of the roots or even the cost of doing the
reordering. Since in-2004 is the densest graph we test, there
are probably cases where our workload estimation does not
account properly for some vertices and edges. In contrast,
as-skitter and especially com-youtube are sparse graphs,
and our load balancing implementation is more effective on
sparse graphs. Thus, this figure points out that there is no
clear winner among the optimizations and implies that these
optimizations need to be selected on a per-graph basis.
FIG. 4: Self-speedup of the various algorithms for different thread counts. Panels: (a) in-2004, (b) as-skitter, (c) com-youtube, (d) belgium.
FIG. 5: Speedup of the various algorithms in comparison with [12]. Algorithms are compared for specific thread counts. Panels: (a) in-2004, (b) as-skitter, (c) com-youtube, (d) belgium.
FIG. 6: Speedup of the hybrid implementation with various thresholds in comparison with [12]. Panels: (a) in-2004, (b) as-skitter, (c) com-youtube, (d) belgium.
Lastly, we check the performance of our dynamic graph
BC algorithm that uses a mix of coarse-grain and fine-
grain parallelism for processing the roots. In general, this
optimization does not seem to add a lot of performance
improvement in comparison to the other two optimizations.
We note that several recent frameworks such as Ligra [25]
and GapBS [4] show slightly better scalability for BC, yet
we also note that these target static graphs only where the
work per root is orders of magnitude higher. In some cases
for dynamic BC, because the work per thread is so little,
we find the performance bottlenecked by the overhead of
launching threads.
Fig. 6 depicts the difference in performance for the hybrid
algorithm with varying threshold values in the range of 1%
to 50%. These are benchmarked against Green’s algorithm.
The results for in-2004, as-skitter, and com-youtube show that
the performance does not change significantly with different
thresholds. For belgium, the difference between the
thresholds can be as high as 50%. As belgium is the sparsest
graph, its work per thread is likely the smallest. Thus, lower
thresholds would be bottlenecked by launching threads or
synchronization. For all these graphs, several edges have a
single root that requires a significant amount of work and
benefits from using multiple threads for its execution. While
this may warrant additional analysis, it seems that using a
lenient threshold of 10% is good enough for our test cases.
VI. CONCLUSIONS
In this paper, we present several optimizations for
Green’s algorithm for Betweenness Centrality on dynamic
graphs. Green’s algorithm was unable to speed up substantially
for large thread counts. We introduce a number of
optimizations to Green’s algorithm that allow it to scale better
for larger thread counts and run efficiently on larger graphs,
with performance improvements ranging from 1.6X to 4X.
For future work, we hope to investigate further into a way
to determine granularity for the hybrid-grained implementa-
tion dynamically based on the graph structure. The challenge
lies in designing one that could accurately determine whether
to use coarse-grained or fine-grained granularity while also
remaining computationally cheap. Alternatively, there may
be an efficient method to determine the optimal threshold to
use for a tree, based on the density of the graph.
REFERENCES
[1] D. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating
betweenness centrality,” in Algorithms and Models for the Web-Graph,
ser. Lecture Notes in Computer Science, 2007, pp. 124–137.
[2] D. Bader and K. Madduri, “Parallel algorithms for evaluating central-
ity indices in real-world networks,” in International Conference on
Parallel Processing (ICPP), 2006.
[3] D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, Eds., Graph
Partitioning and Graph Clustering. 10th DIMACS Implementation
Challenge Workshop, ser. Contemporary Mathematics, no. 588, 2013.
[4] S. Beamer, K. Asanovi´
c, and D. Patterson, “The GAP Benchmark
Suite,” arXiv preprint arXiv:1508.03619, 2015.
[5] E. Bergamini, H. Meyerhenke, and C. L. Staudt, “Approximating
betweenness centrality in large evolving networks,” in Proceedings
of the Meeting on Algorithm Engineering & Expermiments. SIAM,
2015, pp. 133–146.
[6] U. Brandes, “A faster algorithm for betweenness centrality,” Journal
of Mathematical Sociology, vol. 25, no. 2, pp. 163–177, 2001.
[7] G. F. I. C. Demetrescu, “Fully dynamic all pairs shortest paths with
real edge weights,” Journal of Computer and System Sciences, vol. 72,
pp. 813–837, 2006.
[8] D. Ediger, R. McColl, J. Riedy, and D. Bader, “STINGER: High
Performance Data Structure for Streaming Graphs,” in IEEE High Per-
formance Embedded Computing Workshop (HPEC 2012), Waltham,
MA, 2012, pp. 1–5.
[9] R. Geisberger, P. Sanders, and D. Schultes, “Better Approximation of
Betweenness Centrality,” in ALENEX, 2008, pp. 90–100.
[10] O. Green and D. Bader, “Faster Betweenness Centrality Based on Data
Structure Experimentation,” in International Conference on Computa-
tional Science (ICCS). Elsevier, 2013.
[11] O. Green and D. Bader, “cuSTINGER: Supporting Dynamic Graph
Algorithms for GPUS,” in IEEE Proc. High Performance Extreme
Computing (HPEC), Waltham, MA, 2016.
[12] O. Green, R. McColl, and D. Bader, “A Fast Algorithm For Streaming
Betweenness Centrality,” in 4th ASE/IEEE International Conference
on Social Computing (SocialCom), 2012.
[13] O. Green, “High performance computing for irregular algorithms
and applications with an emphasis on big data analytics,” Ph.D.
dissertation, Georgia Institute of Technology, 2014.
[14] F. Jamour, S. Skiadopoulos, and P. Kalnis, “Parallel Algorithm for
Incremental Betweenness Centrality on Large Graphs,” in IEEE Trans-
actions on Parallel and Distributed Systems. IEEE, 2017, pp. 659–
672.
[15] M. Kas, M. Wachs, K. M. Carley, and L. R. Carley, “Incremental
algorithm for updating betweenness centrality in dynamically grow-
ing networks,” in Proceedings of the 2013 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining.
IEEE/ACM, 2013, pp. 33–40.
[16] V. King, “Fully Dynamic Algorithms for Maintaining All-Pairs Short-
est Paths and Transitive Closure in Digraphs,” in Proceeding FOCS
’99 Proceedings of the 40th Annual Symposium on Foundations of
Computer Science. ACM, 1999, p. 81.
[17] N. Kourtellis, G. D. F. Morales, and F. Bonchi, “Scalable online
betweenness centrality in evolving graphs,IEEE Transactions on
Knowledge and Data Engineering, vol. 27, no. 9, pp. 2494–2506,
2015.
[18] M.-J. Lee, J. Lee, J. Y. Park, R. H. Choi, and C.-W. Chung, “QUBE: a
Quick algorithm for Updating BEtweenness centrality,” in ACM Int’l
Conf. on World Wide Web, 2012, pp. 351–360.
[19] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network
dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
[20] K. Madduri, D. Ediger., K. Jiang, D. Bader, and D. Chavarria-Miranda,
“A Faster Parallel Algorithm and Efficient Multithreaded Implementa-
tions for Evaluating Betweenness Centrality on Massive Datasets,” in
IEEE International Symposium on Parallel and Distributed Processing
(IPDPS), 2009.
[21] A. McLaughlin and D. Bader, “Revisiting Edge and Node Parallelism
for Dynamic GPU Graph Analytics,” in IEEE International Parallel
and Distributed Processing Symposium Workshops (IPDPSW), 2014,
pp. 1396–1406.
[22] M. Nasre, M. Pontecorvi, and V. Ramachandran, “Betweenness cen-
trality - incremental and faster,” in Mathematical Foundations of
Computer Science 2014 - 39th International Symposium, MFCS 2014,
Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, 2014,
pp. 577–588.
[23] G. Ramalingam and T. Reps, “An Incremental Algorithm for a
Generalization of the Shortest-Path Problem,” Journal of Algorithms,
vol. 21, pp. 267–305, 1996.
[24] D. Sengputa and S. L. Song, “EvoGraph: On-the-Fly Efficient Min-
ing of Evolving Graphs on GPU,” in International Supercomputing
Conference, 2017, pp. 97–119.
[25] J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing
framework for shared memory,” in 18th ACM SIGPLAN symposium
on Principles and practice of Parallel Programming, 2013, pp. 135–
146.
[26] M. Winter, R. Zayer, and M. Steinberger, “Autonomous, independent
management of dynamic graphs on gpus,” in International Supercom-
puting Conference. Springer, 2017, pp. 97–119.