
# Logarithmic Radix Binning and Vectorized Triangle Counting

## Abstract

Triangle counting is a building block for numerous graph applications, and because graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting, some of which are compute-bound rather than memory-bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in part due to the high dependency between the control flow and the input data, and the limited utilization of vector instructions. Not surprisingly, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model, we show how to remove control flow restrictions by replacing branches with an equivalent set of arithmetic operations. Moreover, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2× to 5× faster than their scalar counterparts. We also present a new load-balancing method, Logarithmic Radix Binning (LRB), that ensures that threads and vector data lanes execute a near-equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions, such as the KOKKOS framework and a GPU-based algorithm, by 1.5× to 14×.
Oded Green, James Fox, Alex Watkins, Alok Tripathy, Kasimir Gabert,
Euna Kim, Xiaojing An, Kumar Aatish, and David A. Bader
Computational Science and Engineering, Georgia Institute of Technology - USA
I. INTRODUCTION
Triangle counting and enumeration is a widely used kernel for numerous applications. These include clustering coefficients, community detection, email spam detection, Jaccard indices, and finding maximal k-trusses. The key building block of triangle counting is adjacency list intersection. Numerous algorithms have been developed for triangle counting, and these encapsulate a wide range of programming models: vertex centric, edge centric, gather-apply-scatter (GAS), and linear algebra. Some approaches require the adjacency arrays to be sorted whereas other approaches do not. Most previous algorithms focus on scalar execution, as vectorizing these algorithms is typically challenging. In this paper, we show several new algorithms for vectorizing the computational kernels of triangle counting.
We also propose a new load-balancing technique, which we call Logarithmic Radix Binning (LRB), that ensures that all the threads and vector units get a near-equal amount of work. LRB improves on previous techniques, which focus only on thread-level load balancing, by also balancing work at the vector lane granularity, thus improving vector unit utilization.
In this paper, we present several new algorithms for triangle counting. These algorithms use techniques developed for the branch-avoiding model discussed in Green et al. [9], [7]. Specifically, Green et al. show that the cost of branch misprediction is high for data-dependent applications (such as graph algorithms) and that these applications can typically be implemented in an alternative manner that avoids branches entirely. This eliminates misprediction and makes execution times more consistent, at the cost of additional operations.
Algorithmic Contributions
We develop a two-tiered binning mechanism for triangle counting. In the first tier, for each edge we decide on an intersection method that will be applied to that edge. Our current implementation uses two different kernels (though it can be extended to more); therefore, this tier consists of two bins. For the second tier, we present Logarithmic Radix Binning (LRB). For each intersection kernel, a 2D set of bins is maintained. Each of these bins stores edges with similar computational properties. Thus, our vectorized triangle counting algorithm can grab K edges with similar computational requirements, where K is the vector width, allowing for good load balancing and utilization at the vector lane granularity. In practice, this offers good scalability.
We show how to increase the number of control flows in software. For a multi-threaded processor with P physical hardware threads and vector instructions K elements wide, we show how to execute up to P·K concurrent software threads, where each of these threads executes a different intersection. Thus, our new algorithm increases the control flow on the processor by a factor of K. For the Intel Knights Landing processor (discussed in Sec. V) used in our experiments, P = 272 and K = 16, which allows us to support up to 4352 concurrent software threads.
We show two novel vectorized triangle counting algorithms: 1) sorted set intersection, and 2) binary search. The first of these approaches finds common neighbors by scanning across the two sorted adjacency arrays being intersected, similar to a merge. The second approach finds common neighbors by searching for elements of one array in the other. We also show a runtime mechanism for deciding which algorithm should be selected. The LRB method is then used to ensure good execution.
LRB is architecture agnostic and has been extended to the GPU [5]. On the GPU, different bins are executed using a different number of threads, thus ensuring good load balancing.
Performance Contributions
We compare our new algorithm against several high-performing triangle counting algorithms, including [23], [31], [2], [27]. Several of these were the fastest triangle counting algorithms in the 2017 HPEC Graph Challenge [24]. Our algorithm is faster than all of these algorithms for all but a small number of instances across a wide range of test graphs. This includes outperforming KOKKOS [31] by an average of 4× and as much as 10×, and [2] by an average of 2× and as much as 6×.
From a scalability perspective, we show that our algorithm scales to large thread counts. This highlights the fact that LRB is frequently able to give each thread a near-equal amount of work and ensure good workload balance.
Our new vectorized algorithms offer a 2× to 5× speedup over their scalar counterparts (which also use the LRB load-balancing mechanism).
II. RELATED WORK
The range of applications in which triangle counting and triangle listing are used is broad. Triangle counting became an important metric to data scientists with the introduction of clustering coefficients [30]. Other applications for triangle counting are: finding transitivity [18], spam detection in email networks [1], finding tightly knit communities [21], finding k-trusses [4], [28], [10], and evaluating the quality of different community detection algorithms [17], [32]. An extended discussion of triangle counting applications can also be found in [3]. Triangle counting was also a key kernel for the HPEC Graph Challenge [22].
a) Computational Approaches: Given a graph G = (V, E), where V are the vertices and E are the edges in the graph, the three simplest and most widely used approaches for counting triangles in a graph are [26]: enumerating over all node triplets in O(|V|^3), using linear algebra operations in O(|V|^ω) (where ω < 2.376), and adjacency list intersection. Adjacency list intersection can be completed in multiple ways: sorted set intersections, hash tables, or binary searches to look up values. The time complexity of each of these approaches is data dependent, yet upper bounds can be given. Triangle counting can also be completed using the Gather-Apply-Scatter (GAS) programming model [6], [14]. In this work we focus on the adjacency intersection approaches.
b) Algorithmic Optimization: Numerous computational optimizations can be applied to triangle counting algorithms for static graphs to help reduce the overall execution time. For example, Green & Bader [8] present a combinatorial optimization that reduces the number of necessary intersections, offering a better complexity bound. Green et al. [12] show a scalable technique for load-balancing triangle counting on shared-memory systems. Shun & Tangwongsan [23], Polak [20], and Pearce [19] show how to reduce the computational requirements by finding triangles in a directed graph rather than the undirected graph. Leist et al. [15], Green et al. [13], Wang et al. [29], and Fox et al. [5] show several different strategies for implementing triangle counting on the GPU.
c) Vector Instruction Sets: Vector instructions have been an integral part of commodity processors for the last twenty years, though the history of Single Instruction Multiple Data (SIMD) programs goes back even further. In a SIMD instruction, each datum is placed in a separate lane and each lane executes the same instruction. Intel's AVX-512 ISA can operate on vectors of 512 bits. The AVX-512 instruction set has numerous conditional instructions, referred to as masked instructions in the AVX-512 ISA. Our algorithm makes extensive use of these instructions.
III. LOGARITHMIC RADIX BINNING
In this section, we present Logarithmic Radix Binning (LRB for short), a method that effectively load balances the intersections across the threads for both intersection algorithms. This method is efficient and works well for both the scalar and vectorized algorithms. LRB works by placing edges into bins based on the logarithmic value of the estimated amount of work for that edge. For triangle counting, we initially group the edges into two unique bins, one bin for the sorted set intersection and one for the binary search. We then apply LRB to each of the edges and distribute them over a 2D grid of bins.
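As a minimal sketch of the binning rule just described, the C function below maps an estimated work value to a bin index. The name lrb_bin, the use of base 2, and rounding down are assumptions made for illustration; the text only states that bins are chosen by the logarithmic value of the estimated work.

```c
#include <stdint.h>

/* Hypothetical helper: bin index = floor(log2(work)).
 * Work values 0 and 1 both map to bin 0. */
static int lrb_bin(uint64_t work)
{
    int bin = 0;
    while (work > 1) {   /* one iteration per halving, at most 63 */
        work >>= 1;
        ++bin;
    }
    return bin;
}
```

Under this rule, two edges that share a bin b >= 1 have estimated work in [2^b, 2^(b+1)), so they differ by less than a factor of two. That property is what would let K edges taken from one bin keep K vector lanes similarly busy.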
a) Initial binning: The intersection of the adjacency arrays can be done in multiple ways. Two popular approaches are 1) sorted set intersection and 2) binary search. In the sorted set intersection, common elements are found by moving across the two sorted adjacency arrays using a merge-like access pattern. Sorted set intersection performs well when the two adjacency arrays are of near-equal length; we call this a balanced intersection. However, when one adjacency array is extremely large and the other is small, that is, when the intersection is imbalanced, binary search is more efficient. For binary search, each element in the smaller array is looked up in the larger of the two arrays.
To determine which bin to place an edge (u, v) ∈ E into, we use the following estimates, where d_u and d_v denote the degrees (adjacency array lengths) of u and v:
$$\mathrm{IntersectionWork}(u, v) = d_u + d_v \tag{1}$$
$$\mathrm{BinaryWork}(u, v) = d_u \cdot \log(d_v) \tag{2}$$
Intersecting (u, v) and (v, u) will find the same triangles; as such, only one of these is necessary. For simplicity and performance reasons, we choose the edge (u, v) such that d_u < d_v. If d_u = d_v, we select the vertex with the smaller id. The intersection method selected is based on the minimum of Eq. (1) and Eq. (2).
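The selection rule can be sketched in C as follows. This is an illustration, not the authors' code: the names kernel_t, probe_count, choose_kernel, and keep_edge are hypothetical, and log(d_v) in Eq. (2) is approximated by the integer number of probes a binary search needs, floor(log2(d_v)) + 1, which is an assumption about the base and the rounding. The tie-break is read here as keeping the direction whose first vertex has the smaller id.

```c
#include <stdint.h>

typedef enum { KERNEL_SORTED_MERGE, KERNEL_BINARY_SEARCH } kernel_t;

/* Probes needed by a binary search over d elements:
 * floor(log2(d)) + 1 for d >= 1, and 0 for d == 0. */
static uint32_t probe_count(uint32_t d)
{
    uint32_t probes = 0;
    while (d > 0) {
        d >>= 1;
        ++probes;
    }
    return probes;
}

/* Assumes the edge is already oriented so that du <= dv. */
static kernel_t choose_kernel(uint32_t du, uint32_t dv)
{
    uint64_t merge_work  = (uint64_t)du + dv;               /* Eq. (1) */
    uint64_t search_work = (uint64_t)du * probe_count(dv);  /* Eq. (2), integer log */
    return search_work < merge_work ? KERNEL_BINARY_SEARCH
                                    : KERNEL_SORTED_MERGE;
}

/* Returns 1 if (u, v) is the direction to keep, 0 if (v, u) is kept instead. */
static int keep_edge(uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    return du < dv || (du == dv && u < v);
}
```

For example, two arrays of lengths 100 and 120 give a merge estimate of 220 against a search estimate of 700, so the merge kernel is chosen; lengths 4 and 1,000,000 give 1,000,004 against 80, so binary search is chosen.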
b) Finer grain binning: Fig. 1 depicts an edge list with the estimated amount of work for each edge. For each edge, the method with the minimal amount of estimated work is selected using Eq. (1) and Eq. (2). The yellow boxes denote edges that use the sorted set intersection approach and the blue boxes represent edges that use the binary search. The second row represents an edge ordering where the edges are placed into two bins, one for each approach. For the vectorized algorithm, this can lead to significant workload
Algorithm 2: Branch-avoiding with conditional instructions. Variants of the branch-based and branch-avoiding algorithms can be found in [7].
TriCountingBranchAvoidingConditionals()
  ai ← 0; bi ← 0; count ← 0;
  while (ai < |A| and bi < |B|) do
    CMP(A[ai], B[bi]);
    CADDEQ(count); // Conditional ADD if (A[ai] = B[bi])
Algorithm 3: Vectorized sorted intersection kernel
while (cond) {  // cond holds one mask bit per lane, set while that lane still has work
  AVec = _mm512_i32gather_epi32(indexA, EArr, 4);  // gather the current A element of every lane
  BVec = _mm512_i32gather_epi32(indexB, EArr, 4);  // gather the current B element of every lane
  cmpLeVec = _mm512_mask_cmple_epi32_mask(cond, AVec, BVec);
  cmpGeVec = _mm512_mask_cmpge_epi32_mask(cond, AVec, BVec);
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, AVec, BVec);
  tris = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);        // count a common neighbor
  indexA = _mm512_mask_add_epi32(indexA, cmpLeVec, indexA, miOne32);  // advance A where A <= B
  indexB = _mm512_mask_add_epi32(indexB, cmpGeVec, indexB, miOne32);  // advance B where A >= B
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexAStop, indexA);      // keep lanes with indexA < indexAStop
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexBStop, indexB);      // and with indexB < indexBStop
}
the threads get an equal amount of work. This will become apparent in Section V, where the scaling of the algorithm is almost perfectly linear.
f) Time Complexity Analysis: Phase 1: finding the proper bin takes O(|E|) steps for evaluating |E| edges. Phase 2: O(B^2) to compute the prefix matrices. Phase 3: an additional O(|E|) steps to reorder the edge list. Phase 1 and Phase 3 are embarrassingly parallel and easily split across the P threads. Phase 2 can also be done in parallel; however, the cost of the prefix operation on arrays of size B^2 is relatively small in comparison to the other two phases, and a sequential implementation is enough. Recall that B ∈ {32, 64} and that B^2 << |E|. The total time complexity is O(|E| + B^2) = O(|E|).
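The three phases map naturally onto a counting-sort style reorder. The sketch below is sequential and uses a one-dimensional array of bins for brevity, whereas the text describes a 2D grid of B^2 bins and parallel Phases 1 and 3; the names work_bin, lrb_reorder, and NUM_BINS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 64  /* hypothetical: one bin per floor(log2) of a 64-bit work value */

/* floor(log2(work)); 0 and 1 both map to bin 0. */
static int work_bin(uint64_t work)
{
    int bin = 0;
    while (work > 1) {
        work >>= 1;
        ++bin;
    }
    return bin;
}

/* Fills order[0..m-1] with edge ids grouped by bin, lightest bin first. */
static void lrb_reorder(const uint64_t *work, size_t m, size_t *order)
{
    size_t offset[NUM_BINS + 1] = {0};

    /* Phase 1: histogram, O(|E|). offset[b + 1] counts the edges of bin b. */
    for (size_t e = 0; e < m; ++e)
        offset[work_bin(work[e]) + 1]++;

    /* Phase 2: prefix sum, O(number of bins). offset[b] becomes the start of bin b. */
    for (int b = 0; b < NUM_BINS; ++b)
        offset[b + 1] += offset[b];

    /* Phase 3: scatter each edge id to the next free slot of its bin, O(|E|). */
    for (size_t e = 0; e < m; ++e)
        order[offset[work_bin(work[e])]++] = e;
}
```

A caller would pass one estimated work value per edge and then process the edges in the order given by order[], so that consecutive edges have similar estimated work.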
g) Storage Complexity Analysis: We use CSR (compressed sparse row) to represent the original graph and assume that the adjacency arrays are sorted. For triangle counting, CSR requires two arrays: one for the offsets, of size O(|V|), and one for the indices (edges), of size O(|E|). The binning technique used by LRB stores the edges in a different order. We use an array of size O(|E|) to store these edges. While this new edge list does not increase the theoretical upper bound, from a practical perspective it does double the memory consumption. The edges in the new reordered edge list determine the order in which the edges will be intersected. However, the intersection process itself uses the sorted adjacency arrays in the CSR graph.
IV. VECTORIZED ALGORITHMS
In this section we present our new branch-avoiding and vectorized triangle counting algorithms. Alg. 2 depicts the sorted list intersection using the branch-avoiding programming model [9], [7]. Note that the control flow for the branch-avoiding algorithms is largely independent of input values; this is a key enabling factor for our vectorized approach. See Green et al. [9], [7] for additional discussion on the branch-avoiding programming model.
Algorithm 4: Vectorized binary search kernel
while (cond) {  // cond holds one mask bit per lane, set while that lane is still searching
  sumVec = _mm512_add_epi32(low, high);
  middle = _mm512_maskz_srl_epi32(cond, sumVec, oneShifter);            // middle = (low + high) >> 1
  vals = _mm512_mask_i32gather_epi32(vals, cond, middle, EArr, 4);      // gather only for active lanes
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, vals, keys);
  cmpLtVec = _mm512_mask_cmplt_epi32_mask(cond, vals, keys);
  cmpGtVec = _mm512_mask_cmpgt_epi32_mask(cond, vals, keys);
  tris = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);          // key found: count it
  low = _mm512_mask_add_epi32(low, cmpLtVec, middle, miOne32);          // low = middle + 1
  high = _mm512_mask_add_epi32(high, cmpGtVec, middle, miMOne32);       // high = middle - 1
  cond = _mm512_mask_cmpge_epi32_mask(cond & ~cmpEqVec, high, low);     // continue while not found and high >= low
}
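To make the per-lane logic of Alg. 4 easier to follow, here is a scalar C model of what a single lane computes. It is a sketch under assumptions: the function names are hypothetical, the outer loop over the keys of the smaller array is inferred from the description of the binary search approach, and whether a compiler emits branch-free machine code for it is not guaranteed.

```c
#include <stdint.h>

/* Returns 1 if key occurs in the sorted array arr[0..n-1], else 0.
 * Comparison results are used as integers, mirroring the masks of Alg. 4. */
static int32_t search_branch_avoiding(const int32_t *arr, int32_t n, int32_t key)
{
    int32_t low = 0, high = n - 1, found = 0;
    int32_t active = (high >= low);
    while (active) {
        /* Same midpoint as (low + high) >> 1, written so it cannot overflow. */
        int32_t mid = low + ((high - low) >> 1);
        int32_t val = arr[mid];
        int32_t eq = (val == key);
        int32_t lt = (val < key);
        int32_t gt = (val > key);
        found += eq;                     /* masked add on the "equal" mask */
        low  += lt * (mid + 1 - low);    /* low  = mid + 1 where val < key */
        high += gt * (mid - 1 - high);   /* high = mid - 1 where val > key */
        active = !eq & (high >= low);    /* mirrors cond & ~cmpEqVec, high >= low */
    }
    return found;
}

/* Counts the elements of keys[0..nk-1] that also occur in sorted large[0..nl-1]. */
static int32_t intersect_binary_search(const int32_t *keys, int32_t nk,
                                       const int32_t *large, int32_t nl)
{
    int32_t count = 0;
    for (int32_t i = 0; i < nk; ++i)
        count += search_branch_avoiding(large, nl, keys[i]);
    return count;
}
```

In the vectorized kernel, each of the K lanes would run this search for a different key at the same time, with the mask playing the role of the active flag.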
Alg. 2 depicts a branch-avoiding algorithm for list intersection; this algorithm uses conditional instructions. Such instructions do not exist in all architectures and are in fact designed for single-control-flow systems, where a single instruction is executed based on hardware flags (zero flag, carry flag, and overflow flag). As such, a naive vector implementation might be constrained to a single set of these flags or would require a single control flow. We show how to overcome this hardware constraint using the AVX-512 instruction set. Specifically, we show 1) how to increase the number of software control flows and 2) how to control the execution of each lane using masks (even though we do not have enough hardware flags).
Our experience with conditional instructions is that the compiler is not able to figure out how to use them. We strongly differentiate between compare and branch instructions. Compare instructions are used for evaluating different values, whereas a branch typically uses the compare output to decide on the next sequence of executable instructions.
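The distinction can be illustrated with a scalar merge-style intersection in which comparison results are used as integers rather than as branch conditions. This is a sketch, not the authors' Alg. 2: the function name is hypothetical, and the index updates (advance A when A[ai] <= B[bi], advance B when A[ai] >= B[bi]) are inferred from the masks used in Alg. 3.

```c
#include <stdint.h>

/* Counts the common elements of two sorted arrays a[0..na-1] and b[0..nb-1].
 * Inside the loop, each comparison yields 0 or 1 and is added directly,
 * so the only data-dependent branch left is the loop condition. */
static int32_t intersect_merge_branch_avoiding(const int32_t *a, int32_t na,
                                               const int32_t *b, int32_t nb)
{
    int32_t ai = 0, bi = 0, count = 0;
    while (ai < na && bi < nb) {
        int32_t x = a[ai];
        int32_t y = b[bi];
        count += (x == y);  /* conditional add on equality */
        ai    += (x <= y);  /* advance A when its element is not larger */
        bi    += (x >= y);  /* advance B when its element is not larger */
    }
    return count;
}
```

Every iteration executes the same three additions regardless of the input values, which trades a few extra arithmetic operations for the absence of mispredicted branches in the loop body.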
a) Vectorized Intersections: We start by describing how to increase the control flow. Vectorized triangle counting can be implemented in a variety of ways using the branch-avoiding model. In the first approach, the different lanes work together on the same intersection (consisting of two arrays). This approach was shown to be effective on the GPU's SIMT programming model [11], [13]; however, it is significantly more challenging to implement for vector instructions. In the second approach, each lane in the vector unit is responsible for a different intersection. Thus, each vector unit requires 2·K different adjacency arrays for K different intersections.
We choose the second of the approaches: each lane is responsible for a different intersection. This also removes the overhead of the partitioning scheme found in [13]. Thus, the maximal number of concurrent intersections (software threads) that can be executed on a system is Concurrent = P·K.
The branch-avoiding algorithm found in Alg. 2 depicts an initial “recipe” for implementing a vectorized triangle counting algorithm. Note that the number of data-dependent branches has been significantly reduced, yet there still remains one condition in the control flow that is data dependent: the WHILE loop's condition, which checks the bounds of the two respective arrays. To vectorize the algorithm, this condition also needs to be vectorized, and this is by no means trivial as this condition is responsible for the entire WHILE loop. The vectorized version of the algorithm ensures that certain data lanes will be ignored if the bounds of the indices for that lane are exceeded; each of the K lanes is responsible for managing its own bounds. Similar restrictions exist for the binary search based intersection (Alg. 4).
TABLE I
NETWORKS USED IN OUR EXPERIMENTS.
Name                 |V|           |E|
amazon0312           400,727       3,200,440
amazon0505           410,236       3,356,440
amazon0601           403,394       3,387,388
cit-HepTh            27,770        352,285
cit-Patents          3,774,768     16,518,947
email-EuAll          265,214       364,481
g500-s21-ef16        1,243,072     31,731,650
g500-s22-ef16        2,393,285     64,097,004
g500-s23-ef16        4,606,314     129,250,705
g500-s24-ef16        8,860,450     260,261,843
g500-s25-ef16        17,043,780    523,467,448
soc-Epinions1        75,879        405,740
soc-LiveJournal1     4,847,571     68,993,773
soc-Slashdot0811     77,360        469,180
soc-Slashdot0902     82,168        504,230
Alg. 3 and Alg. 4 depict the vectorized code for the sorted set intersection approach and for the binary search approach, respectively. These algorithms show close-to-real vector code (using Intel's AVX-512 instruction set) rather than pseudo-code. This allows highlighting the following:
The vectorized algorithms require gathering the elements for K intersections (using 2·K arrays) instead of just two arrays for a scalar execution. The introduction of efficient gather instructions has greatly simplified the process of collecting elements from random locations in memory.
The AVX-512 instruction set introduces masked vector instructions. These masks enable operating on a subset of the vector lanes and updating the counters for each lane. Masked instructions are not conditional instructions. Specifically, the masked instructions are always executed across all the lanes; however, some data lanes might not be updated based on the value of the mask. Another key difference is that conditional instructions were designed for a single control flow (one per thread), whereas the masked operations allow a vector-wide control flow (for multiple control flows). This distinction is the reason that a single conditional instruction is replaced with two masked instructions. For example, the CADDEQ operation is replaced with a vector CMPEQ instruction followed by a masked vector add operation. While this obviously incurs a performance penalty, it also enables increasing the scalability of the algorithm across the vector.
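The difference between masked and conditional execution can be modeled in portable C by running all K lanes every iteration and letting a per-lane mask bit decide whether a lane's state changes. The sketch below emulates the structure of Alg. 3 with a loop over lanes instead of AVX-512 intrinsics; the function name, the argument layout, and the choice to redirect inactive lanes to index 0 for the gather are assumptions made for illustration.

```c
#include <stdint.h>

#define K 16  /* lanes in a 512-bit vector of 32-bit integers */

/* Lane i intersects arr[a_pos[i] .. a_end[i]) with arr[b_pos[i] .. b_end[i]).
 * Both ranges must be sorted. On return, tris[i] holds the number of common
 * elements for lane i. a_pos and b_pos are advanced in place. */
static void intersect_lanes(const int32_t *arr,
                            int32_t a_pos[K], const int32_t a_end[K],
                            int32_t b_pos[K], const int32_t b_end[K],
                            int32_t tris[K])
{
    uint16_t cond = 0;  /* one bit per lane, set while the lane has work */
    for (int i = 0; i < K; ++i) {
        tris[i] = 0;
        cond |= (uint16_t)((a_pos[i] < a_end[i] && b_pos[i] < b_end[i]) << i);
    }

    while (cond) {
        uint16_t next = 0;
        for (int i = 0; i < K; ++i) {  /* every lane executes every step */
            int32_t on = (cond >> i) & 1;
            /* Model of the gather: inactive lanes read index 0 and are ignored. */
            int32_t x = arr[on * a_pos[i]];
            int32_t y = arr[on * b_pos[i]];
            tris[i]  += on & (x == y);  /* compare, then masked add */
            a_pos[i] += on & (x <= y);
            b_pos[i] += on & (x >= y);
            next |= (uint16_t)((on & (a_pos[i] < a_end[i])
                                   & (b_pos[i] < b_end[i])) << i);
        }
        cond = next;
    }
}
```

The loop runs until the slowest lane finishes, which is why placing K edges with similar estimated work into the same vector matters: a single long intersection would otherwise keep the other lanes idle.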
V. PERFORMANCE ANALYSIS
a) Experiment System: The experiments presented in this paper are primarily executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM memory (102.4 GB/s peak bandwidth). This processor is part of the Xeon Knights Landing series of processors. In addition to the main memory, the processor has an additional 16GB of MCDRAM high-bandwidth memory (400 GB/s peak bandwidth), which is used as our primary memory: if the graph fits into this memory, the lower latency DRAM memory is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMT). These cores run at a 1.3 GHz clock and share a 32MB L2 cache. Given these system parameters and using our new algorithms, we are able to execute up to 4352 concurrent intersections¹. We also provide results for a dual Intel Xeon 8160 Skylake processor, with 48 cores (96 threads with hyper-threading), 32 MB LLC, and 192GB of DDR4-2400 memory. All code, on both systems, is compiled with the Intel Compiler (icc) (version 2017).
TABLE II
DIFFERENT PARALLEL VARIATIONS OF OUR TRIANGLE COUNTING ALGORITHMS. DENOTES OUR FASTEST IMPLEMENTATION.
Algorithm name     Description
Mixed-EdgeList     Simple algorithm that selects intersection method based on edge properties.
lrb-scalar         Scalar implementation of our LRB load-balancing.
lrb-scalar-dod     Scalar implementation that includes the direction optimized graph.
lrb-vector         Vectorized (branch-avoiding) implementation of our LRB load-balancing.
lrb-vector-dod     Vectorized (branch-avoiding) implementation including the direction optimized graph.
b) Inputs: The algorithms are tested using real-world graphs and networks taken from SNAP [16] and the HPEC Graph Challenge [24] (Table I). By default, all graphs are treated as undirected. Directed graphs are transposed, and duplicate edges created in this phase are removed. Our algorithm can also utilize the optimization of finding triangles in a directed graph (where only half the edges exist). This concept is used in [23], [20], [19] and is referred to as the DOD graph in [19], which is the terminology we use in this paper.
c) LRB Analysis: Our algorithm implementation incorporates multiple optimizations. To capture the benefits of each of these optimizations, we execute our algorithm with several different optimizations enabled. Table II describes the various optimizations we use.
Fig. 3 depicts various performance characteristics of our new algorithms and the various optimizations for the soc-LiveJournal1 graph; similar results were seen for other graphs. Note that the abscissa is in log scale for all the sub-figures. Fig. 3 (a) depicts the execution time as a function of the number of threads, and Fig. 3 (b) depicts the speedup for each of these configurations in comparison with a sequential execution of a specific algorithm. For all these configurations, the parallel scalability is near linear all the way up to 68 threads, which is the number of physical cores on the KNL system used in our experiments. While there is some performance improvement beyond 68 threads, the scaling is not linear. This is a well-known artifact of running multiple threads per core when resources are shared. Yet, it also shows that LRB is successful as a load-balancing mechanism.
Fig. 3 (c) highlights the contributions of the different optimizations of our algorithm. For each thread count, all algorithms are normalized against the “Mixed edge-list” implementation at that thread count. Going from the scalar execution to the vectorized execution typically increases performance by roughly 2.5× for both the regular graph and the DOD graph. For other graphs, the vectorization increased performance by as much as 5×. Applying all these optimizations together greatly improves performance over an already optimized algorithm (that selects an ideal intersection kernel for each edge). Specifically, for soc-LiveJournal this improves performance by an average
¹ We note that parallelism may be limited in practice by the number of vector units. To the best of our knowledge, 4 threads (single core) share 2 VPUs [25].
counting algorithms, including several HPEC Graph Challenge champions. Our algorithm outperformed KOKKOS, an SpMV-based implementation that uses vector instructions and an HPEC Graph Challenge Champion, by an average of 2.5×. Our new algorithm is also up to 4× faster than the fastest algorithm for the GPU (running on an NVIDIA P100 GPU). There are numerous instances where our new algorithm is 5× to 10× faster than these algorithms.
ACKNOWLEDGMENTS
Funding was provided in part by the Defense Advanced Research Projects Agency (DARPA) under Contract Number FA8750-17-C-0086. This work was partially funded by the Doctoral Studies Program at Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.
REFERENCES
[1] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient Semi-streaming Algorithms for Local Triangle Counting in Massive Graphs,” in 14th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 2008, pp. 16–24.
[2] M. Bisson and M. Fatica, “Static graph challenge on gpu,” in High
Performance Extreme Computing Conference (HPEC), 2017 IEEE.
IEEE, 2017, pp. 1–8.
[3] S. Chu and J. Cheng, “Triangle listing in massive networks and its
applications,” in Proceedings of the 17th ACM SIGKDD Int’l Conf.
on Knowledge Discovery and Data Mining, 2011, pp. 672–680.
[4] J. Cohen, “Trusses: Cohesive Subgraphs for Social Network Analysis,” National Security Agency Technical Report, p. 16, 2008.
[5] J. Fox, O. Green, K. Gabert, X. An, and D. Bader, “Fast and Adaptive
List Intersections on the GPU,” in IEEE Proc. High Performance
Extreme Computing (HPEC), Waltham, MA, 2018.
[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” in OSDI, vol. 12, 2012.
[7] O. Green, “When Merging and Branch Predictors Collide,” in IEEE
Fourth Workshop on Irregular Applications: Architectures and Algo-
rithms, 2014, pp. 33–40.
[8] O. Green and D. Bader, “Faster Clustering Coefficients Using Vertex Covers,” in 5th ASE/IEEE International Conference on Social Computing, ser. SocialCom, 2013.
[9] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Al-
gorithms,” in 27th ACM on Symposium on Parallelism in Algorithms
and Architectures, 2015, pp. 212–223.
[10] O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia,
S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, and D. Bader,
“Quickly Finding a Truss in a Haystack,” in IEEE Proc. High
Performance Extreme Computing (HPEC), Waltham, MA, 2017.
[11] O. Green, R. McColl, and D. Bader, “GPU Merge Path: A GPU Merging Algorithm,” in 26th ACM International Conference on Supercomputing, 2012, pp. 331–340.
[12] O. Green, L. Munguia, and D. Bader, “Load Balanced Clustering Coefficients,” in ACM Workshop on Parallel Programming for Analytics Applications (PPAA), Feb. 2014.
[13] O. Green, P. Yalamanchili, and L. Munguía, “Fast Triangle Counting on the GPU,” in IEEE Fourth Workshop on Irregular Applications: Architectures and Algorithms, 2014, pp. 1–8.
[14] F. Khorasani, K. Vora, R. Gupta, and L. Bhuyan, “CuSha: Vertex-
Centric Graph Processing on GPUs,” in 23rd ACM Int’l Symp. on
High-Performance Parallel and Distributed Computing (HPDC), 2014,
pp. 239–252.
[15] A. Leist, K. Hawick, D. Playne, and N. S. Albany, “GPGPU and Multi-Core Architectures for Computing Clustering Coefficients of Irregular Graphs,” in Int’l Conf. on Scientific Computing (CSC’11), 2011.
[16] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection,” http://snap.stanford.edu/data.
[17] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of
algorithms for network community detection,” in Proceedings of the
19th Int’l Conf. on World Wide Web. ACM, 2010, pp. 631–640.
[18] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random Graph
Models of Social Networks,” Proceedings of the National Academy of
Sciences, vol. 99, no. suppl 1, pp. 2566–2572, 2002.
[19] R. Pearce, “Triangle counting for scale-free graphs at scale in dis-
tributed memory,” in High Performance Extreme Computing Confer-
ence (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–4.
[20] A. Polak, “Counting triangles in large graphs on GPU,” arXiv preprint
arXiv:1503.00576, 2015.
[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey, “Shaping Communities out of Triangles,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12, 2012, pp. 1677–1681.
[22] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,
P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and
J. Kepner, “Static Graph Challenge: Subgraph Isomorphism,” in IEEE
Proc. High Performance Extreme Computing (HPEC), Waltham, MA,
2017.
[23] J. Shun and K. Tangwongsan, “Multicore Triangle Computations
Without Tuning,” in IEEE Int’l Conf. on Data Engineering (ICDE),
2015.
[24] S. Siddharth, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,
P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and
J. Kepner, “Static graph challenge: Subgraph isomorphism,” in IEEE
Proc. High Performance Embedded Computing Workshop (HPEC),
Waltham, MA, 2017.
[25] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.
[26] T. Schank and D. Wagner, “Finding, Counting and Listing All Triangles in Large Graphs, an Experimental Study,” in Experimental & Efficient Algorithms. Springer, 2005, pp. 606–609.
[27] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman,
M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring
optimizations on shared-memory platforms for parallel triangle counting
algorithms,” in High Performance Extreme Computing Conference
(HPEC), 2017 IEEE. IEEE, 2017, pp. 1–7.
[28] J. Wang and J. Cheng, “Truss Decomposition in Massive Networks,”
Proceedings of the VLDB Endowment, vol. 5, no. 9, pp. 812–823,
2012.
[29] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study
on exact triangle counting algorithms on the GPU,” in Proceedings of
the ACM Workshop on High Performance Graph Processing. ACM,
2016, pp. 1–8.
[30] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’
Networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[31] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and
S. Rajamanickam, “Fast linear algebra-based triangle counting with
KokkosKernels,” in High Performance Extreme Computing Conference (HPEC),
2017 IEEE. IEEE, 2017, pp. 1–7.
[32] J. Yang and J. Leskovec, “Defining and evaluating network
communities based on ground-truth,” in Data Mining (ICDM), 2012 IEEE 12th
International Conference on. IEEE, 2012, pp. 745–754.