Logarithmic Radix Binning and Vectorized Triangle Counting
Oded Green, James Fox, Alex Watkins, Alok Tripathy, Kasimir Gabert,
Euna Kim, Xiaojing An, Kumar Aatish, and David A. Bader
Computational Science and Engineering, Georgia Institute of Technology - USA
Abstract: Triangle counting is a building block for numerous graph applications, and because graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting, some of which are compute-bound rather than memory-bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in part due to the high dependency between the control flow and the input data, and to the limited utilization of vector instructions. Not surprisingly, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model, we show how to remove control-flow restrictions by replacing branches with an equivalent set of arithmetic operations. Moreover, we show how these can be vectorized using Intel's AVX-512 instruction set, and that our new vectorized algorithms are 2x-5x faster than their scalar counterparts. We also present a new load-balancing method, Logarithmic Radix Binning (LRB), that ensures that the threads and the vector data lanes execute a near-equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions, such as the KOKKOS framework and a GPU-based algorithm, by anywhere from 1.5x and up to 14x.
I. INTRODUCTION
Triangle counting and enumeration is a widely used kernel
for numerous applications. These include clustering coeffi-
cients, community detection, email spam detection, Jaccard
indices, and finding maximal k-trusses. The key building
block of triangle counting is adjacency list intersection. Nu-
merous algorithms have been developed for triangle counting
and these encapsulate a wide range of programming models:
vertex centric, edge centric, gather-apply-scatter (GAS), and
linear algebra. Some approaches require the adjacency arrays
to be sorted whereas other approaches do not. Most previous
algorithms focus on scalar execution as vectorizing these
algorithms is typically challenging. In this paper, we show several new algorithms for vectorizing the computational kernels of triangle counting.
We also propose a new load-balancing technique, which we call Logarithmic Radix Binning (LRB), that ensures that all the threads and vector units get a near-equal amount of work. LRB improves on previous techniques, which focus only on thread-level load-balancing, by also balancing work at the vector-lane granularity, thereby improving vector unit utilization.
In this paper, we present several new algorithms for triangle counting. These algorithms use techniques developed for the branch-avoiding model discussed in Green et al. [9], [7]. Specifically, Green et al. show that the cost of branch mis-prediction is high for data-dependent applications (such as graph algorithms) and that these applications can typically be implemented in an alternative manner that avoids branches entirely. This eliminates mis-prediction and makes execution times more consistent, at the cost of additional operations.
Algorithmic Contributions
We develop a two-tiered binning mechanism for triangle counting. In the first tier, for each edge we decide on an intersection method that will be applied to that edge. Our current implementation uses two different kernels (though it can be extended to more); therefore, this tier consists of two bins. For the second tier, we present Logarithmic Radix Binning (LRB). For each intersection kernel, a 2D set of bins is maintained. Each of these bins stores edges with similar computational properties. Thus, our vectorized triangle counting algorithm can grab K edges, where K is the vector width, with similar computational requirements, allowing for good load-balancing and utilization at the vector-lane granularity. In practice, this offers good scalability.
We show how to increase the number of control flows in software. For a multi-threaded processor with P physical hardware threads and vector instructions K elements wide, we show how to execute up to P·K concurrent software threads, where each of these threads executes a different intersection. Thus, our new algorithm increases the control flow on the processor by a factor of K. For the Intel Knights Landing processor (discussed in Sec. V) used in our experiments, P = 272 and K = 16, which allows us to support up to 4352 concurrent software threads.
We show two novel vectorized triangle counting algorithms: 1) sorted set intersection, and 2) binary search. The first of these approaches finds common neighbors by scanning across the two sorted adjacency arrays being intersected, similar to a merge. The second approach finds common neighbors by searching for elements of one array in the other. We also show a runtime mechanism for deciding which algorithm should be selected. The LRB method is then used to ensure good execution.
LRB is architecture agnostic and has been extended to the
GPU [5]. On the GPU, different bins are executed using
a different number of threads, thus ensuring good load-
balancing and trading off various overheads.
Performance Contributions
We compare our new algorithm against several high-performing triangle counting algorithms, including [23], [31], [2], [27]. Several of these were the fastest triangle counting algorithms in the 2017 HPEC Graph Challenge [24]. Our algorithm is faster than all of these algorithms for all but a small number of instances across a wide range of test graphs. This includes outperforming KOKKOS [31] by an average of 4x and as much as 10x, and [2] by an average of 2x and as much as 6x.
From a scalability perspective, we show that our algorithm scales to large thread counts. This highlights the fact that LRB is frequently able to give each thread a near-equal amount of work and ensure good workload balance.
Our new vectorized algorithms offer a 2x-5x performance speedup over their scalar counterparts (which also use the LRB load-balancing mechanism).
II. RELATED WORK
The applications in which triangle counting and triangle listing are used are broad. Triangle counting became an important metric to data scientists with the introduction of clustering coefficients [30]. Other applications for triangle counting are: finding transitivity [18], spam detection in email networks [1], finding tightly knit communities [21], finding k-trusses [4], [28], [10], and evaluating the quality of different community detection algorithms [17], [32]. An extended discussion of triangle counting applications can also be found in [3]. Triangle counting was also a key kernel of the HPEC Graph Challenge [22].
a) Computational Approaches: Given a graph G = (V, E), where V is the set of vertices and E is the set of edges, the three simplest and most widely used approaches for counting triangles are [26]: enumerating over all node triplets in O(|V|^3), using linear algebra operations in O(|V|^w) (where w < 2.376), and adjacency list intersection. Adjacency list intersection can be completed in multiple ways: sorted set intersections, hash tables, or binary searches to look up values. The time complexity of each of these approaches is data dependent, yet upper bounds can be given. Triangle counting can also be completed using the Gather-Apply-Scatter (GAS) programming model [6], [14]. In this work we focus on the adjacency intersection approaches.
b) Algorithmic Optimization: Numerous computational optimizations can be applied to triangle counting algorithms for static graphs to help reduce the overall execution time. For example, Green & Bader [8] present a combinatorial optimization that reduces the number of necessary intersections, offering a better complexity bound. Green et al. [12] show a scalable technique for load-balancing triangle counting on shared-memory systems. Shun & Tangwongsan [23], Polak [20], and Pearce [19] show how to reduce the computational requirements by finding triangles in a directed graph rather than the undirected graph. Leist et al. [15], Green et al. [13], Wang et al. [29], and Fox et al. [5] show several different strategies for implementing triangle counting on the GPU.
c) Vector Instruction Sets: Vector instructions have been an integral part of commodity processors for the last twenty years, though the history of Single Instruction Multiple Data (SIMD) programs goes back even further. In a SIMD instruction, each datum is placed in a separate lane and each lane executes the same instruction. Intel's AVX-512 ISA can operate on vectors of 512 bits. The AVX-512 instruction set has numerous conditional instructions, referred to as masks in the AVX-512 ISA. Our algorithm makes extensive use of these instructions.
III. LOGARITHMIC RADIX BINNING
In this section, we present Logarithmic Radix Binning (LRB for short), a method that effectively load balances the intersections across the threads, for both intersection algorithms. This method is efficient and works well for both the scalar and the vectorized algorithms. LRB works by placing edges into bins based on the logarithmic value of the estimated amount of work for that edge. For triangle counting, we initially group the edges into two unique bins, one bin for the sorted set intersection and one for the binary search. We then apply LRB to each of the edges and distribute them over a 2D grid of bins.
a) Initial binning: The intersection of the adjacency arrays can be done in multiple ways. Two popular approaches are 1) sorted set intersection and 2) binary search. In the sorted set intersection, common elements are found by moving across the two sorted adjacency arrays using a merge-like access pattern. Sorted set intersection performs well when the two adjacency arrays are of near-equal length; we call this a balanced intersection. However, when one adjacency array is extremely large and the other is small, i.e., the intersection is imbalanced, binary search is more efficient. For binary search, each element in the smaller array is looked up in the larger of the two arrays.
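For reference, a minimal scalar sketch of the two kernels (not taken from the paper's implementation; the function and array names are ours), assuming sorted adjacency arrays A and B:

/* Scalar sorted set intersection: merge-like scan over both arrays. */
int count_merge(const int *A, int lenA, const int *B, int lenB) {
    int ai = 0, bi = 0, count = 0;
    while (ai < lenA && bi < lenB) {
        if (A[ai] == B[bi]) { count++; ai++; bi++; }
        else if (A[ai] < B[bi]) ai++;
        else bi++;
    }
    return count;
}

/* Scalar binary-search intersection: look up each element of the
 * smaller array A in the larger array B. */
int count_binary(const int *A, int lenA, const int *B, int lenB) {
    int count = 0;
    for (int i = 0; i < lenA; i++) {
        int low = 0, high = lenB - 1;
        while (low <= high) {
            int mid = (low + high) / 2;  /* array lengths are small enough to avoid overflow */
            if (B[mid] == A[i]) { count++; break; }
            else if (B[mid] < A[i]) low = mid + 1;
            else high = mid - 1;
        }
    }
    return count;
}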
To determine which bin to place an edge (u, v) in E into, we use the following estimations:

    IntersectionWork(u, v) = d_u + d_v          (1)
    BinaryWork(u, v) = d_u · log(d_v)           (2)

Intersecting (u, v) and (v, u) will find the same triangles; as such, only one of these is necessary. For simplicity and performance reasons, we choose the edge (u, v) such that d_u < d_v. If d_u = d_v, we select the vertex with the smaller id. The intersection method selected is based on the minimum of Eq. (1) and Eq. (2).
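As a sketch, the per-edge kernel selection can be written directly from Eq. (1) and Eq. (2); the function name is ours, the degree array d is assumed, and the base of the logarithm (base 2 here) is our assumption since the text does not specify it:

#include <math.h>

/* Returns 0 for the sorted-set-intersection bin, 1 for the binary-search bin,
 * assuming the edge is already oriented so that d[u] <= d[v]. */
int select_kernel(int u, int v, const int *d) {
    double intersection_work = (double)d[u] + (double)d[v];        /* Eq. (1) */
    double binary_work = (double)d[u] * log2((double)d[v]);        /* Eq. (2), log base assumed */
    return (binary_work < intersection_work) ? 1 : 0;
}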
b) Finer grain binning: Fig. 1 depicts an edge list with the estimated amount of work for each edge. For each edge, the method with the minimal amount of estimated work is selected using Eq. 1 and Eq. 2. The yellow boxes denote edges that use the sorted set intersection approach and the blue boxes represent edges that use the binary search. The second row represents an edge ordering where the edges are placed into two bins, one for each approach. For the vectorized algorithm, this can lead to significant workload imbalance.
Algorithm 2: Branch-avoiding with conditional instructions. Variants of the branch-based and branch-avoiding algorithms can be found in [7].

TriCountingBranchAvoidingConditionals()
  ai <- 0; bi <- 0; count <- 0;
  while (ai < |A| and bi < |B|) do
    CMP(A[ai], B[bi]);
    CADDEQ(count);   // Conditional ADD if (A[ai] = B[bi])
    CADDLEQ(ai);     // Conditional ADD if (A[ai] <= B[bi])
    CADDGEQ(bi);     // Conditional ADD if (A[ai] >= B[bi])
Algorithm 3: Vectorized sorted intersection kernel

while (cond) {
  AVec = _mm512_i32gather_epi32(indexA, EArr, 4);
  BVec = _mm512_i32gather_epi32(indexB, EArr, 4);
  cmpLeVec = _mm512_mask_cmple_epi32_mask(cond, AVec, BVec);
  cmpGeVec = _mm512_mask_cmpge_epi32_mask(cond, AVec, BVec);
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, AVec, BVec);
  tris   = _mm512_mask_add_epi32(tris, cmpEqVec, tris, mione32);
  indexA = _mm512_mask_add_epi32(indexA, cmpLeVec, indexA, mione32);
  indexB = _mm512_mask_add_epi32(indexB, cmpGeVec, indexB, mione32);
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexAStop, indexA);
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexBStop, indexB);
}
the threads get an equal amount of work. This will become evident in Section V, where the scaling of the algorithm is almost perfectly linear.
f) Time Complexity Analysis: Phase 1: finding the proper bin takes O(|E|) steps for evaluating the |E| edges. Phase 2: O(B^2) to compute the prefix matrices. Phase 3: an additional O(|E|) steps to reorder the edge list. Phase 1 and Phase 3 are embarrassingly parallel and easily split across the P threads. Phase 2 can also be done in parallel; however, the cost of the prefix operation on arrays of size B^2 is relatively small in comparison to the other two phases, and a sequential implementation is sufficient. Recall that B is small (32 or 64) and that B^2 << |E|. The total time complexity is O(|E| + B^2) = O(|E|).
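A simplified sketch of the three phases (binning, prefix sum, reorder) for a single intersection kernel follows. The paper's 2D grid of B x B bins is collapsed here into a 1D array of logarithmic bins for brevity, and all names are illustrative rather than taken from the implementation:

#include <stdlib.h>

#define NUM_BINS 64                       /* B in the text */

typedef struct { int u, v; } edge_t;

/* Bin index = position of the highest set bit of the estimated work. */
static int log2_bin(long long work) {
    int b = 0;
    while (work >>= 1) b++;
    return (b < NUM_BINS) ? b : NUM_BINS - 1;
}

void lrb_reorder(const edge_t *edges, long long num_edges,
                 const long long *work, edge_t *out) {
    long long counts[NUM_BINS] = {0}, offsets[NUM_BINS];

    for (long long e = 0; e < num_edges; e++)       /* Phase 1: O(|E|) binning */
        counts[log2_bin(work[e])]++;

    long long sum = 0;                              /* Phase 2: prefix sum (O(B) here) */
    for (int b = 0; b < NUM_BINS; b++) { offsets[b] = sum; sum += counts[b]; }

    for (long long e = 0; e < num_edges; e++)       /* Phase 3: O(|E|) reorder */
        out[offsets[log2_bin(work[e])]++] = edges[e];
}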
g) Storage Complexity Analysis: We use CSR (compressed sparse row) to represent the original graph and assume that the adjacency arrays are sorted. For triangle counting, CSR requires two arrays: one for the offsets, of size O(|V|), and one for the indices (edges), of size O(|E|). The binning technique used by LRB stores the edges in a different order. We use an array of size O(|E|) to store these edges. While this new edge list does not increase the theoretical upper bound, from a practical perspective it does double the memory consumption. The edges in the new reordered edge list determine the order in which the edges will be intersected. However, the intersection process itself uses the sorted adjacency arrays in the CSR graph.
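A minimal sketch of the storage layout described above; the type and field names are ours:

typedef struct {
    long long num_vertices;
    long long num_edges;
    long long *offsets;  /* size |V|+1: adjacency of v is indices[offsets[v] .. offsets[v+1]) */
    int       *indices;  /* size |E|: sorted adjacency arrays, read by the intersection kernels */
} csr_graph_t;

/* The LRB-reordered edge list (extra O(|E|) storage) only fixes the order in
 * which edges are intersected; the intersections read the CSR arrays above. */
typedef struct { int u, v; } lrb_edge_t;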
IV. VECTORIZED ALGORITHMS
In this section we present our new branch-avoiding and vectorized triangle counting algorithms. Alg. 2 depicts the sorted list intersection using the branch-avoiding programming model [9], [7]. Note that the control flow of the branch-avoiding algorithms is largely independent of the input values; this is a key enabling factor for our vectorized approach. See Green et al. [9], [7] for additional discussion of the branch-avoiding programming model.
Algorithm 4: Vectorized binary search kernel

while (cond) {
  sumVec = _mm512_add_epi32(low, high);
  middle = _mm512_maskz_srl_epi32(cond, sumVec, oneShifter);
  vals = _mm512_mask_i32gather_epi32(vals, cond, middle, EArr, 4);
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, vals, keys);
  cmpLtVec = _mm512_mask_cmplt_epi32_mask(cond, vals, keys);
  cmpGtVec = _mm512_mask_cmpgt_epi32_mask(cond, vals, keys);
  tris = _mm512_mask_add_epi32(tris, cmpEqVec, tris, mione32);
  low  = _mm512_mask_add_epi32(low, cmpLtVec, middle, mione32);
  high = _mm512_mask_add_epi32(high, cmpGtVec, middle, miMone32);
  cond = _mm512_mask_cmpge_epi32_mask(cond & ~cmpEqVec, high, low);
}
Alg. 2 depicts a branch-avoiding algorithm for list intersection; this algorithm uses conditional instructions. Such instructions do not exist in all architectures and are in fact designed for single-control-flow systems, where a single instruction is executed based on hardware flags (zero-flag, carry-flag, and overflow-flag). As such, a naive vector implementation might be constrained to a single set of these flags or would require a single control flow. We show how to overcome this hardware constraint using the AVX-512 instruction set. Specifically, we show 1) how to increase the number of software control flows and 2) how to control the execution of each lane using masks (even though we do not have enough hardware flags).
Our experience with conditional instructions is that the compiler is not always able to use them automatically. We strongly differentiate between compare and branch instructions: compare instructions are used for evaluating different values, whereas a branch typically uses the compare output to decide on the next sequence of executable instructions.
a) Vectorized Intersections: We start by describing how to increase the control flow. The vectorized triangle counting can be implemented in a variety of ways using the branch-avoiding model. In the first approach, the different lanes work together on the same intersection (consisting of two arrays). This approach was shown to be effective on the GPU's SIMT programming model [11], [13]; however, it is significantly more challenging to implement for vector instructions. In the second approach, each lane in the vector unit is responsible for a different intersection. Thus, each vector unit requires 2·K different adjacency arrays for K different intersections.
We choose the second of these approaches: each lane is responsible for a different intersection. This also removes the overhead of the partitioning scheme found in [13]. Thus, the maximal number of concurrent intersections (software threads) that can be executed on a system is Concurrent = P·K.
The branch-avoiding algorithm found in Alg. 2 depicts an initial "recipe" for implementing a vectorized triangle counting algorithm. Note that the number of data-dependent branches has been significantly reduced, yet there still remains one condition in the control flow that is data dependent: the WHILE loop's condition, which checks the bounds of the two respective arrays. To vectorize the algorithm, this condition also needs to be vectorized, and this is by no means trivial, as this condition governs the entire WHILE loop. The vectorized version of the algorithm
ensures that certain data lanes will be ignored if the bounds of the indices for that lane are exceeded; each of the K lanes is responsible for managing its own bounds. Similar restrictions exist for the binary search based intersection (Alg. 4).

TABLE I
NETWORKS USED IN OUR EXPERIMENTS.

Name               |V|          |E|
amazon0312         400,727      3,200,440
amazon0505         410,236      3,356,440
amazon0601         403,394      3,387,388
cit-HepTh          27,770       352,285
cit-Patents        3,774,768    16,518,947
email-EuAll        265,214      364,481
g500-s21-ef16      1,243,072    31,731,650
g500-s22-ef16      2,393,285    64,097,004
g500-s23-ef16      4,606,314    129,250,705
g500-s24-ef16      8,860,450    260,261,843
g500-s25-ef16      17,043,780   523,467,448
soc-Epinions1      75,879       405,740
soc-LiveJournal1   4,847,571    68,993,773
soc-Slashdot0811   77,360       469,180
soc-Slashdot0902   82,168       504,230
Alg. 3 and Alg. 4 depict the vectorized code for the sorted set intersection approach and for the binary search approach, respectively. These algorithms show close-to-real vector code (using Intel's AVX-512 instruction set) rather than pseudo-code. This allows us to highlight the following.
The vectorized algorithms require gathering the elements for K intersections (using 2·K arrays) instead of just two arrays for a scalar execution. The introduction of efficient gather instructions has greatly simplified the process of collecting elements from random locations in memory.
The AVX-512 instruction set introduces masked vector instructions. These masks enable operating on a subset of the vector lanes and updating the counters for each lane. Masked instructions are not conditional instructions. Specifically, the masked instructions are always executed across all the lanes; however, some data lanes might not be updated, based on the value of the mask. Another key difference is that conditional instructions were designed for a single control flow (one per thread), whereas the masked operations allow a vector-wide control flow (for multiple control flows). This distinction is the reason that a single conditional instruction is replaced with two masked instructions. For example, the CADDEQ operation is replaced with a vector CMPEQ instruction followed by a masked vector add operation. While this obviously incurs a performance penalty, it also enables increasing the scalability of the algorithm across the vector.
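For instance, the CADDEQ replacement described above can be sketched with AVX-512 intrinsics as follows (variable naming follows Alg. 3; the helper function itself is ours):

#include <immintrin.h>

/* CADDEQ(count) expressed as two masked-vector instructions: a compare that
 * produces a mask, followed by a masked add that increments only the lanes
 * where A[ai] == B[bi] (and whose bounds, tracked in cond, are still valid). */
static inline __m512i caddeq_vec(__mmask16 cond, __m512i AVec, __m512i BVec,
                                 __m512i tris) {
    const __m512i one = _mm512_set1_epi32(1);
    __mmask16 cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, AVec, BVec);
    return _mm512_mask_add_epi32(tris, cmpEqVec, tris, one);
}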
V. PERFORMANCE ANALYSIS
a) Experiment System: The experiments presented in this paper are primarily executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM (102.4 GB/s peak bandwidth). This processor is part of the Xeon Phi Knights Landing series of processors. In addition to the main memory, the processor has 16GB of MCDRAM high-bandwidth memory (400 GB/s peak bandwidth), which is used as our primary memory; if the graph fits into this memory, the lower-latency DRAM is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMT). These cores run at a 1.3 GHz clock and share a 32MB L2 cache. Given these system parameters and using our new algorithms, we are able to execute up to 4352
concurrent intersections.¹ We also provide results for a dual-socket Intel Xeon 8160 (Skylake) system, with 48 cores (96 threads with hyper-threading), 32 MB LLC, and 192GB of DDR4-2400 memory. All code, on both systems, is compiled with the Intel Compiler (icc, version 2017).

TABLE II
DIFFERENT PARALLEL VARIATIONS OF OUR TRIANGLE COUNTING ALGORITHMS.

Algorithm name   Description
Mixed-EdgeList   Simple algorithm that selects the intersection method based on edge properties.
lrb-scalar       Scalar implementation of our LRB load-balancing.
lrb-scalar-dod   Scalar implementation that includes the direction-optimized (DOD) graph.
lrb-vector       Vectorized (branch-avoiding) implementation of our LRB load-balancing.
lrb-vector-dod   Vectorized (branch-avoiding) implementation including the direction-optimized (DOD) graph.
b) Inputs: The algorithms are tested using real-world graphs and networks taken from SNAP [16] and the HPEC Graph Challenge [24]; see Table I. By default, all graphs are treated as undirected. Directed graphs are transposed, and duplicate edges created in this phase are removed. Our algorithm can also utilize the optimization of finding triangles in a directed graph (where only half the edges exist). This concept is used in [23], [20], [19] and is referred to as the DOD graph in [19], which is the terminology we use in this paper.
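A sketch of the edge-orientation rule, matching the d_u < d_v convention from Section III (the DOD constructions in [23], [20], [19] may differ in their details; the function name is ours):

/* Keep edge (u, v) in the directed (DOD) graph only if it points from the
 * lower-degree endpoint toward the higher-degree one, with ties broken by
 * vertex id, so that each undirected edge is intersected exactly once. */
int keep_directed_edge(int u, int v, const long long *degree) {
    if (degree[u] != degree[v]) return degree[u] < degree[v];
    return u < v;
}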
c) LRB Analysis: Our algorithm implementation incorporates multiple optimizations. To capture the benefits of each of these optimizations, we execute our algorithm in several different configurations. Table II describes the various optimizations we use.
Fig. 3 depicts various performance characteristics of our new algorithms and the various optimizations for the soc-LiveJournal1 graph; similar results were seen for other graphs. Note that the abscissa is log scale for all the sub-figures. Fig. 3 (a) depicts the execution time as a function of the number of threads, and Fig. 3 (b) depicts the speedup for each of these configurations in comparison with a sequential execution of a specific algorithm. For all these configurations, the parallel scalability is near linear all the way up to 68 threads, which is the number of physical cores on the KNL system used in our experiments. While there is some performance improvement beyond 68 threads, the scaling is not linear. This is a well-known artifact of multiple threads per core when resources are shared. Yet, it also shows that LRB is successful as a load-balancing mechanism.
Fig. 3 (c) highlights the contributions of the different optimizations of our algorithm. For each thread count, all algorithms are normalized against the "Mixed edge-list" implementation for that thread count. Going from the scalar execution to the vectorized execution typically increases performance by roughly 2.5x, for both the regular graph and the DOD graph. For other graphs, the vectorization increased performance by as much as 5x. Applying all these optimizations together greatly improves performance over an already optimized algorithm (that selects an ideal intersection kernel for each edge). Specifically, for soc-LiveJournal this improves performance by an average
¹We note that parallelism may be limited in practice by the number of vector units. To the best of our knowledge, 4 threads (on a single core) share 2 VPUs [25].
We compared our new algorithm against a wide range of triangle counting algorithms, including several HPEC Graph Challenge champions. On average, our algorithm outperformed KOKKOS, an SpMV-based, vector-instruction HPEC Graph Challenge Champion implementation, by 2.5x. Our new algorithm is also up to 4x faster than the fastest algorithm for the GPU (running on an NVIDIA P100 GPU). There are numerous instances where our new algorithm is 5x-10x faster than these algorithms.
ACKNOWLEDGMENTS
Funding was provided in part by the Defense Advanced
Research Projects Agency (DARPA) under Contract Number
FA8750-17-C-0086. This work was partially funded by the
Doctoral Studies Program at Sandia National Laboratories.
Sandia National Laboratories is a multimission laboratory
managed and operated by National Technology & Engineer-
ing Solutions of Sandia, LLC, a wholly owned subsidiary
of Honeywell International Inc., for the U.S. Department
of Energy’s National Nuclear Security Administration under
contract DE-NA0003525. The content of the information in
this document does not necessarily reflect the position or
the policy of the Government, and no official endorsement
should be inferred. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation here on. The authors
acknowledge the Texas Advanced Computing Center (TACC)
at The University of Texas at Austin for providing HPC re-
sources that have contributed to the research results reported
within this paper.
REFERENCES
[1] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient Semi-
streaming Algorithms for Local Triangle Counting in Massive
Graphs,” in 14th ACM SIGKDD Int’l Conf. on Knowledge Discovery
and Data Mining, 2008, pp. 16–24.
[2] M. Bisson and M. Fatica, “Static graph challenge on gpu,” in High
Performance Extreme Computing Conference (HPEC), 2017 IEEE.
IEEE, 2017, pp. 1–8.
[3] S. Chu and J. Cheng, “Triangle listing in massive networks and its
applications,” in Proceedings of the 17th ACM SIGKDD Int’l Conf.
on Knowledge Discovery and Data Mining, 2011, pp. 672–680.
[4] J. Cohen, “Trusses: Cohesive Subgraphs for Social Network Analysis,” National Security Agency Technical Report, p. 16, 2008.
[5] J. Fox, O. Green, K. Gabert, X. An, and D. Bader, “Fast and Adaptive
List Intersections on the GPU,” in IEEE Proc. High Performance
Extreme Computing (HPEC), Waltham, MA, 2018.
[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” in OSDI, vol. 12, 2012.
[7] O. Green, “When Merging and Branch Predictors Collide,” in IEEE
Fourth Workshop on Irregular Applications: Architectures and Algo-
rithms, 2014, pp. 33–40.
[8] O. Green and D. Bader, “Faster Clustering Coefficients Using Vertex Covers,” in 5th ASE/IEEE International Conference on Social Computing, ser. SocialCom, 2013.
[9] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Al-
gorithms,” in 27th ACM on Symposium on Parallelism in Algorithms
and Architectures, 2015, pp. 212–223.
[10] O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, and D. Bader, “Quickly Finding a Truss in a Haystack,” in IEEE Proc. High Performance Extreme Computing (HPEC), Waltham, MA, 2017.
[11] O. Green, R. McColl, and D. Bader., “GPU Merge Path: A GPU
Merging Algorithm,” in 26th ACM International Conference on Su-
percomputing, 2012, pp. 331–340.
[12] O. Green, L. Munguia, and D. Bader, “Load Balanced Clustering Coefficients,” in ACM Workshop on Parallel Programming for Analytics Applications (PPAA), Feb. 2014.
[13] O. Green, P. Yalamanchili, and L. Munguía, “Fast Triangle Counting on the GPU,” in IEEE Fourth Workshop on Irregular Applications: Architectures and Algorithms, 2014, pp. 1–8.
[14] F. Khorasani, K. Vora, R. Gupta, and L. Bhuyan, “CuSha: Vertex-
Centric Graph Processing on GPUs,” in 23rd ACM Int’l Symp. on
High-Performance Parallel and Distributed Computing (HPDC), 2014,
pp. 239–252.
[15] A. Leist, K. Hawick, D. Playne, and N. S. Albany, “GPGPU and Multi-
Core Architectures for Computing Clustering Coefficients of Irregular
Graphs,” in Int’l Conf. on Scientific Computing (CSC’11), 2011.
[16] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection,” http://snap.stanford.edu/data.
[17] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of
algorithms for network community detection,” in Proceedings of the
19th Int’l Conf. on World Wide Web. ACM, 2010, pp. 631–640.
[18] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random Graph
Models of Social Networks,” Proceedings of the National Academy of
Sciences, vol. 99, no. suppl 1, pp. 2566–2572, 2002.
[19] R. Pearce, “Triangle counting for scale-free graphs at scale in distributed memory,” in High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–4.
[20] A. Polak, “Counting triangles in large graphs on GPU,” arXiv preprint
arXiv:1503.00576, 2015.
[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey, “Shaping Communities out of Triangles,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12, 2012, pp. 1677–1681.
[22] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and J. Kepner, “Static Graph Challenge: Subgraph Isomorphism,” in IEEE Proc. High Performance Extreme Computing (HPEC), Waltham, MA, 2017.
[23] J. Shun and K. Tangwongsan, “Multicore Triangle Computations Without Tuning,” in IEEE Int’l Conf. on Data Engineering (ICDE), 2015.
[24] S. Siddharth, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and J. Kepner, “Static graph challenge: Subgraph isomorphism,” in IEEE Proc. High Performance Embedded Computing Workshop (HPEC), Waltham, MA, 2017.
[25] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.
[26] T. Schank and D. Wagner, “Finding, Counting and Listing All Tri-
angles in Large Graphs, an Experimental Study,” in Experimental &
Efficient Algorithms. Springer, 2005, pp. 606–609.
[27] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman,
M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring opti-
mizations on shared-memory platforms for parallel triangle counting
algorithms,” in High Performance Extreme Computing Conference
(HPEC), 2017 IEEE. IEEE, 2017, pp. 1–7.
[28] J. Wang and J. Cheng, “Truss Decomposition in Massive Networks,” Proceedings of the VLDB Endowment, vol. 5, no. 9, pp. 812–823, 2012.
[29] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study
on exact triangle counting algorithms on the gpu,” in Proceedings of
the ACM Workshop on High Performance Graph Processing. ACM,
2016, pp. 1–8.
[30] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’
Networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[31] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Rajamanickam, “Fast linear algebra-based triangle counting with KokkosKernels,” in High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–7.
[32] J. Yang and J. Leskovec, “Defining and evaluating network communi-
ties based on ground-truth,” in Data Mining (ICDM), 2012 IEEE 12th
International Conference on. IEEE, 2012, pp. 745–754.