Logarithmic Radix Binning and Vectorized Triangle Counting
Oded Green, James Fox, Alex Watkins, Alok Tripathy, Kasimir Gabert,
Euna Kim, Xiaojing An, Kumar Aatish, and David A. Bader
Computational Science and Engineering, Georgia Institute of Technology - USA
Abstract— Triangle counting is a building block for numerous graph applications, and given that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting, some of which are compute-bound rather than memory-bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in part due to the high dependency between the control flow and the input data, and the limited utilization of vector instructions. Not surprisingly, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model, we show how to remove control-flow restrictions by replacing branches with an equivalent set of arithmetic operations. Moreover, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2x-5x faster than their scalar counterparts. We also present a new load-balancing method, Logarithmic Radix Binning (LRB), which ensures that threads and vector data lanes execute a near equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge champions, such as the KOKKOS framework and a GPU-based algorithm, by anywhere from 1.5x up to 14x.
I. INTRODUCTION
Triangle counting and enumeration is a widely used kernel for numerous applications. These include clustering coefficients, community detection, email spam detection, Jaccard indices, and finding maximal k-trusses. The key building block of triangle counting is adjacency list intersection. Numerous algorithms have been developed for triangle counting, and these encapsulate a wide range of programming models: vertex centric, edge centric, gather-apply-scatter (GAS), and linear algebra. Some approaches require the adjacency arrays to be sorted whereas others do not. Most previous algorithms focus on scalar execution, as vectorizing these algorithms is typically challenging. In this paper, we show several new algorithms for vectorizing the computational kernels of triangle counting.
We also propose a new load-balancing technique, which we call Logarithmic Radix Binning (LRB), that ensures that all the threads and vector units get a near equal amount of work. LRB improves on previous techniques, which only focus on thread-level load-balancing, by also balancing work at the vector-lane granularity, thereby improving vector unit utilization.
In this paper, we present several new algorithms for triangle counting. These algorithms use techniques developed for the branch-avoiding model discussed in Green et al. [9], [7]. Specifically, Green et al. show that the cost of branch misprediction is high for data-dependent applications (such as graph algorithms) and that these applications can typically be implemented in an alternative manner that avoids branches entirely. This eliminates misprediction and makes execution times more consistent, at the cost of additional operations.
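To make the model concrete, the following is a minimal scalar sketch of a branch-avoiding sorted-set intersection in C, in the spirit of [9], [7]; the function name and signature are ours, for illustration only. The comparisons feed arithmetic updates rather than branches, so the loop body executes the same instructions every iteration; only the loop bound check remains data dependent, which is exactly the condition Sec. IV vectorizes with masks.

  #include <stddef.h>
  #include <stdint.h>

  /* Branch-avoiding sorted-set intersection (illustrative sketch).
   * The comparison results (0 or 1) drive arithmetic updates
   * instead of conditional branches.                             */
  int64_t intersect_branch_avoiding(const int32_t *A, size_t lenA,
                                    const int32_t *B, size_t lenB) {
      size_t ai = 0, bi = 0;
      int64_t count = 0;
      while (ai < lenA && bi < lenB) {
          int32_t a = A[ai], b = B[bi];
          count += (a == b);  /* CADDEQ: count a common neighbor */
          ai    += (a <= b);  /* CADDLEQ: advance A's cursor     */
          bi    += (a >= b);  /* CADDGEQ: advance B's cursor     */
      }
      return count;
  }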
Algorithmic Contributions
•We develop a two-tiered binning mechanism for triangle counting. In the first tier, for each edge we decide on an intersection method that will be applied to that edge. Our current implementation uses two different kernels (though it can be extended to more); therefore, this tier consists of two bins. For the second tier, we present Logarithmic Radix Binning (LRB). For each intersection kernel, a 2D set of bins is maintained, where each bin stores edges with similar computational properties. Thus, our vectorized triangle counting algorithm can grab K edges, where K is the vector width, with similar computational requirements, allowing for good load-balancing and utilization at the vector-lane granularity. In practice, this offers good scalability.
•We show how to increase the number of control flows in software. For a multi-threaded processor with P physical hardware threads and vector instructions K elements wide, we show how to execute up to P·K concurrent software threads, where each of these threads executes a different intersection. Thus, our new algorithm increases the control flow on the processor by a factor of K. For the Intel Knights Landing processor (discussed in Sec. V) used in our experiments, P = 272 and K = 16, allowing us to support up to 4352 concurrent software threads.
•We show two novel vectorized triangle counting algorithms: 1) sorted set intersection and 2) binary search. The first of these approaches finds common neighbors by scanning across the two sorted adjacency arrays being intersected, similar to a merge. The second approach finds common neighbors by searching for elements of one array in the other. We also show a runtime mechanism for deciding which algorithm should be selected. The LRB method is then used to ensure good execution.
•LRB is architecture agnostic and has been extended to the GPU [5]. On the GPU, different bins are executed using different numbers of threads, thus ensuring good load-balancing and trading off various overheads.
Performance Contributions
•We compare our new algorithm against several high-performing triangle counting algorithms, including [23], [31], [2], [27]. Several of these were the fastest triangle counting algorithms in the 2017 HPEC Graph Challenge [24]. Our algorithm is faster than all of these algorithms for all but a small number of instances across a wide range of test graphs. This includes outperforming KOKKOS [31] by an average of 4x and as much as 10x, and [2] by an average of 2x and as much as 6x.
•From a scalability perspective, we show that our algorithm scales to large thread counts. This highlights the fact that LRB is frequently able to give each thread a near equal amount of work and ensure good workload balance.
•Our new vectorized algorithms offer a 2x-5x speedup over their scalar counterparts (which also use the LRB load-balancing mechanism).
II. RELATED WORK
The applications in which triangle counting and triangle listing are used are broad. Triangle counting became an important metric to data scientists with the introduction of clustering coefficients [30]. Other applications of triangle counting include finding transitivity [18], spam detection in email networks [1], finding tightly knit communities [21], finding k-trusses [4], [28], [10], and evaluating the quality of different community detection algorithms [17], [32]. An extended discussion of triangle counting applications can be found in [3]. Triangle counting was also a key kernel in the HPEC Graph Challenge [22].
a) Computational Approaches: Given a graph G = (V, E), where V is the set of vertices and E is the set of edges, the three simplest and most widely used approaches for counting triangles are [26]: enumerating over all node triplets in O(|V|^3), using linear algebra operations in O(|V|^w) (where w < 2.376), and adjacency list intersection. Adjacency list intersection can be completed in multiple ways: sorted set intersections, hash tables, or binary searches to look up values. The time complexity of each of these approaches is data dependent, yet upper bounds can be given. Triangle counting can also be completed using the Gather-Apply-Scatter (GAS) programming model [6], [14]. In this work we focus on the adjacency intersection approaches.
b) Algorithmic Optimization: Numerous computational optimizations can be applied to triangle counting algorithms for static graphs to help reduce the overall execution time. For example, Green & Bader [8] present a combinatorial optimization that reduces the number of necessary intersections, offering a better complexity bound. Green et al. [12] show a scalable technique for load-balancing triangle counting on shared-memory systems. Shun & Tangwongsan [23], Polak [20], and Pearce [19] show how to reduce the computational requirements by finding triangles in a directed graph rather than the undirected graph. Leist et al. [15], Green et al. [13], Wang et al. [29], and Fox et al. [5] show several different strategies for implementing triangle counting on the GPU.
c) Vector Instruction Sets: Vector instructions have been an integral part of commodity processors for the last twenty years, though the history of Single Instruction Multiple Data (SIMD) programs goes back even further. In a SIMD instruction, each datum is placed in a separate lane and each lane executes the same instruction. Intel's AVX-512 ISA can operate on vectors of 512 bits. The AVX-512 instruction set has numerous conditional instructions, referred to as masks in the AVX-512 ISA. Our algorithm makes extensive use of these instructions.
III. LOGARITHMIC RADIX BINNING
In this section, we present Logarithmic Radix Binning (LRB for short), a method that effectively load balances the intersections across the threads, for both intersection algorithms. This method is efficient and works well for both the scalar and vectorized algorithms. LRB works by placing edges into bins based on the logarithm of the estimated amount of work for that edge. For triangle counting, we initially group the edges into two unique bins, one bin for the sorted set intersection and one for the binary search. We then apply LRB to each of the edges and distribute them over a 2D grid of bins.
a) Initial binning: The intersection of the adjacency arrays can be done in multiple ways. Two popular approaches are 1) sorted set intersection and 2) binary search. In the sorted set intersection, common elements are found by moving across the two sorted adjacency arrays using a merge-like access pattern. Sorted set intersection performs well when the two adjacency arrays are of near equal length; we call this a balanced intersection. However, when one adjacency array is extremely large and the other is small, i.e., the intersection is imbalanced, binary search is more efficient. For binary search, each element in the smaller array is looked up in the larger of the two arrays.
To determine which bin to place an edge (u, v) ∈ E into, we use the following work estimates:

  IntersectionWork(u, v) = d_u + d_v        (1)
  BinaryWork(u, v) = d_u · log(d_v).        (2)

Intersecting (u, v) and (v, u) will find the same triangles, so only one of these intersections is necessary. For simplicity and performance reasons, we choose the edge (u, v) such that d_u < d_v; if d_u = d_v, we select the vertex with the smaller id. The intersection method selected is the one with the minimum of Eq. (1) and Eq. (2).
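As an illustration, a minimal first-tier selection in C might look as follows; we assume a base-2 logarithm and that the edge has already been oriented so that d_u ≤ d_v (the function and variable names are ours).

  #include <math.h>
  #include <stdint.h>

  enum Kernel { SORTED_SET, BINARY_SEARCH };

  /* First-tier binning: choose the cheaper kernel for an oriented
   * edge (u, v) with du <= dv, following Eq. (1) and Eq. (2).    */
  static enum Kernel pick_kernel(int64_t du, int64_t dv) {
      double intersection_work = (double)(du + dv);        /* Eq. (1) */
      double binary_work = (double)du * log2((double)dv);  /* Eq. (2) */
      return (intersection_work <= binary_work) ? SORTED_SET
                                                : BINARY_SEARCH;
  }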
b) Finer grain binning: Fig. 1 depicts an edge list with the estimated amount of work for each edge. For each edge, the method with the minimal amount of estimated work is selected using Eq. 1 and Eq. 2. The yellow boxes denote edges that use the sorted set intersection approach and the blue boxes represent edges that use the binary search. The second row represents an edge ordering where the edges are placed into two bins, one for each approach. For the vectorized algorithm, this can lead to significant workload imbalance.
Algorithm 2: Branch-avoiding intersection with conditional instructions. Variants of the branch-based and branch-avoiding algorithms can be found in [7].

  Tri-Counting-Branch-Avoiding-Conditionals()
    ai ← 0; bi ← 0; count ← 0;
    while (ai < |A| and bi < |B|) do
      CMP(A[ai], B[bi]);
      CADDEQ(count);  // Conditional ADD if (A[ai] = B[bi])
      CADDLEQ(ai);    // Conditional ADD if (A[ai] ≤ B[bi])
      CADDGEQ(bi);    // Conditional ADD if (A[ai] ≥ B[bi])
Algorithm 3: Vectorized sorted intersection kernel

  while (cond) {
    // Gather the current element of each lane's A and B arrays.
    AVec = _mm512_i32gather_epi32(indexA, EArr, 4);
    BVec = _mm512_i32gather_epi32(indexB, EArr, 4);
    // Per-lane comparisons; only active lanes (cond) set mask bits.
    cmpLeVec = _mm512_mask_cmple_epi32_mask(cond, AVec, BVec);
    cmpGeVec = _mm512_mask_cmpge_epi32_mask(cond, AVec, BVec);
    cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, AVec, BVec);
    // Masked adds: count matches and advance each lane's cursors.
    tris   = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);
    indexA = _mm512_mask_add_epi32(indexA, cmpLeVec, indexA, miOne32);
    indexB = _mm512_mask_add_epi32(indexB, cmpGeVec, indexB, miOne32);
    // Deactivate lanes whose cursors reached their array bounds.
    cond = _mm512_mask_cmpgt_epi32_mask(cond, indexAStop, indexA);
    cond = _mm512_mask_cmpgt_epi32_mask(cond, indexBStop, indexB);
  }
the threads get an equal amount of work. This will become evident in Section V, where the scaling of the algorithm is almost perfectly linear.
f) Time Complexity Analysis: Phase 1: finding the proper bin takes O(|E|) steps for evaluating |E| edges. Phase 2: O(B^2) steps to compute the prefix matrices. Phase 3: an additional O(|E|) steps to reorder the edge list. Phase 1 and Phase 3 are embarrassingly parallel and easily split across the P threads. Phase 2 can also be done in parallel; however, the cost of the prefix operation on arrays of size B^2 is small in comparison to the other two phases, so a sequential implementation is sufficient. Recall that B ∈ {32, 64} and that B^2 ≪ |E|. The total time complexity is O(|E| + B^2) = O(|E|).
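The three phases follow a counting-sort pattern. Below is a sequential C sketch for a single kernel's row of bins, assuming the per-edge work estimates have already been computed and that the bin index is the floor of the base-2 logarithm of the work; all names are ours.

  #include <stdint.h>

  #define B 32  /* number of LRB bins; recall B ∈ {32, 64} */

  /* LRB reordering sketch. work[e] is the estimated work of edge e;
   * out[] (size nEdges) receives the reordered edge ids.
   * Phase 1: histogram edges by floor(log2(work)).
   * Phase 2: exclusive prefix sum over the B counters.
   * Phase 3: stable scatter of edge ids into the new order.      */
  void lrb_reorder(const int64_t *work, int64_t nEdges, int64_t *out) {
      int64_t counts[B] = {0}, offsets[B];

      /* Phase 1: O(|E|) binning by logarithm of estimated work. */
      for (int64_t e = 0; e < nEdges; e++) {
          int bin = 0;
          for (int64_t w = work[e]; w > 1 && bin < B - 1; w >>= 1) bin++;
          counts[bin]++;
      }
      /* Phase 2: prefix sum (O(B^2) over the full 2D bin grid). */
      int64_t sum = 0;
      for (int b = 0; b < B; b++) { offsets[b] = sum; sum += counts[b]; }

      /* Phase 3: O(|E|) scatter into the reordered edge list. */
      for (int64_t e = 0; e < nEdges; e++) {
          int bin = 0;
          for (int64_t w = work[e]; w > 1 && bin < B - 1; w >>= 1) bin++;
          out[offsets[bin]++] = e;
      }
  }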
g) Storage Complexity Analysis: We use CSR (compressed sparse row) to represent the original graph and assume that the adjacency arrays are sorted. For triangle counting, CSR requires two arrays: one for the offsets, of size O(|V|), and one for the indices (edges), of size O(|E|). The binning technique used by LRB stores the edges in a different order; we use an additional array of size O(|E|) to store these edges. While this new edge list does not increase the theoretical upper bound, from a practical perspective it does double the memory consumption. The edges in the reordered edge list determine the order in which the edges will be intersected; however, the intersection process itself uses the sorted adjacency arrays in the CSR graph.
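For concreteness, the storage layout can be sketched as the following C struct (the field names are ours, not taken from the original implementation).

  #include <stdint.h>

  /* CSR graph with sorted adjacency arrays plus the LRB-reordered
   * edge list. The reordered list only fixes the processing order;
   * intersections still read the sorted ind[] arrays.            */
  typedef struct {
      int64_t nv, ne;
      int64_t *off;      /* offsets, size |V|+1                  */
      int32_t *ind;      /* sorted adjacency indices, size |E|   */
      int64_t *lrbEdges; /* LRB-reordered edge ids, size |E|     */
  } CSRGraph;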
IV. VECTORIZED ALGORITHMS
In this section we present our new branch-avoiding and vectorized triangle counting algorithms. Alg. 2 depicts the sorted list intersection using the branch-avoiding programming model [9], [7]. Note that the control flow of the branch-avoiding algorithms is largely independent of the input values; this is a key enabling factor for our vectorized approach. See Green et al. [9], [7] for additional discussion of the branch-avoiding programming model.
Algorithm 4: Vectorized binary search kernel

  while (cond) {
    // middle = (low + high) / 2, computed per lane with a shift.
    sumVec = _mm512_add_epi32(low, high);
    middle = _mm512_maskz_srl_epi32(cond, sumVec, oneShifter);
    // Gather the probed element for each active lane's search.
    vals = _mm512_mask_i32gather_epi32(vals, cond, middle, EArr, 4);
    cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, vals, keys);
    cmpLtVec = _mm512_mask_cmplt_epi32_mask(cond, vals, keys);
    cmpGtVec = _mm512_mask_cmpgt_epi32_mask(cond, vals, keys);
    // Count matches; narrow [low, high] for the remaining lanes.
    tris = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);
    low  = _mm512_mask_add_epi32(low, cmpLtVec, middle, miOne32);
    high = _mm512_mask_add_epi32(high, cmpGtVec, middle, miMOne32);
    // A lane finishes on a match or when its search range empties.
    cond = _mm512_mask_cmpge_epi32_mask(cond & ~cmpEqVec, high, low);
  }
Alg. 2 depicts a branch-avoiding algorithm for list intersection; this algorithm uses conditional instructions. Such instructions do not exist in all architectures and are in fact designed for single-control-flow systems, where a single instruction is executed based on hardware flags (zero flag, carry flag, and overflow flag). As such, a naive vector implementation might be constrained to a single set of these flags or would require a single control flow. We show how to overcome this hardware constraint using the AVX-512 instruction set. Specifically, we show 1) how to increase the number of software control flows and 2) how to control the execution of each lane using masks (even though we do not have enough hardware flags).
Our experience with conditional instructions is that the compiler is often unable to determine how to use them. We strongly differentiate between compare and branch instructions: compare instructions are used for evaluating different values, whereas a branch typically uses the compare output to decide on the next sequence of executable instructions.
a) Vectorized Intersections: We start by describing how to increase the control flow. Vectorized triangle counting can be implemented in a variety of ways using the branch-avoiding model. In the first approach, the different lanes work together on the same intersection (consisting of two arrays). This approach was shown to be effective in the GPU's SIMT programming model [11], [13]; however, it is significantly more challenging to implement with vector instructions. In the second approach, each lane in the vector unit is responsible for a different intersection. Thus, each vector unit requires 2·K different adjacency arrays for K different intersections.
We choose the second of these approaches: each lane is responsible for a different intersection. This also removes the overhead of the partitioning scheme found in [13]. Thus, the maximal number of concurrent intersections (software threads) that can be executed on a system is Concurrent = P·K. A sketch of how one such batch of intersections could be set up is shown below.
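As an illustrative sketch (all names are ours), the per-lane cursors for one batch of K = 16 LRB-binned edges could be prepared as follows before entering the kernel of Alg. 3. The 32-bit cursors assume |E| < 2^31, since the gathers index with 32-bit lanes.

  #include <stdint.h>

  /* Fill per-lane cursors for one vector batch: lane l gets its own
   * intersection. src/dst give the endpoints of each edge, off is
   * the CSR offsets array, and edges holds K edge ids taken from
   * the same LRB bin (so all lanes have similar work).           */
  void load_batch(const int64_t *off, const int32_t *src,
                  const int32_t *dst, const int64_t *edges,
                  int32_t indexA[16], int32_t indexAStop[16],
                  int32_t indexB[16], int32_t indexBStop[16]) {
      for (int l = 0; l < 16; l++) {        /* K = 16 for AVX-512 */
          int64_t e = edges[l];
          int32_t u = src[e], v = dst[e];
          indexA[l] = (int32_t)off[u]; indexAStop[l] = (int32_t)off[u + 1];
          indexB[l] = (int32_t)off[v]; indexBStop[l] = (int32_t)off[v + 1];
      }
  }

These arrays would then be loaded into vector registers (e.g., with _mm512_loadu_si512) to initialize indexA, indexB, and the stop vectors of Alg. 3.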
The branch-avoiding algorithm in Alg. 2 provides an initial "recipe" for implementing a vectorized triangle counting algorithm. Note that the number of data-dependent branches has been significantly reduced, yet one data-dependent condition remains in the control flow: the WHILE loop's condition, which checks the bounds of the two respective arrays. To vectorize the algorithm, this condition also needs to be vectorized, and this is by no means trivial as this condition controls the entire WHILE loop. The vectorized version of the algorithm
TABLE I
NETWORKS USED IN OUR EXPERIMENTS.

  Name              |V|         |E|
  amazon0312        400,727     3,200,440
  amazon0505        410,236     3,356,440
  amazon0601        403,394     3,387,388
  cit-HepTh         27,770      352,285
  cit-Patents       3,774,768   16,518,947
  email-EuAll       265,214     364,481
  g500-s21-ef16     1,243,072   31,731,650
  g500-s22-ef16     2,393,285   64,097,004
  g500-s23-ef16     4,606,314   129,250,705
  g500-s24-ef16     8,860,450   260,261,843
  g500-s25-ef16     17,043,780  523,467,448
  soc-Epinions1     75,879      405,740
  soc-LiveJournal1  4,847,571   68,993,773
  soc-Slashdot0811  77,360      469,180
  soc-Slashdot0902  82,168      504,230
ensures that certain data lanes will be ignored if the bounds of the indices for that lane are exceeded; each of the K lanes is responsible for managing its own bounds. Similar restrictions exist for the binary search based intersection (Alg. 4).
Alg. 3 and Alg. 4 depict the vectorized code for the sorted set intersection approach and the binary search approach, respectively. These algorithms show close-to-real vector code (using Intel's AVX-512 instruction set) rather than pseudo-code. This allows highlighting the following:
•The vectorized algorithms require gathering the elements for K intersections (using 2·K arrays) instead of just two arrays for a scalar execution. The introduction of efficient gather instructions has greatly simplified the process of collecting elements from random locations in memory.
•The AVX-512 instruction set introduces masked vector instructions. These masks enable operating on a subset of the vector lanes and updating the counters for each lane. Masked instructions are not conditional instructions. Specifically, masked instructions are always executed across all the lanes; however, some data lanes might not be updated, based on the value of the mask. Another key difference is that conditional instructions were designed for a single control flow (one per thread), whereas the masked operations allow a vector-wide control flow (multiple control flows). This distinction is the reason that a single conditional instruction is replaced with two masked instructions. For example, the CADDEQ operation is replaced with a vector CMPEQ instruction followed by a masked vector add (see the sketch below). While this obviously incurs a performance penalty, it also enables scaling the algorithm across the vector.
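As a minimal sketch of this one-to-two mapping, using real AVX-512 intrinsics (the wrapper function and its names are ours):

  #include <immintrin.h>

  /* CADDEQ(count) expressed as one vector compare plus one masked add:
   * lanes where A[ai] == B[bi] receive count + 1; others keep count. */
  static __m512i caddeq_vec(__m512i AVec, __m512i BVec,
                            __m512i count, __m512i ones) {
      __mmask16 eq = _mm512_cmpeq_epi32_mask(AVec, BVec);
      return _mm512_mask_add_epi32(count, eq, count, ones);
  }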
V. PERFORMANCE ANALYSIS
a) Experiment System: The experiments presented in this paper are primarily executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM (102.4 GB/s peak bandwidth). This processor is part of the Xeon Phi Knights Landing series. In addition to the main memory, the processor has 16GB of MCDRAM high-bandwidth memory (400 GB/s peak bandwidth), which is used as our primary memory; if the graph fits into the MCDRAM, the DRAM is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 hardware threads (4-way SMT). These cores run at a 1.3 GHz clock and share a 32MB L2 cache. Given these system parameters and using our new algorithms, we are able to execute up to 4352
TABLE II
DIFFERENT PARALLEL VARIATIONS OF OUR TRIANGLE COUNTING ALGORITHMS. † DENOTES OUR FASTEST IMPLEMENTATION.

  Algorithm name    Description
  Mixed-EdgeList    Simple algorithm that selects the intersection method based on edge properties.
  lrb-scalar        Scalar implementation of our LRB load-balancing.
  lrb-scalar-dod    Scalar implementation that includes the direction-optimized graph.
  lrb-vector        Vectorized (branch-avoiding) implementation of our LRB load-balancing.
  lrb-vector-dod †  Vectorized (branch-avoiding) implementation including the direction-optimized graph.
concurrent intersections¹. We also provide results for a dual Intel Xeon 8160 (Skylake) system with 48 cores (96 threads with hyper-threading), 32MB LLC, and 192GB of DDR4-2400 memory. All code, on both systems, is compiled with the Intel Compiler (icc, version 2017).
b) Inputs: The algorithms are tested using real-world graphs and networks taken from SNAP [16] and the HPEC Graph Challenge [24]; see Table I. By default, all graphs are treated as undirected. Directed graphs are transposed, and duplicate edges created in this phase are removed. Our algorithm can also utilize the optimization of finding triangles in a directed graph (where only half the edges exist). This concept is used in [23], [20], [19] and is referred to as the DOD graph in [19], which is the terminology we use in this paper.
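As a sketch of the idea (the cited works each have their own construction; this helper and its degree-then-id tie-break, which mirrors the orientation rule of Sec. III, are ours):

  #include <stdbool.h>
  #include <stdint.h>

  /* Keep each undirected edge {u,v} only in the direction of increasing
   * degree (ties broken by vertex id), halving the edges to intersect. */
  static bool keep_edge(int32_t u, int32_t v, const int64_t *deg) {
      return (deg[u] < deg[v]) || (deg[u] == deg[v] && u < v);
  }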
c) LRB Analysis: Our implementation incorporates multiple optimizations. To capture the benefit of each of these optimizations, we execute our algorithm in several different configurations; Table II describes the variants we use.
Fig. 3 depicts various performance characteristics of our new algorithms and the various optimizations for the soc-LiveJournal1 graph; similar results were seen for other graphs. Note that the abscissa is log scale in all the sub-figures. Fig. 3 (a) depicts the execution time as a function of the number of threads, and Fig. 3 (b) depicts the speedup of each of these configurations in comparison with a sequential execution of that algorithm. For all these configurations, the parallel scalability is near linear all the way up to 68 threads, which is the number of physical cores on the KNL system used in our experiments. While there is some performance improvement beyond 68 threads, the scaling is no longer linear. This is a well-known artifact of multiple threads per core sharing resources. Yet it also shows that LRB is successful as a load-balancing mechanism.
Fig. 3 (c) highlights the contributions of the different optimizations of our algorithm. For each thread count, all algorithms are normalized against the "Mixed-EdgeList" implementation at that thread count. Going from the scalar execution to the vectorized execution typically increases performance by roughly 2.5x, for both the regular graph and the DOD graph. For other graphs, vectorization increased performance by as much as 5x. Applying all these optimizations together greatly improves performance over an already optimized algorithm (one that selects an ideal intersection kernel for each edge). Specifically, for soc-LiveJournal this improves performance by an average

¹We note that parallelism may be limited in practice by the number of vector units. To the best of our knowledge, four threads (a single core) share two VPUs [25].
counting algorithms, including several HPEC Graph Challenge champions. On average, our algorithm outperformed KOKKOS, an SpMV-based HPEC Graph Challenge champion implementation that uses vector instructions, by 2.5x. Our new algorithm is also up to 4x faster than the fastest GPU algorithm (running on an NVIDIA P100 GPU). There are numerous instances where our new algorithm is 5x-10x faster than these algorithms.
ACKNOWLEDGMENTS
Funding was provided in part by the Defense Advanced
Research Projects Agency (DARPA) under Contract Number
FA8750-17-C-0086. This work was partially funded by the
Doctoral Studies Program at Sandia National Laboratories.
Sandia National Laboratories is a multimission laboratory
managed and operated by National Technology & Engineer-
ing Solutions of Sandia, LLC, a wholly owned subsidiary
of Honeywell International Inc., for the U.S. Department
of Energy’s National Nuclear Security Administration under
contract DE-NA0003525. The content of the information in
this document does not necessarily reflect the position or
the policy of the Government, and no official endorsement
should be inferred. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation hereon. The authors
acknowledge the Texas Advanced Computing Center (TACC)
at The University of Texas at Austin for providing HPC re-
sources that have contributed to the research results reported
within this paper.
REFERENCES
[1] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient Semi-
streaming Algorithms for Local Triangle Counting in Massive
Graphs,” in 14th ACM SIGKDD Int’l Conf. on Knowledge Discovery
and Data Mining, 2008, pp. 16–24.
[2] M. Bisson and M. Fatica, “Static graph challenge on gpu,” in High
Performance Extreme Computing Conference (HPEC), 2017 IEEE.
IEEE, 2017, pp. 1–8.
[3] S. Chu and J. Cheng, “Triangle listing in massive networks and its
applications,” in Proceedings of the 17th ACM SIGKDD Int’l Conf.
on Knowledge Discovery and Data Mining, 2011, pp. 672–680.
[4] J. Cohen, “Trusses: Cohesive Subgraphs for Social Network Analysis,”
National Security Agency Technical Report, p. 16, 2008.
[5] J. Fox, O. Green, K. Gabert, X. An, and D. Bader, “Fast and Adaptive
List Intersections on the GPU,” in IEEE Proc. High Performance
Extreme Computing (HPEC), Waltham, MA, 2018.
[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Pow-
erGraph: Distributed Graph-Parallel Computation on Natural Graphs,”
in OSDI, vol. 12, 2012.
[7] O. Green, “When Merging and Branch Predictors Collide,” in IEEE
Fourth Workshop on Irregular Applications: Architectures and Algo-
rithms, 2014, pp. 33–40.
[8] O. Green and D. Bader, “Faster Clustering Coefficients Using Vertex
Covers,” in 5th ASE/IEEE International Conference on Social Com-
puting, ser. SocialCom, 2013.
[9] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Al-
gorithms,” in 27th ACM on Symposium on Parallelism in Algorithms
and Architectures, 2015, pp. 212–223.
[10] O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia,
S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, and D. Bader,
“Quickly Finding a Truss in a Haystack,” in IEEE Proc. High
Performance Extreme Computing (HPEC), Waltham, MA, 2017.
[11] O. Green, R. McColl, and D. Bader, “GPU Merge Path: A GPU
Merging Algorithm,” in 26th ACM International Conference on Su-
percomputing, 2012, pp. 331–340.
[12] O. Green, L. Munguia, and D. Bader, “Load Balanced Clustering Co-
efficients,” in ACM Workshop on Parallel Programming for Analytics
Applications (PPAA), Feb. 2014.
[13] O. Green, P. Yalamanchili, and L. Munguía, “Fast Triangle Counting
on the GPU,” in IEEE Fourth Workshop on Irregular Applications:
Architectures and Algorithms, 2014, pp. 1–8.
[14] F. Khorasani, K. Vora, R. Gupta, and L. Bhuyan, “CuSha: Vertex-
Centric Graph Processing on GPUs,” in 23rd ACM Int’l Symp. on
High-Performance Parallel and Distributed Computing (HPDC), 2014,
pp. 239–252.
[15] A. Leist, K. Hawick, D. Playne, and N. S. Albany, “GPGPU and Multi-
Core Architectures for Computing Clustering Coefficients of Irregular
Graphs,” in Int’l Conf. on Scientific Computing (CSC’11), 2011.
[16] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection,” http://snap.stanford.edu/data.
[17] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of
algorithms for network community detection,” in Proceedings of the
19th Int’l Conf. on World Wide Web. ACM, 2010, pp. 631–640.
[18] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random Graph
Models of Social Networks,” Proceedings of the National Academy of
Sciences, vol. 99, no. suppl 1, pp. 2566–2572, 2002.
[19] R. Pearce, “Triangle counting for scale-free graphs at scale in dis-
tributed memory,” in High Performance Extreme Computing Confer-
ence (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–4.
[20] A. Polak, “Counting triangles in large graphs on GPU,” arXiv preprint
arXiv:1503.00576, 2015.
[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-
Pey, “Shaping Communities out of Triangles,” in Proceedings of the
21st ACM International Conference on Information and Knowledge
Management, ser. CIKM ’12, 2012, pp. 1677–1681.
[22] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,
P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and
J. Kepner, “Static Graph Challenge: Subgraph Isomorphism,” in IEEE
Proc. High Performance Extreme Computing (HPEC), Waltham, MA,
2017.
[23] J. Shun and K. Tangwongsan, “Multicore Triangle Computations
Without Tuning,” in IEEE Int’l Conf. on Data Engineering (ICDE),
2015.
[24] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,
P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and
J. Kepner, “Static graph challenge: Subgraph isomorphism,” in IEEE
Proc. High Performance Extreme Computing (HPEC), Waltham, MA,
2017.
[25] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights land-
ing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36,
no. 2, pp. 34–46, 2016.
[26] T. Schank and D. Wagner, “Finding, Counting and Listing All Tri-
angles in Large Graphs, an Experimental Study,” in Experimental &
Efficient Algorithms. Springer, 2005, pp. 606–609.
[27] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman,
M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring opti-
mizations on shared-memory platforms for parallel triangle counting
algorithms,” in High Performance Extreme Computing Conference
(HPEC), 2017 IEEE. IEEE, 2017, pp. 1–7.
[28] J. Wang and J. Cheng, “Truss Decomposition in Massive Networks,”
Proceedings of the VLDB Endowment, vol. 5, no. 9, pp. 812–823,
2012.
[29] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study
on exact triangle counting algorithms on the gpu,” in Proceedings of
the ACM Workshop on High Performance Graph Processing. ACM,
2016, pp. 1–8.
[30] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’
Networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[31] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Rajaman-
ickam, “Fast linear algebra-based triangle counting with kokkosker-
nels,” in High Performance Extreme Computing Conference (HPEC),
2017 IEEE. IEEE, 2017, pp. 1–7.
[32] J. Yang and J. Leskovec, “Defining and evaluating network communi-
ties based on ground-truth,” in Data Mining (ICDM), 2012 IEEE 12th
International Conference on. IEEE, 2012, pp. 745–754.