Multicore Triangle Computations
Without Tuning
Julian Shun and Kanat Tangwongsan
Presentation is based on paper published in International
Conference on Data Engineering (ICDE), 2015
Triangle Computations
Triangle Counting
[Figure: example social network on Alice, Bob, Carol, David, Eve, Fred, Greg, and Hannah; Count = 3]
Other variants:
  Triangle listing
  Local triangle counting/clustering coefficients
  Triangle enumeration
  Approximate counting
  Analogs on directed graphs
Numerous applications…
  Social network analysis, Web structure, spam detection, outlier
  detection, dense subgraph mining, 3-way database joins, etc.
Need fast triangle computation algorithms!
Sequential Triangle Computation Algorithms
V = # vertices, E = # edges
Sequential algorithms for exact counting/listing
  Naïve algorithm of trying all triplets
    O(V^3) work
  Node-iterator algorithm [Schank]
    O(VE) work
  Edge-iterator algorithm [Itai-Rodeh]
    O(VE) work
  Tree-lister [Itai-Rodeh], forward/compact-forward [Schank-Wagner, Latapy]
    O(E^1.5) work
Sequential algorithms via matrix multiplication
  O(V^2.37) work to compute A^3, where A is the adjacency matrix
  O(E^1.41) work [Alon-Yuster-Zwick]
  These require superlinear space
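As a small illustration of the matrix-multiplication route, the sketch below counts triangles as trace(A^3) / 6. It uses plain cubic-time multiplication, not fast matrix multiplication, so it demonstrates the identity only and not the O(V^2.37) bound. The function names and the example graph are illustrative assumptions, and the input is assumed to be a simple undirected graph (no self-loops, no duplicate edges).

```python
def mat_mul(x, y):
    """Multiply two n-by-n matrices stored as lists of lists (cubic time)."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles_trace(n, edges):
    """Count triangles in a simple undirected graph as trace(A^3) / 6."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1
    a3 = mat_mul(mat_mul(a, a), a)
    # Each triangle yields 6 closed walks of length 3:
    # 3 choices of start vertex times 2 directions.
    return sum(a3[i][i] for i in range(n)) // 6


# Hypothetical example: the complete graph on 4 vertices.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles_trace(4, k4))
```

The dense n-by-n matrix is what makes this approach need superlinear space on sparse graphs.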
Sequential Triangle Computation
Algorithms
Source: “Algorithmic Aspects of Triangle-Based Network
Analysis”, Dissertation by Thomas Schank
Parallel Triangle Computation Algorithms
Most designed for distributed memory
  MapReduce algorithms [Cohen '09, Suri-Vassilvitskii '11, Park-Chung '13, Park et al. '14]
  MPI algorithms [Arifuzzaman et al. '13, GraphLab]
Multicores are everywhere!
  Node-iterator algorithm [Green et al. '14]
    O(VE) work in worst case
Can we obtain an O(E^1.5) work shared-memory multicore algorithm?
Triangle Computation:
Challenges for Shared Memory Machines
1. Irregular computation
2. Deep memory hierarchy
Cache Complexity Model
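[Diagram: CPU, cache of size M, and main memory; cache accesses are free, and transferring a line of size B from main memory has unit cost. The External Memory Model has the same shape, with main memory of size M in place of the cache and disk in place of main memory.]
Complexity = # cache misses (# disk accesses in the External Memory Model)
Cache-aware (external-memory) algorithms: have knowledge of M and B
Cache-oblivious algorithms: no knowledge of parameters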
Cache Oblivious Model [Frigo et al. ‘99]
[Diagram: CPU, cache of size M with block size B, and main memory]
Algorithm works well regardless of cache parameters
Works well on multi-level hierarchies
Parallel Cache Oblivious Model for hierarchies of shared and private caches [Blelloch et al. '11]
[Diagram: CPU with L1, L2, and L3 caches of sizes M1, M2, M3 and block sizes B1, B2, B3]
Primitive          Work        Depth       Cache Complexity
Scan/filter/merge  O(n)        O(log n)    O(n/B)
Sort               O(n log n)  O(log^2 n)  O((n/B) log_{M/B}(n/B))
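To make the scan and filter primitives concrete, here is a minimal sequential sketch of a filter built on an exclusive prefix sum (scan). It is only a simulation: in a parallel implementation the flag computation and the scatter are independent across positions, and the scan is the standard parallel prefix sum, which is where the O(n) work and O(log n) depth come from. The names `exclusive_scan` and `filter_by_scan` are illustrative assumptions.

```python
def exclusive_scan(values):
    """Return (prefix, total) where prefix[i] = sum(values[:i])."""
    prefix, total = [], 0
    for x in values:
        prefix.append(total)
        total += x
    return prefix, total


def filter_by_scan(items, keep):
    """Keep the items satisfying `keep`, preserving order, via flags + scan."""
    # 1. Mark survivors (independent per element).
    flags = [1 if keep(x) else 0 for x in items]
    # 2. Scan the flags to get each survivor's output position.
    offsets, count = exclusive_scan(flags)
    # 3. Scatter survivors to their positions (independent per element).
    result = [None] * count
    for i, x in enumerate(items):
        if flags[i]:
            result[offsets[i]] = x
    return result


print(filter_by_scan([5, 2, 8, 1, 9], lambda x: x > 4))
```

Both the input and the output are read and written contiguously, which is consistent with the O(n/B) cache complexity listed for scan/filter.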
External-Memory and Cache-Oblivious Triangle Computation
All previous algorithms are sequential
External-memory (cache-aware) algorithms
  Natural-join: O(E^3/(M^2 B)) I/O's
  Node-iterator [Dementiev '06]: O((E^1.5/B) log_{M/B}(E/B)) I/O's
  Compact-forward [Menegola '10]: O(E + E^1.5/B) I/O's
  [Chu-Cheng '11, Hu et al. '13]: O(E^2/(MB) + #triangles/B) I/O's
External-memory and cache-oblivious
  [Pagh-Silvestri '14]: O(E^1.5/(M^0.5 B)) I/O's or cache misses
Parallel cache-oblivious algorithms?
Our Contributions
1. Parallel Cache-Oblivious Triangle Counting Algs
Algorithm            Work             Depth       Cache Complexity
TC-Merge             O(E^1.5)         O(log^2 E)  O(E + E^1.5/B)
TC-Hash              O(V log V + αE)  O(log^2 E)  O(sort(V) + αE)
Par. Pagh-Silvestri  O(E^1.5)         O(log^3 E)  O(E^1.5/(M^0.5 B))
V = # vertices, E = # edges, α = arboricity (at most E^0.5)
M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B)
2. Extensions to Other Triangle Computations:
Enumeration, Listing, Local Counting/Clustering Coefficients,
Approx. Counting, Variants on Directed Graphs
3. Extensive Experimental Study
Sequential Triangle Counting (Exact)
(Forward/compact-forward algorithm) [Schank-Wagner '05, Latapy '08]
[Figure: example graph on vertices 0, 1, 2, 3, 4]
Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher
ranked neighbors
Step 2:
for each vertex v:
  for each w in A[v]:
    count += intersect(A[v], A[w])
Work = O(E^1.5)
Gives all triangles (v, w, x) where rank(v) < rank(w) < rank(x)
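The two steps can be sketched sequentially as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple undirected edge list on vertices 0..n-1, breaks degree ties by vertex id, and relabels every vertex by its rank so that each A[v] is a sorted list of integers. All function names are hypothetical.

```python
def rank_and_orient(n, edges):
    """Step 1: rank vertices by degree; A[r] holds the higher-ranked
    neighbors (as ranks, sorted) of the vertex whose rank is r."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    order = sorted(range(n), key=lambda v: (degree[v], v))  # ties by id
    rank = [0] * n
    for r, v in enumerate(order):
        rank[v] = r
    a = [[] for _ in range(n)]
    for u, v in edges:
        lo, hi = sorted((rank[u], rank[v]))
        a[lo].append(hi)
    for neighbors in a:
        neighbors.sort()
    return a


def intersect_merge(xs, ys):
    """Count common elements of two sorted lists with a linear merge."""
    i = j = count = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            count += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles_forward(n, edges):
    """Step 2: sum intersect(A[v], A[w]) over all oriented edges (v, w)."""
    a = rank_and_orient(n, edges)
    return sum(intersect_merge(a[v], a[w]) for v in range(n) for w in a[v])


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles_forward(4, k4))
```

Because every edge is stored only at its lower-ranked endpoint, each triangle is found exactly once, at its lowest-ranked vertex.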
Proof of O(E^1.5) work bound when intersect
uses merging
Algorithm:
  Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher ranked neighbors
  Step 2:
  for each vertex v:
    for each w in A[v]:
      count += intersect(A[v], A[w])
Step 1: O(E + V log V) work
Step 2:
  For each edge (v,w), intersect does O(d+(v) + d+(w)) work, where d+(v) = |A[v]|
  For all v, d+(v) ≤ E^0.5
    If d+(v) > E^0.5, each of its higher-ranked neighbors also has degree > E^0.5, so the total number of directed edges would exceed E, a contradiction
  Total work = E * O(E^0.5) = O(E^1.5)
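The key inequality in the proof can be checked numerically. The sketch below computes the largest d+(v) under a degree ranking and compares it with the bound. It assumes the input is an undirected edge list with m edges and that the slide's E counts directed edges (each undirected edge in both directions, so E = 2m); under that reading the bound is d+(v)^2 ≤ 2m. The function name and the test graphs are illustrative.

```python
def max_higher_ranked_degree(n, edges):
    """Return the maximum over v of d+(v), the number of neighbors
    ranked higher than v when vertices are ranked by (degree, id)."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    out_degree = [0] * n
    for u, v in edges:
        # The lower-ranked endpoint of each edge stores the other endpoint.
        lower = u if (degree[u], u) < (degree[v], v) else v
        out_degree[lower] += 1
    return max(out_degree, default=0)


# Complete graph on 5 vertices: m = 10 undirected edges.
k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
d_max = max_higher_ranked_degree(5, k5)
print(d_max, d_max ** 2 <= 2 * len(k5))

# Star with 6 leaves: the high-degree center is ranked last.
star = [(0, i) for i in range(1, 7)]
print(max_higher_ranked_degree(7, star))
```

The star shows why ranking by degree matters: the center has degree 6 but stores no neighbors, so no single intersect call scans its full adjacency list.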
Parallel Triangle Counting (Exact)
Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher
ranked neighbors
  Parallel sort and filter
Step 2:
parallel_for each vertex v:
  parallel_for each w in A[v]:
    count += intersect(A[v], A[w])
  count: Parallel reduction
  intersect: Parallel merge (TC-Merge) or Parallel hash table (TC-Hash)
Step 1
  Work = O(E + V log V)
  Depth = O(log^2 V)
  Cache = O(E + sort(V))
[Figure: a parfor over v in V (v = 0, ..., 4) spawns a parfor over w in A[v] for each v; every intersect(A[v], A[w]) result feeds a sum, and all intersect calls are safe to run in parallel]
TC-Merge and TC-Hash Details
Step 2:
parallel_for each vertex v:
  parallel_for each w in A[v]:
    count += intersect(A[v], A[w])
  count: Parallel reduction
  intersect: Parallel merge (TC-Merge) or Parallel hash table (TC-Hash)
Step 2: TC-Merge
  Work = O(E^1.5)
  Depth = O(log^2 E)
  Cache = O(E + E^1.5/B)
Step 2: TC-Hash
  Work = O(αE)
  Depth = O(log E)
  Cache = O(αE)
  (α = arboricity (at most E^0.5))
TC-Merge
  Intersect: use a parallel and cache-oblivious merge based on divide-and-conquer [Blelloch et al. '11]
TC-Hash
  Preprocessing: for each vertex, create parallel hash table storing edges [Shun-Blelloch '14]
  Intersect: scan smaller list, querying hash table of larger list in parallel
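The TC-Hash intersection strategy can be sketched sequentially as below. Python's built-in `set` stands in for the parallel hash table of [Shun-Blelloch '14], which is an assumption made for illustration only, and `adj` is assumed to already hold, for each vertex, its higher-ranked neighbors (the output of Step 1). The function name is hypothetical.

```python
def count_triangles_hash(adj):
    """adj[v] lists the higher-ranked neighbors of v, in any order."""
    # Preprocessing: one hash table per vertex storing its edges.
    tables = [set(neighbors) for neighbors in adj]
    count = 0
    for v, neighbors in enumerate(adj):
        for w in neighbors:
            # Scan the smaller list and query the table of the larger one,
            # so each intersection costs O(min(d+(v), d+(w))) lookups.
            if len(adj[v]) <= len(adj[w]):
                small, large = v, w
            else:
                small, large = w, v
            count += sum(1 for x in adj[small] if x in tables[large])
    return count


# Complete graph on 4 vertices, oriented from lower to higher id.
k4_oriented = [[1, 2, 3], [2, 3], [3], []]
print(count_triangles_hash(k4_oriented))
```

Unlike the merge, the hash lookups touch memory in an unpredictable order, which is one plausible reason for the weaker cache bound listed for TC-Hash.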
Comparison of Complexity Bounds
Algorithm                     Work                     Depth       Cache Complexity
TC-Merge                      O(E^1.5)                 O(log^2 E)  O(E + E^1.5/B) (oblivious)
TC-Hash                       O(V log V + αE)          O(log^2 E)  O(sort(V) + αE) (oblivious)
Par. Pagh-Silvestri           O(E^1.5)                 O(log^3 E)  O(E^1.5/(M^0.5 B)) (oblivious)
Chu-Cheng '11, Hu et al. '13  O(E log E + E^2/M + αE)  -           O(E^2/(MB) + #triangles/B) (aware)
Pagh-Silvestri '14            O(E^1.5)                 -           O(E^1.5/(M^0.5 B)) (oblivious)
Green et al. '14              O(VE)                    O(log E)    -
V = # vertices, E = # edges, α = arboricity (at most E^0.5)
M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B)
Extensions of Exact Counting Algorithms
Triangle enumeration
  Call emit function whenever triangle is found
  Listing: add each triangle to a hash table; return its contents at the end
  Local counting/clustering coefficients: atomically increment
  count of three triangle endpoints
Directed triangle counting/enumeration
  Keep separate counts for different types of triangles
Approximate counting
  Use colorful triangle sampling scheme to create smaller sub-graph
  [Pagh-Tsourakakis '12]
  Run TC-Merge or TC-Hash on sub-graph with pE edges (0 < p < 1)
  and return #triangles/p^2 as estimate
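The enumeration, listing, and local-counting extensions all reduce to passing a different emit function to the same loop. The sequential sketch below illustrates this; `adj` is assumed to hold each vertex's higher-ranked neighbors, a plain list stands in for the hash table used for listing, and ordinary increments stand in for the atomic increments a parallel version would need. All names are hypothetical.

```python
def enumerate_triangles(adj, emit):
    """Call emit(v, w, x) once per triangle; adj[v] lists the
    higher-ranked neighbors of v."""
    for v, neighbors in enumerate(adj):
        for w in neighbors:
            higher_w = set(adj[w])
            for x in neighbors:
                if x in higher_w:
                    emit(v, w, x)


def list_triangles(adj):
    """Listing: collect every emitted triangle and return them at the end."""
    found = []
    enumerate_triangles(adj, lambda v, w, x: found.append((v, w, x)))
    return found


def local_triangle_counts(adj):
    """Local counting: bump the count of all three endpoints."""
    counts = [0] * len(adj)

    def bump(v, w, x):
        # A parallel version would use atomic increments here.
        counts[v] += 1
        counts[w] += 1
        counts[x] += 1

    enumerate_triangles(adj, bump)
    return counts


k4_oriented = [[1, 2, 3], [2, 3], [3], []]
print(list_triangles(k4_oriented))
print(local_triangle_counts(k4_oriented))
```

A vertex's clustering coefficient then follows from its local count t and degree d as t / (d * (d - 1) / 2) for d ≥ 2.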
Approximate Counting
Colorful triangle counting [Pagh-Tsourakakis '12]
Sampling rate: 0 < p < 1
Step 1: Assign random color in {1, …, 1/p} to each vertex (Parallel scan)
Step 2: Sampling: Keep edges whose endpoints have the same color (Parallel filter)
  Expected # edges = pE
Step 3: Run exact triangle counting on sampled graph, return Δsampled/p^2 (Use TC-Merge or TC-Hash)
Steps 1 & 2
  Work = O(E)
  Depth = O(log E)
  Cache = O(E/B)
Step 3: TC-Merge
  Work = O((pE)^1.5)
  Depth = O(log^2 E)
  Cache = O(pE + (pE)^1.5/B)
Step 3: TC-Hash
  Work = O(V log V + αpE)
  Depth = O(log E)
  Cache = O(sort(V) + pαE)
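The three steps of colorful sampling can be sketched as follows. A triangle survives the sampling only if all three of its vertices receive the same color, which happens with probability p^2, so dividing the sampled count by p^2 gives an unbiased estimate. The exact counter used here is a simple set-based stand-in for TC-Merge or TC-Hash, and the function names, the `seed` parameter, and the choice of `num_colors = 1/p` as an integer argument are illustrative assumptions.

```python
import random


def count_triangles_exact(n, edges):
    """Simple exact counter: orient edges from lower to higher id and
    intersect the out-neighborhoods of each oriented edge."""
    higher = [set() for _ in range(n)]
    for u, v in edges:
        higher[min(u, v)].add(max(u, v))
    return sum(len(higher[u] & higher[v]) for u in range(n) for v in higher[u])


def approx_triangles(n, edges, num_colors, seed=0):
    """Colorful triangle sampling with p = 1 / num_colors."""
    rng = random.Random(seed)
    # Step 1: assign a random color to each vertex.
    color = [rng.randrange(num_colors) for _ in range(n)]
    # Step 2: keep only edges whose endpoints share a color.
    sampled = [(u, v) for u, v in edges if color[u] == color[v]]
    # Step 3: count exactly on the sample and scale by 1 / p^2.
    p = 1.0 / num_colors
    return count_triangles_exact(n, sampled) / (p * p)


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(approx_triangles(4, k4, num_colors=1))  # p = 1 keeps every edge
```

On a graph this small the estimate is very noisy for p < 1; the accuracy figures reported in the slides are for graphs with millions to billions of edges.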
Experimental Setup
Implementations using Intel Cilk Plus
40-core Intel Nehalem machine (with 2-way hyper-threading)
4 sockets, each with 30MB shared L3 cache, 256KB private L2 caches
Sequential TC-Merge as baseline (faster than existing
sequential implementations)
Other multicore implementations: Green et al. and GraphLab
Our parallel Pagh-Silvestri algorithm was not competitive
Variety of real-world and artificial graphs
Both TC-Merge and TC-Hash scale well
with # of cores:
[Figure: two scaling plots, LiveJournal (4M vtxes, 34.6M edges) and Orkut (3M vtxes, 117M edges), annotated with speedups of ~27x and ~48x]
[Figure: bar chart of speedup over sequential TC-Merge (y-axis, 0 to 50) for TC-Merge, TC-Hash, Green et al., and GraphLab on random (V=100M, E=500M), rMat (V=134M, E=500M), 3D-grid (V=100M, E=300M), soc-LJ (V=5M, E=43M), Patents (V=3.7M, E=17M), com-LJ (V=4M, E=35M), and Orkut (V=3M, E=117M)]
TC-Merge always faster than TC-Hash (by 1.3-2.5x)
TC-Merge always faster than Green et al. or GraphLab
(by 2.1-5.2x)
Why is TC-Merge faster than TC-Hash?
[Figure: two bar charts, one for soc-LJ and one for Orkut, comparing TC-Merge and TC-Hash on Running Time, L3 Cache Misses, L2 Cache Misses, and # Ops for Intersect; values normalized to TC-Merge (y-axis, 0 to 2)]
TC-Hash less cache-efficient than TC-Merge
Running time more correlated with cache misses than work
Comparison to existing counting algs.
Twitter graph (41M vertices, 1.2B undirected edges, 34.8B triangles)
[Figure: bar chart of running time in Minutes (axis, 0 to 20) for TC-Merge (40 cores), GraphLab (40 cores), GraphLab (MPI, 64 nodes, 1024 cores), PATRIC (MPI, 1200 cores), Park and Chung (MapReduce, 47 nodes), and Suri and Vassilvitskii (MapReduce, 1636 nodes); two bars run past the axis and are labeled (213 minutes) and (423 minutes)]
Yahoo graph (1.4B vertices, 6.4B edges, 85.8B triangles)
on 40 cores: TC-Merge takes 78 seconds
Approximate counting algorithm achieves 99.6% accuracy in 9.1
seconds
Approximate counting
p = 1/25
Graph                    Accuracy  T_approx   T_approx/T_exact
Orkut (V=3M, E=117M)     99.8%     0.067 sec  0.035
(unlabeled row)          99.9%     2.4 sec    0.043
Yahoo (V=1.4B, E=6.4B)   99.6%     9.1 sec    0.117
[Figure: T_approx/T_exact (y-axis, 0 to 0.5) versus p (x-axis, 0 to 0.5) for soc-LJ, com-LJ, and Orkut]
Conclusion
Simple multicore algorithms for triangle computations are
provably work-efficient, low-depth and cache-friendly
Implementations require no load-balancing or tuning for
cache
Experimentally outperform existing multicore and
distributed algorithms
Future work: Design a practical parallel algorithm
achieving O(E^1.5/(M^0.5 B)) cache complexity
Algorithm            Work             Depth       Cache Complexity
TC-Merge             O(E^1.5)         O(log^2 E)  O(E + E^1.5/B)
TC-Hash              O(V log V + αE)  O(log^2 E)  O(sort(V) + αE)
Par. Pagh-Silvestri  O(E^1.5)         O(log^3 E)  O(E^1.5/(M^0.5 B))