Multicore Triangle Computations
Without Tuning
Julian Shun and Kanat Tangwongsan
Presentation based on the paper published in the International
Conference on Data Engineering (ICDE), 2015
Triangle Computations
• Triangle counting
• Other variants:
  • Triangle listing
  • Local triangle counting/clustering coefficients
  • Triangle enumeration
  • Approximate counting
  • Analogs on directed graphs
• Numerous applications: social network analysis, Web structure, spam detection, outlier detection, dense subgraph mining, 3-way database joins, etc.
[Figure: example friendship graph on Alice, Bob, Carol, David, Eve, Fred, Greg, and Hannah, with its three triangles checked off; Count = 3]
Need fast triangle computation algorithms!
Sequential Triangle Computation Algorithms (V = # vertices, E = # edges)
• Sequential algorithms for exact counting/listing:
  • Naïve algorithm trying all triplets: O(V^3) work
  • Node-iterator algorithm [Schank]: O(VE) work
  • Edge-iterator algorithm [Itai-Rodeh]: O(VE) work
  • Tree-lister [Itai-Rodeh], forward/compact-forward [Schank-Wagner, Latapy]: O(E^1.5) work
• Sequential algorithms via matrix multiplication:
  • O(V^2.37) work: compute A^3, where A is the adjacency matrix
  • O(E^1.41) work [Alon-Yuster-Zwick]
  • These require superlinear space
Sequential Triangle Computation Algorithms
[Table of sequential running-time bounds omitted; source: "Algorithmic Aspects of Triangle-Based Network Analysis", Dissertation by Thomas Schank]
What about parallel algorithms?

Parallel Triangle Computation Algorithms
• Most are designed for distributed memory:
  • MapReduce algorithms [Cohen '09, Suri-Vassilvitskii '11, Park-Chung '13, Park et al. '14]
  • MPI algorithms [Arifuzzaman et al. '13, GraphLab]
• What about shared-memory multicore? Multicores are everywhere!
• Node-iterator algorithm [Green et al. '14]: O(VE) work in the worst case
• Can we obtain an O(E^1.5)-work shared-memory multicore algorithm?

Triangle Computation: Challenges for Shared-Memory Machines
1. Irregular computation
2. Deep memory hierarchy

Cache Complexity Model / External Memory Model
[Figure: a CPU backed by a cache (or main memory) of size M in front of main memory (or disk); transferring a line of size B costs one unit, while accesses served from the cache are free]
• Complexity = # cache misses (or # disk accesses in the external memory model)
• Cache-aware (external-memory) algorithms: have knowledge of M and B
• Cache-oblivious algorithms: no knowledge of the parameters
Cache-Oblivious Model [Frigo et al. '99]
[Figure: a CPU with a cache of size M and block size B in front of main memory; and a multi-level hierarchy with L1/L2/L3 caches of sizes M1/M2/M3 and block sizes B1/B2/B3]
• Algorithm works well regardless of cache parameters
• Works well on multi-level hierarchies
• Parallel Cache-Oblivious Model for hierarchies of shared and private caches [Blelloch et al. '11]
Primitive          | Work       | Depth      | Cache Complexity
Scan/filter/merge  | O(n)       | O(log n)   | O(n/B)
Sort               | O(n log n) | O(log^2 n) | O((n/B) log_{M/B}(n/B))
External-Memory and Cache-Oblivious Triangle Computation
• All previous algorithms are sequential
• External-memory (cache-aware) algorithms:
  • Natural join: O(E^3/(M^2 B)) I/Os
  • Node-iterator [Dementiev '06]: O((E^1.5/B) log_{M/B}(E/B)) I/Os
  • Compact-forward [Menegola '10]: O(E + E^1.5/B) I/Os
  • [Chu-Cheng '11, Hu et al. '13]: O(E^2/(MB) + #triangles/B) I/Os
• External-memory and cache-oblivious:
  • [Pagh-Silvestri '14]: O(E^1.5/(M^0.5 B)) I/Os or cache misses
• Parallel cache-oblivious algorithms?
Our Contributions
1. Parallel cache-oblivious triangle counting algorithms:

Algorithm           | Work            | Depth      | Cache Complexity
TC-Merge            | O(E^1.5)        | O(log^2 E) | O(E + E^1.5/B)
TC-Hash             | O(V log V + αE) | O(log^2 E) | O(sort(V) + αE)
Par. Pagh-Silvestri | O(E^1.5)        | O(log^3 E) | O(E^1.5/(M^0.5 B))

2. Extensive experimental study
3. Extensions to other triangle computations: enumeration, listing, local counting/clustering coefficients, approximate counting, variants on directed graphs

(V = # vertices, E = # edges, α = arboricity (at most E^0.5), M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B))
Sequential Triangle Counting (Exact)
(Forward/compact-forward algorithm [Schank-Wagner '05, Latapy '08])
[Figure: example graph on vertices 0-4]
1. Rank vertices by degree (sorting); compute A+[v] for all v, storing only higher-ranked neighbors
2. for each vertex v:
     for each w in A+[v]:
       count += intersect(A+[v], A+[w])
• Work = O(E^1.5)
• Finds each triangle (v, w, x) with rank(v) < rank(w) < rank(x) exactly once
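To make the two steps above concrete, here is a minimal sequential C++ sketch of the forward/compact-forward algorithm. The adjacency-list representation and all identifiers are illustrative assumptions, not the authors' code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sequential forward/compact-forward triangle counting.
// adj[v] = neighbors of v in an undirected simple graph.
int64_t count_triangles(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());

  // Step 1: rank vertices by degree (ties broken by vertex id).
  std::vector<int> order(n), rank(n);
  for (int v = 0; v < n; v++) order[v] = v;
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    if (adj[a].size() != adj[b].size()) return adj[a].size() < adj[b].size();
    return a < b;
  });
  for (int i = 0; i < n; i++) rank[order[i]] = i;

  // Build A+[v]: only higher-ranked neighbors, sorted by rank.
  std::vector<std::vector<int>> plus(n);
  for (int v = 0; v < n; v++) {
    for (int w : adj[v])
      if (rank[w] > rank[v]) plus[v].push_back(w);
    std::sort(plus[v].begin(), plus[v].end(),
              [&](int a, int b) { return rank[a] < rank[b]; });
  }

  // Step 2: for each directed edge (v, w), count |A+[v] ∩ A+[w]|
  // by merging the two rank-sorted lists.
  int64_t count = 0;
  for (int v = 0; v < n; v++)
    for (int w : plus[v]) {
      size_t i = 0, j = 0;
      while (i < plus[v].size() && j < plus[w].size()) {
        if (plus[v][i] == plus[w][j]) { count++; i++; j++; }
        else if (rank[plus[v][i]] < rank[plus[w][j]]) i++;
        else j++;
      }
    }
  return count;  // each triangle is found exactly once
}
```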
Proof of the O(E^1.5) work bound when intersect uses merging
• Step 1 (ranking and filtering): O(E + V log V) work
• Step 2:
  • For each directed edge (v, w), intersect does O(d+(v) + d+(w)) work, where d+(v) = |A+[v]|
  • For all v, d+(v) = O(E^0.5): if d+(v) > E^0.5, then each of v's more than E^0.5 higher-ranked neighbors also has degree > E^0.5, so the total number of edges would exceed E, a contradiction (see the precise version below)
  • Total work = E · O(E^0.5) = O(E^1.5)
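The degree bound can be made fully precise with the following standard counting argument; the explicit √(2E) bookkeeping is my addition, consistent with the slide's O(E^0.5) claim:

\[
\text{Suppose } d^+(v) > \sqrt{2E}. \text{ Since ranks follow degrees, every higher-ranked neighbor } w \text{ of } v \text{ satisfies } d(w) \ge d(v) \ge d^+(v),
\]
\[
\text{so } \sum_{w \in A^+[v]} d(w) \;\ge\; d^+(v) \cdot d^+(v) \;>\; 2E \;=\; \sum_{u \in V} d(u),
\]
\[
\text{a contradiction. Hence } d^+(v) \le \sqrt{2E} = O(E^{0.5}) \text{ for every } v.
\]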
Parallel Triangle Counting (Exact)
1. Rank vertices by degree (parallel sort and filter); compute A+[v] for all v, storing higher-ranked neighbors
2. parallel_for each vertex v:
     parallel_for each w in A+[v]:
       count += intersect(A+[v], A+[w])   (combined with a parallel reduction)
   intersect = parallel merge (TC-Merge) or parallel hash table (TC-Hash)

Step 1: Work = O(E + V log V), Depth = O(log^2 V), Cache = O(E + sort(V))

[Figure: the nested parfor over v ∈ V and w ∈ A+[v] spawns all intersect calls, e.g. intersect(A+[0], A+[1]), intersect(A+[0], A+[3]), intersect(A+[2], A+[1]), ..., which are independent and safe to run all in parallel; their results are summed by a parallel reduction]
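The paper's implementations use Intel Cilk Plus with nested parallel loops and a parallel reduction. As a rough stand-in, here is a sketch of step 2 using OpenMP instead, parallelizing only the outer loop and combining the counts with an OpenMP reduction; `plus` and `rank` are assumed to be precomputed as in the sequential sketch.

```cpp
#include <cstdint>
#include <vector>

// Sketch of step 2 of the parallel algorithm; compile with -fopenmp.
// OpenMP is used here as a stand-in for the authors' Cilk Plus code.
int64_t parallel_count(const std::vector<std::vector<int>>& plus,
                       const std::vector<int>& rank) {
  const int n = static_cast<int>(plus.size());
  int64_t count = 0;
  // Every intersect call reads shared, immutable data, so all calls are
  // safe to run in parallel; the += is combined by the reduction clause.
  #pragma omp parallel for schedule(dynamic) reduction(+ : count)
  for (int v = 0; v < n; v++) {
    for (int w : plus[v]) {
      size_t i = 0, j = 0;  // merge-based intersect (TC-Merge style)
      while (i < plus[v].size() && j < plus[w].size()) {
        if (plus[v][i] == plus[w][j]) { count++; i++; j++; }
        else if (rank[plus[v][i]] < rank[plus[w][j]]) i++;
        else j++;
      }
    }
  }
  return count;
}
```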
TC-Merge and TC-Hash Details
Step 2: parallel_for each vertex v: parallel_for each w in A+[v]: count += intersect(A+[v], A+[w]) (parallel reduction)

• TC-Merge:
  • Preprocessing: sort adjacency lists
  • Intersect: use a parallel and cache-oblivious merge based on divide-and-conquer [Blelloch et al. '11]
  • Step 2 bounds: Work = O(E^1.5), Depth = O(log^2 E), Cache = O(E + E^1.5/B)
• TC-Hash:
  • Preprocessing: for each vertex, create a parallel hash table storing its edges [Shun-Blelloch '14]
  • Intersect: scan the smaller list, querying the hash table of the larger list in parallel (see the sketch below)
  • Step 2 bounds: Work = O(αE), Depth = O(log E), Cache = O(αE), where α = arboricity (at most E^0.5)
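A tiny illustration of TC-Hash's intersect step, with std::unordered_set standing in for the parallel hash tables of [Shun-Blelloch '14]; in the real algorithm the queries themselves run in parallel.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hash-based intersect: scan the smaller adjacency list and query the
// hash table of the larger one, so the work is proportional to the
// smaller list's length (up to the hash-table query cost).
int64_t hash_intersect(const std::vector<int>& small_list,
                       const std::unordered_set<int>& large_table) {
  int64_t matches = 0;
  for (int x : small_list)   // in TC-Hash these queries run in parallel
    if (large_table.count(x)) matches++;
  return matches;
}
```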
Comparison of Complexity Bounds

Algorithm                    | Work                    | Depth      | Cache Complexity
TC-Merge                     | O(E^1.5)                | O(log^2 E) | O(E + E^1.5/B) (oblivious)
TC-Hash                      | O(V log V + αE)         | O(log^2 E) | O(sort(V) + αE) (oblivious)
Par. Pagh-Silvestri          | O(E^1.5)                | O(log^3 E) | O(E^1.5/(M^0.5 B)) (oblivious)
Chu-Cheng '11, Hu et al. '13 | O(E log E + E^2/M + αE) | sequential | O(E^2/(MB) + #triangles/B) (aware)
Pagh-Silvestri '14           | O(E^1.5)                | sequential | O(E^1.5/(M^0.5 B)) (oblivious)
Green et al. '14             | O(VE)                   | O(log E)   | n/a

(V = # vertices, E = # edges, α = arboricity (at most E^0.5), M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B))
Our Contributions (recap)
Next: 2. Extensions to other triangle computations: enumeration, listing, local counting/clustering coefficients, approximate counting, variants on directed graphs
Extensions of Exact Counting Algorithms
• Triangle enumeration: call an emit function whenever a triangle is found
  • Listing: add each triangle to a hash table; return the contents at the end
  • Local counting/clustering coefficients: atomically increment the counts of the three triangle endpoints (see the sketch after this list)
• Directed triangle counting/enumeration: keep separate counts for the different types of directed triangles
• Approximate counting:
  • Use the colorful triangle sampling scheme to create a smaller subgraph [Pagh-Tsourakakis '12]
  • Run TC-Merge or TC-Hash on the subgraph with pE expected edges (0 < p < 1) and return #triangles/p^2 as the estimate
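For the local-counting extension, a minimal sketch of the emit function: per-vertex counters stored as a vector of atomics (an illustrative choice, not the authors' data structure), so concurrent emits from parallel intersect calls remain correct.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Emit function for local triangle counting: atomically increment the
// triangle count of each of the three endpoints, keeping the counts
// correct when many intersect calls emit triangles in parallel.
void emit_triangle(int u, int v, int w,
                   std::vector<std::atomic<int64_t>>& local_count) {
  local_count[u].fetch_add(1, std::memory_order_relaxed);
  local_count[v].fetch_add(1, std::memory_order_relaxed);
  local_count[w].fetch_add(1, std::memory_order_relaxed);
}
```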
Approximate Counting
• Colorful triangle counting [Pagh-Tsourakakis '12], sampling rate 0 < p < 1:
  1. Assign a random color in {1, ..., 1/p} to each vertex (parallel scan)
  2. Sampling: keep edges whose endpoints have the same color (parallel filter); expected # edges = pE
  3. Run exact triangle counting (TC-Merge or TC-Hash) on the sampled graph, and return Δ_sampled/p^2

• Steps 1 & 2: Work = O(E), Depth = O(log E), Cache = O(E/B)
• Step 3 with TC-Merge: Work = O((pE)^1.5), Depth = O(log^2 E), Cache = O(pE + (pE)^1.5/B)
• Step 3 with TC-Hash: Work = O(V log V + αpE), Depth = O(log E), Cache = O(sort(V) + pαE)
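A compact sequential sketch of the colorful sampling scheme (steps 1-3). Here count_triangles_on is an assumed helper standing in for TC-Merge or TC-Hash on an edge list; all identifiers are illustrative.

```cpp
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Colorful triangle sampling [Pagh-Tsourakakis '12]: color vertices
// uniformly at random with 1/p colors, keep monochromatic edges, count
// triangles exactly on the sample, and scale the result by 1/p^2.
double approx_triangles(int n, const std::vector<std::pair<int, int>>& edges,
                        double p, uint32_t seed,
                        int64_t (*count_triangles_on)(
                            int, const std::vector<std::pair<int, int>>&)) {
  const int num_colors = static_cast<int>(1.0 / p);
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> color_dist(0, num_colors - 1);
  std::vector<int> color(n);
  for (int v = 0; v < n; v++) color[v] = color_dist(gen);

  std::vector<std::pair<int, int>> sampled;  // monochromatic edges only
  for (auto [u, v] : edges)
    if (color[u] == color[v]) sampled.push_back({u, v});

  // A triangle survives iff all three vertices share a color, which
  // happens with probability p^2, so divide the sampled count by p^2.
  return count_triangles_on(n, sampled) / (p * p);
}
```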
Our Contributions (recap)
Next: 3. Extensive experimental study
Experimental Setup
• Implementations written in Intel Cilk Plus
• 40-core Intel Nehalem machine (with 2-way hyper-threading)
  • 4 sockets, each with a 30MB shared L3 cache and 256KB private L2 caches
• Sequential TC-Merge as the baseline (faster than existing sequential implementations)
• Other multicore implementations compared: Green et al. and GraphLab
  • Our parallel Pagh-Silvestri algorithm was not competitive
• Variety of real-world and artificial graphs
Scalability
Both TC-Merge and TC-Hash scale well with the number of cores:
[Figure: self-relative speedup curves on LiveJournal (4M vertices, 34.6M edges) and Orkut (3M vertices, 117M edges), reaching roughly 27x and 48x]

40-core (with hyper-threading) Performance
[Figure: speedup over sequential TC-Merge (up to 50x) for TC-Merge, TC-Hash, Green et al., and GraphLab on random (V=100M, E=500M), rMat (V=134M, E=500M), 3D-grid (V=100M, E=300M), soc-LJ (V=5M, E=43M), Patents (V=3.7M, E=17M), com-LJ (V=4M, E=35M), and Orkut (V=3M, E=117M)]
• TC-Merge is always faster than TC-Hash (by 1.3-2.5x)
• TC-Merge is always faster than Green et al. and GraphLab (by 2.1-5.2x)
Why is TC-Merge faster than TC-Hash?
[Figure: running time, L3 cache misses, L2 cache misses, and # of operations for intersect, normalized to TC-Merge, for TC-Merge and TC-Hash on soc-LJ and Orkut]
• TC-Hash is less cache-efficient than TC-Merge
• Running time is more correlated with cache misses than with work
Comparison to Existing Counting Algorithms
Twitter graph (41M vertices, 1.2B undirected edges, 34.8B triangles):
[Figure: running times in minutes for TC-Merge (40 cores), GraphLab (40 cores), GraphLab (MPI, 64 nodes, 1024 cores), PATRIC (MPI, 1200 cores), Park and Chung (MapReduce, 47 nodes; 213 minutes), and Suri and Vassilvitskii (MapReduce, 1636 nodes; 423 minutes)]
• Yahoo graph (1.4B vertices, 6.4B edges, 85.8B triangles) on 40 cores: TC-Merge takes 78 seconds
  • The approximate counting algorithm achieves 99.6% accuracy in 9.1 seconds
Approximate Counting Results
With sampling rate p = 1/25:

Graph                   | Accuracy | T_approx  | T_approx/T_exact
Orkut (V=3M, E=117M)    | 99.8%    | 0.067 sec | 0.035
Twitter (V=41M, E=1.2B) | 99.9%    | 2.4 sec   | 0.043
Yahoo (V=1.4B, E=6.4B)  | 99.6%    | 9.1 sec   | 0.117

[Figure: T_approx/T_exact (0 to 0.5) as a function of p (0 to 0.5) on soc-LJ, com-LJ, and Orkut]
Conclusion
• Simple multicore algorithms for triangle computations are provably work-efficient, low-depth, and cache-friendly
• The implementations require no load balancing or tuning for the cache
• They experimentally outperform existing multicore and distributed algorithms
• Future work: design a practical parallel algorithm achieving O(E^1.5/(M^0.5 B)) cache complexity

Algorithm           | Work            | Depth      | Cache Complexity
TC-Merge            | O(E^1.5)        | O(log^2 E) | O(E + E^1.5/B)
TC-Hash             | O(V log V + αE) | O(log^2 E) | O(sort(V) + αE)
Par. Pagh-Silvestri | O(E^1.5)        | O(log^3 E) | O(E^1.5/(M^0.5 B))