Bring Orders into Uncertainty: Enabling Efficient Uncertain Graph Processing via Novel Path Sampling on Multi-Accelerator Systems

Heng Zhang∗

Institute of Software, CAS

FSA lab, University of Sydney

Lingda Li

Brookhaven National Laboratory

Hang Liu

Stevens Institute of Technology

Donglin Zhuang

FSA lab, University of Sydney

Rui Liu

University of Chicago

Chengying Huan

Tsinghua University

Shuang Song

Meta

Dingwen Tao

Washington State University

Yongchao Liu

Ant Financial

Charles He

Ant Financial

Yanjun Wu

Institute of Software, CAS

Shuaiwen Leon Song†

FSA lab, University of Sydney

Abstract

Uncertain or probabilistic graphs have been ubiquitously used to represent noisy, incomplete, and inaccurate linked data in many emerging big-data mining and analytics applications. It is impractical to solve uncertain graph problems exactly, as doing so requires evaluating an exponential number of certain instances (or “possible worlds”) generated from an uncertain graph. Previously, several CPU-based techniques were proposed to use sampling for uncertain graph processing. However, we observe that (1) they suffer from low computation efficiency and large memory overhead due to unnecessary edge sampling at runtime; (2) they cannot leverage the massive parallelism provided by modern general-purpose accelerators; and (3) a general programming framework for high-performance uncertain graph processing is lacking. To tackle these challenges, we propose a novel runtime path sampling method, which identifies and eliminates unnecessary edge sampling via incremental path identification and filtering, resulting in a significant reduction in computation and data movement. Centered around this idea, we introduce a general uncertain graph processing framework for multi-GPU systems, named BPGraph¹. BPGraph provides general support for users to design and optimize a wide range of uncertain graph algorithms and applications without worrying about the underlying complexity. Extensive evaluation on a variety of real-world uncertain graph applications demonstrates an average speedup of 26× (up to 43×) and better scalability of BPGraph over the state-of-the-art frameworks.

CCS Concepts

• Computing methodologies → Parallel computing methodologies.

∗ This work was conducted during Heng’s visit to the FSA lab at the University of Sydney.

† Corresponding author.

¹ Beta version can be found at https://github.com/bpgraph/bpgraph

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

ICS ’22, June 28–30, 2022, Virtual Event, USA

©2022 Copyright held by the owner/author(s).

ACM ISBN 978-1-4503-9281-5/22/06.

https://doi.org/10.1145/3524059.3532379

Keywords

Uncertain Graph; Sampling; GPU; Performance

ACM Reference Format:

Heng Zhang, Lingda Li, Hang Liu, Donglin Zhuang, Rui Liu, Chengying Huan, Shuang Song, Dingwen Tao, Yongchao Liu, Charles He, Yanjun Wu, and Shuaiwen Leon Song. 2022. Bring Orders into Uncertainty: Enabling Efficient Uncertain Graph Processing via Novel Path Sampling on Multi-Accelerator Systems. In 2022 International Conference on Supercomputing (ICS ’22), June 28–30, 2022, Virtual Event, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3524059.3532379

1 Introduction

The use of large-scale graph-structured data has exploded in scientific, data mining, and analytics applications in recent years. Several community efforts [14, 28, 38, 45, 53, 54, 59] have been made to efficiently process and analyze these data by exploiting high-performance accelerators under a heterogeneous scale-up and scale-out setup, which has become the mainstream node architecture for Top500 supercomputers. Among these important graph analytics, uncertainty is often intrinsic to a wide spectrum of graph applications. It applies to graph data such as noisy measurements of inter-node connections in supercomputing centers [38, 55], database querying [7, 12, 25, 26, 29], probabilities in peer-to-peer networks [25], bioinformatics [3, 26, 42], relationship influence in social networks [2, 10, 11], congestion prediction in traffic networks [24], etc. In the literature, uncertain graphs (also known as probabilistic graphs) have been widely utilized to represent these uncertainties [5, 47]. In an uncertain graph, the existence of a connection between two nodes is assumed to be independently indeterminate, and is formulated as a probabilistic edge that is assigned an uncertainty value. Figure 1 illustrates a sensor network, which encodes the network connectivity probabilities into edge uncertainties. After instantiating the uncertainty of each edge, a set of “possible worlds” is generated to represent all possible instances of the uncertain graph and their probabilities. For instance, for the bottom rightmost possible world in Figure 1(b), which represents the case where all 3 edges exist, its probability is calculated by multiplying the connection probabilities of all the edges, i.e., $0.8 \times 0.5 \times 0.3 = 0.12$. The number of possible worlds equals $2^{|E|}$, where $|E|$ is the number of edges.

Conventionally, finding the exact solution of an uncertain graph problem requires iterating through all its possible worlds. As the number of possible worlds grows exponentially with the number of edges, it is often unrealistic to get an exact solution. For example, computing the probability that there is a path between two vertices on an uncertain graph is #P-hard [26, 47]. The rapidly increasing scale of data in these applications has hindered many efforts to seek efficient algorithms for traversing, processing, and mining large-scale uncertain graphs.

[Figure 1: An example of a sensor network with its (a) uncertain graph representation and (b) eight “possible worlds”. Panel (a), “Sensor Packet Delivery Network”, shows sensors S0, S1, and S2 with connection probabilities p(e1) = 0.8, p(e2) = 0.5, and p(e3) = 0.3. Panel (b), “Possible World Graphs”, labels the eight worlds with probabilities 0.07, 0.28, 0.28, 0.03, 0.03, 0.07, 0.12, and 0.12.]

Existing Approaches and Limitations. Previous works on uncertain graph analysis have sought sampling-based methods to find approximate solutions on uncertain graphs [7, 13, 17, 37, 47]. These methods sample the entire uncertain graph or a part of it to obtain possible world samples. The theory behind these methods is based on the assumption that, with a reasonable number of samples, an approximate solution of the original uncertain graph can be estimated with a certain accuracy guarantee. However, there are three main limitations to applying the existing approaches in HPC environments.

First, unnecessary edge sampling results in poor computation and memory efficiency. Our evaluation has shown that the existing approaches suffer from significant performance and memory overhead, especially for large uncertain graphs. Detailed analysis reveals that the main factor behind these overheads is the large number of unnecessary edges sampled during the execution.

Second, they lack support for modern heterogeneous HPC and datacenter architectures, which commonly integrate one or more accelerators (e.g., GPUs). To the best of our knowledge, existing techniques are all built on CPU-based systems. Packed with massive parallelism and high-bandwidth memory, modern GPUs are attractive accelerators for uncertain graph processing [15, 16, 23, 40, 41, 54]. Several works have been proposed for deterministic graph processing on multiple GPUs [4, 27, 46, 59]. However, these existing systems cannot handle the probabilistic nature of uncertain graphs.

Third, they lack a programming API for users to write efficient uncertain graph applications. Previous uncertain graph processing solutions only provide ad-hoc optimizations for one or a few applications, but fail to provide a general programmable interface for users to effectively implement a wide range of applications. It is unwise to implement different processing strategies and optimizations for different uncertain graph applications that may share many fundamental features. Thus, an efficient, scalable, and programmable uncertain graph processing framework is desirable.

Approach and Contributions. To eliminate unnecessary edge sampling and improve computation and memory efficiency, we propose a novel path sampling method. It effectively identifies unnecessary edges prior to sampling, and only considers edges that are on a path between the source and target as useful (Section 3). Centered around this sampling strategy, we present BPGraph, an efficient, scalable, and programmable uncertain graph processing framework that effectively implements path sampling on multi-GPU based HPC systems (Section 4) and scales up heterogeneous node-level computation efficiency. BPGraph provides a general programming interface so that users can implement various uncertain graph processing applications with ease. Additionally, it optimizes the data organization and computation patterns to better map uncertain graph processing onto GPUs. To the best of our knowledge, BPGraph is the first system design to provide general support for developing and optimizing uncertain graph analytics on multi-GPU based heterogeneous architectures.

Extensive experiments are conducted on eight real-world uncertain graphs to evaluate BPGraph, and the results demonstrate that BPGraph achieves up to a 43× speedup (26× on average) over the state-of-the-art approaches (i.e., ProbTree [37] and BitEdge-Sampling [69]). BPGraph also scales well with the number of GPUs.

2 Background & Motivation

2.1 Uncertain Graph Basics

First, we give a definition of an uncertain graph as follows.

Definition 1. (Uncertain Graph) Let $G = (V, E)$ be a deterministic graph, where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges among vertices. An uncertain graph is defined as a triple $\mathcal{G} = (V, E, P)$, where $P$ is a function on edges. For any $e \in E$, $P(e)$ represents the existence probability of $e$. It is obvious that $0 < P(e) \le 1$.

We refer to $G$ as the corresponding deterministic graph of $\mathcal{G}$. Clearly, $G$ is a special uncertain graph, where $P(e) = 1$ for any $e$. Note that the edge probabilities are independent of each other, following the previous literature. The numbers of vertices and edges in $\mathcal{G}$ or $G$ are denoted by $|V|$ and $|E|$, respectively. It is also worth noting that the probability function $P$ is different from the edge weights of deterministic graphs. Without loss of generality, we assume all edges have the same weight 1 in this paper. To solve uncertain graph problems, we introduce:

Definition 2. (Possible World) By instantiating an uncertain graph, we denote a possible world $G' = (V, E')$ as a certain instance of the uncertain graph $\mathcal{G}$, i.e., $G' \sqsubseteq \mathcal{G}$. The edge set $E' \subseteq E$ is obtained by executing independent sampling operations on $E$, following the probability function $P$. Thus, each uncertain graph $\mathcal{G}$ yields $2^{|E|}$ possible worlds, based on which edges are selected. In particular, the probability of observing a possible world graph $G'$ is calculated by multiplying the probabilities that every edge gets selected or unselected:

$$\Pr(G') = \prod_{e \in E'} P(e) \prod_{e \in E \setminus E'} \bigl(1 - P(e)\bigr)$$
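As a small illustration of Definition 2, the following Python sketch enumerates every possible world of a three-edge uncertain graph and evaluates $\Pr(G')$ for each. The edge probabilities are those of Figure 1; the function name `possible_worlds` and the edge labels are assumptions made for this example only.

```python
from itertools import product

def possible_worlds(edge_probs):
    """Yield (present_edges, probability) for each of the 2^|E| possible worlds."""
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        for edge, present in zip(edges, mask):
            p = edge_probs[edge]
            # Selected edges contribute P(e), unselected ones 1 - P(e).
            prob *= p if present else (1.0 - p)
        world = frozenset(e for e, present in zip(edges, mask) if present)
        yield world, prob

# Edge probabilities from Figure 1; the keys are only labels.
fig1 = {"e1": 0.8, "e2": 0.5, "e3": 0.3}
worlds = dict(possible_worlds(fig1))
```

The eight probabilities sum to 1, and the world containing all three edges has probability $0.8 \times 0.5 \times 0.3 = 0.12$, matching the example in Section 1.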

Many probabilistic problems in data analytics, machine learning, and many other areas employ uncertain graphs to model the inaccurate relationships in datasets of interest. Here, we introduce reliability as an application example of uncertain graphs. A wide variety of applications, e.g., network routing [43], network detection [39, 52], route planning [24], and web crawling [2, 50], can benefit from high-performance reliability computation. Given two arbitrary vertices $s$ and $t$ in an uncertain graph, there are four typical variations of $s$-$t$ reliability:

(1) Reachability [7]. Compute the probability that $t$ is reachable from $s$.

(2) Distance-Constraint Reachability [25]. Given a probability threshold $\delta \in (0, 1)$, return a set of distances $D(s \to t)$ and the corresponding reachability probabilities $P(s \to t)$, with the constraint that $P(s \to t) \ge \delta$.

(3) Expected Shortest Distance [63, 66]. On top of the distance-constraint reachability, find the shortest path from $s$ to $t$.

(4) User-Defined-Constraint Reachability [26, 70]. On top of the distance-constraint reachability, find eligible paths based on a user-defined constraint. E.g., users may want to filter out paths that are too long.

[Figure 2: Execution flow of possible world sampling using the toy graph in (a). Panels: (a) a toy uncertain graph with probabilities; (b) sample K = 3 possible worlds (G1, G2, G3) with P(E); (c) traverse each possible world and generate paths; (d) evaluate paths with reliability in possible worlds. For the s-t pair 0-6, panel (d) lists distance 3 with reliability 2/3, distance 5 with reliability 1/3, and distance 2 with reliability 1/3.]

In order to compute the exact reliability of an uncertain graph $\mathcal{G}$, we need to iterate through all its possible worlds. Then the reliability of $\mathcal{G}$ can be derived from the reachability between $s$ and $t$ in each individual possible world. Recall that there are $2^{|E|}$ possible worlds in total, making this reliability computation method impractical. Section 2.2 will introduce how previous work uses sampling to find approximate solutions for uncertain graph problems.
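To make the cost of the exact approach concrete, here is a minimal Python sketch that computes $s$-$t$ reachability by brute-force enumeration of all $2^{|E|}$ possible worlds. The three-edge graph `toy` and the helper names are hypothetical and chosen only so the result can be checked by hand; the loop is exponential in the number of edges, which is exactly why it does not scale.

```python
from itertools import product

def reachable(edges, s, t):
    """Depth-first search on a deterministic directed edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def exact_reliability(edge_probs, s, t):
    """Sum Pr(G') over every possible world G' in which t is reachable from s."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([0, 1], repeat=len(edges)):  # 2^|E| iterations
        prob, world = 1.0, []
        for e, bit in zip(edges, mask):
            prob *= edge_probs[e] if bit else 1.0 - edge_probs[e]
            if bit:
                world.append(e)
        if reachable(world, s, t):
            total += prob
    return total

# Hypothetical graph: a direct edge 0->2 plus a two-hop detour 0->1->2.
toy = {(0, 1): 0.8, (1, 2): 0.5, (0, 2): 0.3}
```

For this graph the result can be verified analytically: $1 - (1 - 0.3)(1 - 0.8 \times 0.5) = 0.58$.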

2.2 Uncertain Graph Sampling

As stated above, it is computationally extremely expensive to get the exact solution of an uncertain graph problem, which requires enumerating every possible world. In practice, researchers propose to use sampling methods to get approximate solutions. By solving the problem on randomly selected samples of an uncertain graph and averaging the solutions over them, an approximate solution for the uncertain graph is obtained. Based on the sampling granularity, existing work can be classified as either 1) entirety sampling or 2) partition sampling.

Entirety Sampling. These methods, e.g., Monte Carlo sampling [48, 69] and recursive sampling [25, 35], randomly sample a certain number of possible worlds from the entire uncertain graph.

Figure 2 illustrates an example of using entirety sampling to solve distance-constraint reachability between $V_0$ and $V_6$. Assume three possible worlds are sampled from the uncertain graph $\mathcal{G}$ based on the independent edge probabilities (Figure 2(b)). After that, we traverse each possible world to find all paths from $V_0$ to $V_6$, as shown in Figure 2(c). The distances of these paths, which equal their hop counts, are also calculated. Finally, the distance-constraint reachability between $V_0$ and $V_6$ is summarized by combining the distances of all paths found. E.g., 2 of the 3 possible worlds ($G_1$ and $G_2$) include paths with distance = 3, and thus the reliability of distance = 3 is calculated as 2/3.

As we will discuss in Section 2.3, entirety sampling results in a lot of redundant sampling overhead due to the similarity between possible worlds, which further causes significant computation and memory overhead.
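The entirety sampling flow described above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation of any cited method: it records only the shortest hop distance found in each sampled world (instead of every path), and the names `sample_world`, `hop_distance`, and `entirety_sampling` are assumptions of this example.

```python
import random
from collections import Counter, deque

def sample_world(edge_probs, rng):
    """Sample every edge of the graph once, whether or not it can matter."""
    return [e for e, p in edge_probs.items() if rng.random() < p]

def hop_distance(edges, s, t):
    """Breadth-first search; returns the hop count from s to t or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist, queue = {s: 0}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def entirety_sampling(edge_probs, s, t, K, seed=0):
    """Estimate, per distance, the fraction of K sampled worlds that reach t."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(K):
        d = hop_distance(sample_world(edge_probs, rng), s, t)
        if d is not None:
            hits[d] += 1
    return {d: count / K for d, count in hits.items()}
```

Note that `sample_world` performs $|E|$ random draws per world, so the total sampling work is $K \times |E|$ regardless of how many edges can actually lie between the source and the target.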

[Figure 3: Performance analysis of uncertain graph processing. We illustrate the results from evaluating a state-of-the-art Monte-Carlo sampling method [69]. Panel (a), “Execution Breakdown”, plots the share (%) of evaluation cost, traversal cost, and sample cost for the netHept, gnutellaP2P, coauthor, kron-logn20, and kron-logn21 datasets, with bar labels 90.23%, 89.36%, 91.59%, 98.17%, and 97.98%. Panel (b), “Memory Usage (MB)”, compares Raw Size, DistR, and ProbTree on the same datasets.]

Partition Sampling. Instead of sampling the entire uncertain graph to generate possible worlds, partition sampling methods break down the graph into partitions and try to prune useless partitions [8, 37, 62]. These methods organize the mutual dependency between partitions in a tree index structure. Given a reachability query, they use the tree index structure to find all partitions that are on the way from the specific source to the target vertices. Then, a subgraph $G_q$ is created by combining these relevant partitions, and sampling is performed on $G_q$ instead of the entire uncertain graph. In this way, partition sampling does not sample irrelevant partitions and thus reduces the workload.

The major drawbacks of these methods are that 1) they still perform unnecessary edge sampling because there are useless edges within useful partitions, 2) they require building the index tree, which is very time-consuming, and 3) maintaining the large, redundant index tree in memory is difficult and costly, which is unacceptable for GPU platforms with limited memory capacity.

2.3 Challenges & Opportunities

Unnecessary Edge Sampling. Figure 3(a) shows the execution time breakdown of an optimized Monte-Carlo sampling method [69], a state-of-the-art entirety sampling based method, running on a 20-core Intel(R) Xeon(R) CPU E5-2698 (512GB memory, detailed configuration in Section 5). The experimental results include the sampling cost (Figure 2(b)), the traversal cost (Figure 2(c)), and the final evaluation cost (Figure 2(d)). It illustrates that more than 90% of the time is spent on sampling. For larger-scale uncertain graphs, the cost of sampling becomes even more significant.

[Figure 4: The workflow of our proposed path sampling. Panels: (a) breadth-first order traversal tree; (b) path identification for the s-t pair 0-6 (path0 to path8, marked as primary or secondary); (c) edge sampling; (d) path sampling & constraints; (e) sampled paths with constraints; (f) reliability results. Panel (d) lists the status bits and the constraint value (σ > 0.5) of each path: path0 001, 0.6; path1 101, 0.7; path2 100, 0.3; path3 000, 0.8; path4 100, 0.7; path5 000, 0.2; path6 000, 0.4; path7 010, 0.1; path8 000, 0.8. Panel (f) lists the distance-reliability pairs (2, 1/3), (3, 2/3), and (4, 1/3).]

The reason why the sampling phase takes so much time is that existing methods sample many useless edges which do not contribute to the results. On one hand, possible world sampling needs to sample every edge of the uncertain graph, no matter whether it can be on a path from the source to the target. For instance, in Figure 2(a), it is easy to see that none of the paths from $V_0$ to $V_6$ goes through edges $V_1 \to V_7$ and $V_6 \to V_7$. On the other hand, although partition sampling does not need to sample edges in the pruned partitions, there are still useless edges in the useful partitions. As a result, partition sampling cannot eliminate unnecessary edge sampling within the partitions.
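One simple way to see which edges are useless for a given $s$-$t$ query is a forward/backward reachability check, sketched below in Python. This is an illustrative stand-in, not the path identification step of BPGraph: it keeps an edge $(u, v)$ when $u$ is reachable from $s$ and $t$ is reachable from $v$, which can over-approximate the edges on simple paths when the graph has cycles. The edge list is a hypothetical fragment that borrows the two useless edges $V_1 \to V_7$ and $V_6 \to V_7$ named in the text.

```python
def reach(adj, start):
    """Return the set of vertices reachable from start in adjacency map adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def useful_edges(edges, s, t):
    """Keep edges (u, v) with s ~> u and v ~> t; all other edges are useless."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    from_s = reach(fwd, s)   # vertices reachable from the source
    to_t = reach(bwd, t)     # vertices that can reach the target
    return [(u, v) for u, v in edges if u in from_s and v in to_t]

# Hypothetical fragment; (1, 7) and (6, 7) cannot lie on any 0 -> 6 path.
edges = [(0, 1), (1, 6), (0, 2), (2, 5), (5, 6), (1, 7), (6, 7)]
kept = useful_edges(edges, 0, 6)
```

Here two of the seven edges are dropped before any random number is drawn, so sampling $K$ worlds needs $5K$ instead of $7K$ draws.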

Unnecessary edge sampling wastes both computation and memory resources, because extra memory is required to cache the useless edge sampling results. Figure 3(b) depicts the memory consumption of two state-of-the-art approaches, DistR [8], which uses entirety sampling, and ProbTree [37], which uses partition sampling. The results show that their memory requirement is more than 5.31× larger than the raw structural data for the kron_g500-logn20 graph, and even 6.57× larger for the gnutellaP2P graph. The partition sampling methods consume more memory because they require extra space to store the index tree. Due to this massive caching memory overhead, existing approaches cannot scale to large-scale uncertain graphs.

Traversal Redundancy. Besides, in entirety and partition sampling, there exist a lot of redundant traversals due to the similarity of different possible worlds. For instance, in Figure 2(b), $Path_1$ in $G_1$ and $Path_5$ in $G_3$ are exactly the same. Instead of letting different possible worlds traverse these common paths separately, it is much more efficient to traverse them only once.

Poor Programmability and Generality. The state-of-the-art methods focus on solving specific uncertain graph problems. For instance, [24] is designed for traffic prediction, while [2] focuses on social network analysis. A general, programmable framework is needed to support the implementation of a wide range of uncertain graph applications.

Lack of Utilizing State-of-the-art GPUs. Existing uncertain graph processing frameworks are all built upon CPU platforms. Compared to CPUs, GPUs have shown great potential for deterministic graph algorithms because of their superior parallel capability [27, 28, 45, 59]. GPU-accelerated data analytics technology has so far been used mainly for certain graph analysis [4, 14, 59], and very little literature studies uncertain graph processing. Because the SIMD architecture is suited to repetitive computation on regular data, accelerating uncertain graph processing via GPUs is still challenging: i) how to organize massive numbers of possible worlds in GPU-resident memory via dataset reformation, and ii) how to easily express the probabilistic features of uncertain graph programs on parallel SIMD-aware GPUs. This motivates us to leverage the GPU’s high computation throughput for uncertain graph processing.

2.4 Our Goal

To address unnecessary edge sampling and redundant traversals, we aim to identify and filter out useless edges before sampling. Section 3 will introduce our novel path sampling method for this purpose. To address the programmability and hardware utilization challenges, we aim to propose a general multi-GPU based uncertain graph processing framework which centers around our path sampling method. Section 4 will discuss the design of our framework.

3 Novel Path Sampling

Inspired by our observation that possible worlds share many common paths, we propose a novel path sampling strategy. While entirety and partition sampling traverse every possible world after sampling, our path sampling approach requires only a one-time traversal of the uncertain graph before sampling to find all possible paths between the source and target vertices. As a result, the shared paths among different possible worlds are sampled only once, which solves the traversal redundancy challenge discussed in Section 2.3. Furthermore, our path sampling only samples edges on possible paths and completely avoids sampling other useless edges. This addresses the unnecessary edge sampling challenge in Section 2.3.

3.1 Overview

Figure 4 shows how our proposed path sampling works for the

uncertain graph in Figure 2(a).

Path Identification. First, we traverse the uncertain graph (shown in Figure 4(a)) from the source vertex $V_0$ to find all possible paths that lead to the target vertex $V_6$. To find paths from $V_0$ to $V_6$, we align all vertices along the breadth-first order tree and mark their out-edges, then do a bottom-up traversal to find the temporary paths from $V_0$ to $V_6$. After recursively expanding other paths among the intermediate vertices, all possible paths from $V_0$ to $V_6$ are obtained. For example, after getting the first path $V_0 \to V_2 \to V_5 \to V_6$, we repeatedly add other paths $V_0 \to V_4 \to V_5$ and $V_0 \to V_2 \to V_4 \to V_5$ from the bottom-up results of $V_0 \to V_5$ and $V_0 \to V_2$ until no new vertex is added to the path. Note that the depth of a path is bounded by the diameter of the graph, which prevents the lengths of generated paths from getting too large or skewed. Figure 4(b) shows all the paths found.
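The goal of this step, collecting every source-to-target path once on the deterministic structure with a bounded depth, can be illustrated with a plain depth-first enumeration. The Python sketch below is a deliberately simpler substitute for the breadth-first order tree and bottom-up expansion described above; the function `identify_paths`, the `max_depth` parameter, and the eight-edge graph (a hypothetical subset of the toy example) are assumptions of this illustration.

```python
def identify_paths(edges, s, t, max_depth):
    """Enumerate simple s -> t paths with at most max_depth edges (plain DFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    paths = []

    def dfs(path):
        u = path[-1]
        if u == t:
            paths.append(tuple(path))
            return
        if len(path) - 1 >= max_depth:   # depth bound, e.g. the graph diameter
            return
        for v in adj.get(u, ()):
            if v not in path:            # keep paths simple (no repeated vertex)
                path.append(v)
                dfs(path)
                path.pop()

    dfs([s])
    return paths

# Hypothetical subset of the toy graph, directions assumed for illustration.
toy_edges = [(0, 1), (1, 6), (0, 2), (2, 5), (5, 6), (0, 4), (4, 5), (2, 4)]
all_paths = identify_paths(toy_edges, 0, 6, max_depth=4)
```

The traversal runs once on the deterministic structure, independent of the number of samples $K$; the sampled worlds only decide later which of these paths exist.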

Edge Sampling. Second, we only sample edges that are on those identified paths. This is the core difference between our path sampling and entirety/partition sampling, which also sample other useless edges. Figure 4(c) shows the sampled edges.

The sampling of different edges is performed independently. A bitmap is used to store the sampling result for each edge. It has $K$ bits ($K = 3$ in Figure 4), and the $i$-th bit represents the result of the $i$-th sampling. If the $i$-th bit $bit_i = 0$, this edge does not exist in the $i$-th possible world $G_i$. Otherwise, $bit_i = 1$ represents that it exists in $G_i$. Each edge is sampled $K$ times by generating $K$ random numbers $\{r_1, r_2, \ldots, r_K\}$, each of which is uniformly distributed in the range $(0, 1]$. We compare the edge’s probability with $r_1, r_2, \ldots, r_K$. If the probability is larger than $r_n$, we set the $n$-th bit of the bitmap to 1 to mark the edge’s existence, otherwise to 0. E.g., in Figure 4(c), the status 011 of edge $V_0 \to V_1$ means that it exists in the 2nd and 3rd samples but not in the 1st sample.
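A minimal CPU-side sketch of this per-edge bitmap sampling is shown below. It assumes a Python integer as the bitmap, with the first sample stored in the most significant of the $K$ bits so that the printed string reads like the status bits in Figure 4(c); the helper names and the fixed seed are choices of this example, not part of BPGraph.

```python
import random

def sample_edge_bitmap(p, K, rng):
    """Sample one edge K times; the first sample becomes the leftmost bit."""
    bits = 0
    for _ in range(K):
        r = 1.0 - rng.random()                     # uniform in (0, 1]
        bits = (bits << 1) | (1 if p > r else 0)   # edge exists if P(e) > r
    return bits

def as_status(bits, K):
    """Render a bitmap as a K-character string, e.g. 0b011 -> '011'."""
    return format(bits, f"0{K}b")

rng = random.Random(42)          # arbitrary fixed seed for reproducibility
demo = sample_edge_bitmap(0.3, 3, rng)
```

Packing the $K$ outcomes of one edge into a single word means that later steps can combine $K$ possible worlds with one bitwise instruction per edge.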

Path Sampling. Third, we combine the edge sampling results to compute the path sampling results. Intuitively, a path only exists when every edge on it exists. Similarly, a bitmap is used for each path to represent its sampling result, as shown in the Status Bits of Figure 4(d). The path bitmap is calculated by logically ANDing the bitmaps of every edge on this path. For example, the bitmap of $path_0$ equals the bitmap of edge $V_0 \to V_1$ AND that of $V_1 \to V_6$, which further equals $011 \wedge 001 = 001$. Note that the sampling result of each path can be computed independently.
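The AND reduction can be written as a short Python sketch. The bitmaps for $V_0 \to V_1$ and $V_1 \to V_6$ are the ones quoted in the text; the remaining entries are our reading of Figure 4(c) and should be treated as assumptions, as should the helper name `path_bitmap`.

```python
from functools import reduce

def path_bitmap(path, edge_bits, K):
    """AND the K-bit bitmaps of all consecutive edges along a vertex path."""
    all_ones = (1 << K) - 1              # an empty path exists in every sample
    edges = zip(path, path[1:])
    return reduce(lambda acc, e: acc & edge_bits[e], edges, all_ones)

# Edge status bits for K = 3; only the first two are confirmed by the text.
EDGE_BITS = {
    (0, 1): 0b011, (1, 6): 0b001,
    (0, 2): 0b111, (2, 5): 0b101, (5, 6): 0b101,
}
```

For example, `path_bitmap((0, 1, 6), EDGE_BITS, 3)` evaluates $011 \wedge 001$ and yields `0b001`, so this path exists only in the third sample.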

Constraint based Filtering. Fourth, candidate paths are filtered based on the path sampling results and user-defined constraints. There is a two-level filter to evaluate the existence of paths. (1) The first filter prunes paths that do not exist in any sample. If all sampling result bits are 0 for a path (Status Bits in Figure 4(d)), it is filtered out. In Figure 4(d), $path_3$, $path_5$, $path_6$, and $path_8$ are pruned by this filter. (2) The second filter prunes paths that do not qualify according to the user-defined constraint criteria. $path_2$ and $path_7$ are pruned by the second filter in Figure 4(d).

Result Computation. Finally, we obtain the distance probabilistic results (Figure 4(f)) from the remaining paths (Figure 4(e)). This step is the same as that of entirety sampling described in Section 2.2.
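The two-level filter and the final result computation can be replayed on the values of Figure 4(d) with the Python sketch below. The status bits and constraint values are taken from the figure; the distances of the surviving paths are inferred from the pairs in Figure 4(f), and merging paths of equal distance with a bitwise OR is an assumption of this sketch (a distance counts for a sample if any path of that length exists in it).

```python
from fractions import Fraction

K = 3
# name -> (status bits, constraint value), as listed in Figure 4(d).
STATUS = {
    "path0": (0b001, 0.6), "path1": (0b101, 0.7), "path2": (0b100, 0.3),
    "path3": (0b000, 0.8), "path4": (0b100, 0.7), "path5": (0b000, 0.2),
    "path6": (0b000, 0.4), "path7": (0b010, 0.1), "path8": (0b000, 0.8),
}
# Hop counts of the surviving paths, inferred from Figure 4(f) (assumption).
DISTANCE = {"path0": 2, "path1": 3, "path4": 4}

def filter_paths(status, sigma):
    """Apply both filters and return the surviving path bitmaps."""
    kept = {}
    for name, (bits, value) in status.items():
        if bits == 0:          # filter 1: path exists in no sample
            continue
        if value <= sigma:     # filter 2: user-defined constraint value > sigma
            continue
        kept[name] = bits
    return kept

def distance_reliability(kept, distance, K):
    """Reliability of a distance = share of samples containing such a path."""
    merged = {}
    for name, bits in kept.items():
        d = distance[name]
        merged[d] = merged.get(d, 0) | bits
    return {d: Fraction(bin(b).count("1"), K) for d, b in merged.items()}

kept = filter_paths(STATUS, sigma=0.5)
result = distance_reliability(kept, DISTANCE, K)
```

With these inputs, `kept` contains `path0`, `path1`, and `path4`, and `result` reproduces the pairs (2, 1/3), (3, 2/3), and (4, 1/3) of Figure 4(f).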

3.2 Computational & Memory Overhead

We model the principal computational and memory overhead of entirety sampling, partition sampling, and our proposed path sampling. Given an uncertain graph $\mathcal{G}$, let the numbers of vertices and edges be $n$ and $m$ respectively, i.e., $n = |V|$ and $m = |E|$, and let the number of sampled possible worlds be $K$.

Entirety Sampling. These methods sample each edge of $\mathcal{G}$ $K$ times, and thus the total number of sampling operations is $K \times m$. Storing the whole set of possible worlds consumes $K \times m \times f$ of memory space, where $f$ is the average fraction of edges sampled over all possible worlds and $0 < f < 1$.

Partition Sampling. Partition sampling organizes the uncertain graph into $P$ partitions and connects them via a traversal index tree. The s-t query task finds all relevant partitions along the tree and combines them into one subgraph. The subgraph needs to contain all the traversal paths to give a precise answer to the query. The sampling overhead depends on the number of edges $m_p$ of the reduced subgraph, and the number of sampling operations equals $K \times m_p$. Meanwhile, due to the mutual connections between the $P$ partitions, the memory cost of partition sampling is much larger than that of entirety sampling, i.e., $K \times m_p \times f + 2P$.

Our Path Sampling. Different from sampling the entire graph or a partial subgraph, path sampling achieves the minimal number of edge sampling operations, $K \times m'$, where $m'$ is the number of useful edges. In power-law distributed real-world graphs, $m'$ is far less than $m$ in most cases. Similarly, the memory cost of path sampling is proportional to $K \times m'$. Section 5.2 further evaluates the memory cost on real-world graphs and demonstrates why our method consumes significantly less memory compared to the other methods.
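As a quick numeric illustration of the sampling-operation counts in this model, the Python sketch below plugs in made-up sizes. All numbers are hypothetical and serve only to show how the three terms $K \times m$, $K \times m_p$, and $K \times m'$ compare; they are not measurements from the paper.

```python
def sampling_operations(K, m, m_p, m_useful):
    """Number of edge sampling operations under the cost model of Section 3.2."""
    return {
        "entirety": K * m,          # every edge, K times
        "partition": K * m_p,       # edges of the relevant subgraph
        "path": K * m_useful,       # only edges on identified s-t paths (m')
    }

# Hypothetical sizes: 1,000 samples, 1M edges, 200K in G_q, 50K useful edges.
ops = sampling_operations(K=1000, m=1_000_000, m_p=200_000, m_useful=50_000)
saving = ops["entirety"] / ops["path"]
```

Under these assumed sizes, path sampling performs 20× fewer sampling operations than entirety sampling; the actual ratio depends entirely on how small $m'$ is relative to $m$ for a given graph and query.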

4 BPGraph Framework

Building upon our core path sampling method, we propose an efficient GPU-accelerated uncertain graph processing framework, called BPGraph, to provide high performance, scalability, and programmability on multi-GPU systems. In this section, we first describe the proposed general programming API in BPGraph, which allows users to easily define uncertain graph applications (Section 4.1). Then we discuss the implementation aspects of BPGraph: a fast GPU-based design of one-pass path identification and path sampling (Section 4.2), and intra-GPU path sampling optimization and multi-GPU scaling (Section 4.3). Together, they provide an efficient parallel GPU implementation of path sampling.

4.1 Path-Sampling Centric Programming

To easily express and debug uncertain graph applications on GPU accelerators, BPGraph provides a unified programming API. Generally, starting from the input source vertices, uncertain graph applications recursively perform reliability evaluation operations on the set of identified paths until achieving a global reliable result. BPGraph provides user-defined parameters and API functions for user involvement. The parameter option is a simple form of user involvement, which includes the number of samples $K$, accuracy bound, value reliability, etc. User-defined API functions are more expressive and let users describe the control logic of the uncertain graph applications of interest.

Application Programming Interfaces (APIs). The following four API functions are proposed, grouped by the stage (as described in Section 3.1) in which they are invoked.

• Path Identification. We propose a DispelEntity function to define the activation filter operation on active structural vertices/edges [59], enabling the identification of source-to-target paths from graphs.

• Path Sampling. We propose an Initialize function to initialize the distance of an empty path from the identified paths and to define the sampling method utilized. Another function, Expand, is repeatedly invoked on all edges of a path to calculate the distance and probability of that path.

• Filtering & Result Computation. We propose a function ReduceVertex to combine the distance and probability of all available paths on the target vertex.


Listing 1 illustrates the execution flow of BPGraph using these API functions. The input is an uncertain graph together with pairs of source and target vertices. The path-centric execution flow proceeds in two stages: expanding and updating the values of paths along their edges (lines 5-9), and reducing these values to compute the distance and probability of the target vertices (lines 12-13).

Take Figure 4 as an example. Given a source vertex 0 and a target vertex 6 in the toy uncertain graph $g$ (Figure 2(a)), the reachability traversal algorithm in Figure 4 computes the distance between them and the reliability value (i.e., the probability that a path from 0 to 6 exists). This algorithm is widely used in high-performance data centers and sensor networks for delivering data packets [51][44].

 1  void MainProc(Graph g, Vertex srcs[], Vertex tgts[]) {
 2    /* Path Identification Stage */
 3    ConvertGraphToPaths(g, DispelEntity, srcs, tgts);
 4    /* Path Sampling Stage */
 5    parallel-for Path p in g.getPaths(srcs, tgts)
 6      Initialize(p);
 7      // Nested parallel processing of edges along paths
 8      parallel-for Edge e in p.getEdges()
 9        Expand(p, e);
10    synchronize;  // Synchronize cooperative threads
11    /* Result Computation: reduce from all paths */
12    parallel-for Vertex (s, v) in (srcs, tgts)
13      ReduceVertex(g.getPaths(s, v), v);
14    synchronize;  // Synchronize cooperative threads
15  }
Listing 1: Pseudo code for path-based execution flow.

 1  /* Path Initial Identification Phase */
 2  extern void DispelEntity() {
 3    DistanceConstrain = 5; ReliabilityThreshold = 0.1; }
 4  /* Path Sampling Phase */
 5  __device__ void Initialize(Path p) {
 6    p.distance = 0; }
 7  __device__ void Expand(Path p, Edge e) {
 8    p.distance += e.distance;
 9    p.prob = OP_AND(p.prob, e.prob); }
10  /* Result Computation Phase */
11  __device__ void ReduceVertex(Path pArray[], Vertex v) {
12    for path p in pArray {
13      if ((p.distance < DistanceConstrain)
14          && (p.prob > ReliabilityThreshold))
15        // Use expected-reliable formulation
16        atomicAdd(v.distance,
17                  OP_MUL(p.prob, p.distance));
18  } }
Listing 2: Source-to-target query implementation in BPGraph.

Figure 5: Dependency generation in path identification. (a) Pre-partition traversal tree, with bridging vertices flagged as primary or secondary; (b) generated dependent paths $P_0$ to $P_6$ executed on GPU cooperative groups cg_0 to cg_6.

Listing 2 shows how to implement the reachability traversal algorithm in BPGraph. First, before execution, the global constraints are defined through parameter-based user involvement, i.e., ReliabilityThreshold=0.5 and DistanceConstrain=5, to filter out unqualified paths. Subsequently, as shown in Listing 2, we first define the initial distance of an empty path to be 0 (line 6). In this example, the distance of a path is defined as the sum of the distances of all its edges (line 8), e.g., the distance of path [0 1 6] is calculated as 2 in Figure 4(d). Besides, a path exists in a possible world only if all its edges exist in that possible world (line 9), which is exploited to filter the status bits in Figure 4(d). Finally, ReduceVertex aggregates the results of the corresponding paths to calculate the distances and probabilities of vertices, multiplying distance by probability to obtain an expected-reliable result; only paths that satisfy the global probability and distance constraints contribute (lines 12-17). The final expected-reliable distance of vertex 6 for the reachability traversal from 0 is $2 \times \frac{1}{3} + 3 \times \frac{2}{3} + 4 \times \frac{1}{3} = 4$ (Figure 4(f)).
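
As an illustration of the reduction step, the following host-side C++ sketch mirrors the filter-and-accumulate logic of ReduceVertex in Listing 2. It is not BPGraph code: the PathResult struct and the expectedReliableDistance function are hypothetical names introduced here, the two thresholds are passed as plain arguments, and a sequential loop stands in for the atomicAdd performed by GPU threads.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical host-side mirror of the per-path values used in Listing 2.
struct PathResult {
    double distance;  // sum of the edge distances along the path
    double prob;      // reliability (existence probability) of the path
};

// CPU analogue of ReduceVertex: accumulate prob * distance over the paths
// that satisfy both global constraints (expected-reliable formulation).
double expectedReliableDistance(const std::vector<PathResult>& paths,
                                double distanceConstraint,
                                double reliabilityThreshold) {
    double total = 0.0;
    for (const PathResult& p : paths) {
        assert(p.prob >= 0.0 && p.prob <= 1.0);  // probabilities only
        if (p.distance < distanceConstraint && p.prob > reliabilityThreshold) {
            total += p.prob * p.distance;
        }
    }
    return total;
}
```

With a distance constraint of 5 and a reliability threshold of 0.1 (the values set in Listing 2), the three terms of the worked example, i.e., the (distance, probability) pairs (2, 1/3), (3, 2/3) and (4, 1/3), all pass the filter and sum to 4.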

Generality of our APIs. To demonstrate the expressiveness of our APIs, we have developed several uncertain graph applications with them: the aforementioned source-to-target query (a.k.a. s-t query), k-nearest neighbor search, breadth-first search, and any-pair shortest paths. These algorithms are all built upon the path sampling model.

4.2 GPU-based Design for Asynchronous Path Identification and Sampling

As mentioned in Section 3.2, the path-sampling methodology achieves the minimal sampling overhead and consumes significantly less memory than other methods. Nevertheless, building a parallel uncertain graph processing framework on GPUs is still challenging [33, 34, 56]. This subsection focuses on the implementation of the challenging phases, namely parallel path identification (Figure 4(a)) and sampling (Figure 4(b)). To enable fast path identification, inspired by iterative graph processing methods [32, 37, 64], BPGraph maintains the entire uncertain graph as a breadth-first ordered tree (Figure 5), which helps optimize graph locality.

Asynchronous Path Identification via Dependency. As the breadth-first ordered tree shows, a large number of paths of varying lengths results in space explosion. Before identifying this massive set of paths, BPGraph pre-partitions the paths in two steps using a dependency tree building technique: (a) breadth-first layer-aware decomposition: we first perform a partition stage that divides the ordered tree into linked paths (as shown in Figure 5); (b) dependency tree construction over the linked paths: we connect the corresponding in-neighbors of each start vertex to create dependencies between the linked paths. For example, according to the dependency relationship of the start vertices 0, 2, 3, the paths $P_1$, $P_4$ and $P_5$ all depend on $P_0$. Furthermore, these bridging vertices are marked with two flags, primary and secondary (e.g., the marked vertex 6 in Figure 5(a)), which are used for the subsequent inter-path state synchronization.

Following the generation of the dependency tree, BPGraph drives GPU threads to execute asynchronously over consecutive accesses of edges along paths. In particular, when executing source-to-target query requests, each path is asynchronously dispatched to GPUs as a single processing workload unit. A GPU thread or cooperative group sequentially checks the corresponding paths and synchronizes along the dependency (Figure 5(b)). For example, when processing the s-t pair (0-6), GPU threads propagate the starting vertex along paths and identify 0->2->5->6 and 0->1->6 during the first round. In the following rounds, the other paths between 0 and 6 are identified after synchronizing the primary and secondary vertices, e.g., path 0->4->5->3->6 runs along $P_3$ and $P_6$ during the 2nd round, and path 0->2->4->5->3->6 runs along $P_0$, $P_4$ and $P_5$ during the 3rd round. Note that, as in previous uncertain graph processing work [8, 25, 37, 47, 69], these are simple paths with length shorter than the diameter of the graph, so no vertex appears more than once in the sequence, which naturally eliminates cycles along the paths.

In addition, we cache the identified paths in a consecutive format made of three arrays: 1) the increasing number of paths of varying lengths; 2) the offset of each path's first edge; 3) the edges stored along paths, with the source and destination of each edge represented by two consecutive vertices; e.g., the edges of path 0->2->5->6 are stored as the 4 vertices [0 2 5 6].
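
One possible reading of this three-array layout is a CSR-like container, sketched below in host-side C++. The PathStore struct, its field names, and the extra end sentinel in the offsets array are assumptions made for illustration; the paper does not spell out the exact layout.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CSR-like container for identified paths.
struct PathStore {
    std::size_t numPaths = 0;             // (1) running count of cached paths
    std::vector<std::size_t> offsets{0};  // (2) start of each path in `vertices`, plus an end sentinel
    std::vector<int> vertices;            // (3) vertex IDs; two consecutive IDs form one edge

    // Append one path given as its vertex sequence, e.g. {0, 2, 5, 6}.
    void addPath(const std::vector<int>& path) {
        assert(path.size() >= 2);  // a path needs at least one edge
        vertices.insert(vertices.end(), path.begin(), path.end());
        offsets.push_back(vertices.size());
        ++numPaths;
    }

    // Number of edges of path p: one less than its number of vertices.
    std::size_t edgeCount(std::size_t p) const {
        return offsets[p + 1] - offsets[p] - 1;
    }

    // The e-th edge of path p as a (source, destination) pair.
    std::pair<int, int> edge(std::size_t p, std::size_t e) const {
        assert(p < numPaths && e < edgeCount(p));
        const std::size_t i = offsets[p] + e;
        return std::make_pair(vertices[i], vertices[i + 1]);
    }
};
```

Storing vertices rather than explicit edge pairs keeps each path contiguous, so the threads of one group can read the consecutive entries of a path with coalesced accesses.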

Incremental Path Sampling. Figure 4 shows the overall path sampling process in BPGraph. From path identification (b) to edge sampling (c) in Figure 4, different paths share common edges; for example, edges (0,2), (2,4), and (4,5) are in both $path_7$ and $path_8$. If each path is sampled independently, these shared edges are sampled repeatedly. A straightforward remedy is to group the edges of previously identified paths into a new edge list and sample these edges in parallel. However, re-organizing identified paths into edges incurs extra random-access overhead and requires pre-allocating a large cache space for the edges.

Instead, we design an incremental hierarchical path sampling strategy that reaps the benefits of GPU cooperative group programming. Rather than sampling every path or edge, this strategy eliminates many redundant sampling workloads through early edge filters (e.g., edge (3,1)). Cooperative groups (cg) allow kernels to dynamically organize groups of threads, which makes it possible to synchronize groups of threads smaller than a thread block and enables software reuse in the form of "collective" group-wide function interfaces [21]. The core idea of our incremental path sampling is to sample incrementally and decide whether to add the sampled equal-length sub-paths to paths. The lengths of these sub-paths correspond to the GPU multi-threading resources available at runtime, i.e., cg tiles. In particular, to store the common sub-paths, a shared cache array is kept in global memory to mark whether or not each sub-path has been sampled. Each thread block is partitioned into multiple "tiles" via the cooperative group function tiled_partition(), whose template parameter is determined by the lengths of the sub-paths. One path sampling workload is assigned to one group. If the sub-paths have not been sampled, the threads in each tile sample edges based on their probabilities and pass their partial status bits to the other threads through the shfl() operation. The thread of rank 0 sums up the distance and reliability value of the path. Before the cooperative group synchronization (sync()), BPGraph generates the existence bits of each path one by one by merging and computing a prefix sum over the edges. After filtering the available paths, the user-defined kernel functions (Listing 1) are executed in parallel to asynchronously update the reliability values and weights of the vertices from the source vertex to the target vertex of each path. Finally, the application results and their reliability values are obtained by selectively aggregating values over paths of different lengths.
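
The benefit of sampling shared edges only once can be sketched on the CPU as follows. This is a simplified illustration, not the BPGraph kernel: it caches per edge rather than per equal-length sub-path, packs the status bits of up to 64 sampled worlds into one 64-bit word, and uses the hypothetical names EdgeSampleCache and pathStatusBits. The path 0->2->4->5->6 mentioned in the usage note is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Samples each distinct edge at most once across all paths and keeps its
// status bits (bit w set = edge exists in sampled world w), for K <= 64 worlds.
class EdgeSampleCache {
public:
    EdgeSampleCache(int numWorlds, unsigned seed) : k_(numWorlds), rng_(seed) {
        assert(numWorlds >= 1 && numWorlds <= 64);
    }

    // Returns the cached bits of edge e, sampling them on first use only.
    std::uint64_t bits(const Edge& e, double prob) {
        auto it = cache_.find(e);
        if (it != cache_.end()) return it->second;
        std::bernoulli_distribution exists(prob);
        std::uint64_t mask = 0;
        for (int w = 0; w < k_; ++w) {
            if (exists(rng_)) mask |= (std::uint64_t{1} << w);
        }
        ++edgesSampled_;
        cache_[e] = mask;
        return mask;
    }

    int edgesSampled() const { return edgesSampled_; }

private:
    int k_;
    std::mt19937 rng_;
    std::map<Edge, std::uint64_t> cache_;
    int edgesSampled_ = 0;
};

// A path exists in world w only if all of its edges do: AND the edge bits.
std::uint64_t pathStatusBits(const std::vector<int>& path,
                             const std::map<Edge, double>& edgeProb,
                             EdgeSampleCache& cache, int numWorlds) {
    std::uint64_t status = (numWorlds == 64)
                               ? ~std::uint64_t{0}
                               : ((std::uint64_t{1} << numWorlds) - 1);
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        const Edge e(path[i], path[i + 1]);
        status &= cache.bits(e, edgeProb.at(e));
    }
    return status;
}
```

For the two paths 0->2->4->5->6 and 0->2->4->5->3->6, which share the edges (0,2), (2,4) and (4,5), the cache samples 6 distinct edges instead of the 9 edge visits that independent per-path sampling would require.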

(1) Thread-Level Path Sampling. When paths are short (length < 32), the sub-warp kernel processes several identified paths in a single warp and requires fewer threads per path (32, 16, 4) with tiled_partition<num_thread>.
(2) Warp-Level Path Sampling. When paths are of medium length (< 1024), the sampling kernel requires fewer threads than the maximum thread block size (1,024) with coalesced_threads.
(3) Grid-Level Path Sampling. When paths are long (length > 1024), the grid kernel processes paths in several thread blocks, and sampling requires more than 1,024 threads with thread_block and even grid_group on devices.
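
The three cases can be read as a simple dispatch on path length, sketched below. The enum and function names are hypothetical, and two details are assumptions rather than statements from the paper: lengths of exactly 32 and 1,024 are assigned to the warp-level case, and the tile size is taken as the smallest power of two that gives each edge of a short path its own thread (tiled_partition accepts power-of-two sizes up to the warp size).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the three sampling granularities listed above.
enum class SamplingLevel { ThreadLevel, WarpLevel, GridLevel };

// Picks the kernel flavor from the path length (number of edges).
SamplingLevel chooseLevel(std::size_t pathLength) {
    if (pathLength < 32) return SamplingLevel::ThreadLevel;  // sub-warp tiles
    if (pathLength <= 1024) return SamplingLevel::WarpLevel; // within one thread block
    return SamplingLevel::GridLevel;                         // several thread blocks
}

// Assumed policy: smallest power-of-two tile (at most one warp of 32 threads)
// that provides one thread per edge of a short path.
unsigned tileSizeFor(std::size_t pathLength) {
    assert(pathLength >= 1 && pathLength <= 32);
    unsigned tile = 1;
    while (tile < pathLength) tile *= 2;
    return tile;
}
```

Under this policy a path with 3 edges is sampled by a tile of 4 threads, so a single warp can process 8 such paths at once.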

4.3 Scalable BPGraph Implementation on Multiple GPUs

As uncertain graphs grow, scaling BPGraph to multi-GPU systems (shown in Figure 6(a)) becomes more desirable and beneficial. BPGraph tackles the scalability challenge of processing large-scale uncertain graphs with a streaming graph partition and workload migration design. To fully utilize the aggregated GPU memory, BPGraph distributes the entire uncertain graph across the memories of the GPUs. Figure 6 illustrates the essential communication and synchronization requirements of the multi-GPU version of BPGraph.

Streaming Uncertain Graph Partition and Allocation (❶ in Figure 6). There are many well-known partitioning approaches for certain graphs, such as vertex partitioning in GraphX [61] and GraphLab [19], Metis [31], and grid partitioning in GridGraph [68]. In BPGraph, we use an edge partitioning method [27] to achieve fast and flexible uncertain graph partitioning.

Assuming there are $D$ devices, all edges are partitioned into $D$ (= #GPU) disjoint partitions $P_i$ ($1 \le i \le D$), where $E = \bigcup_{i=1}^{D} P_i$. With the edge partitioning approach, CSR-formatted inputs can be partitioned efficiently by quickly scanning the row indices of the graph. In our experiments, partitioning a million-edge graph takes only 1.5-4.2 milliseconds. The corresponding partitions of the distance and edge probability values are cached on each GPU, which allows each GPU to synchronize its own copy of the vertex array using GPUDirect peer-to-peer communication by updating only its own portion of the results. By exploiting the optimized CUDA I/O primitives (cudaMemAdvise and cudaMemPrefetchAsync), loading uncertain graph partitions in streams benefits from efficient data prefetching and a significant reduction in page faults before kernel launch. Further, following the previous design, the cooperative group primitives enable global synchronization patterns across multiple GPUs within CUDA. We design a multi-device cooperative group for multi-accelerator management, initialized via cudaLaunchCooperativeKernelMultiDevice, which enables synchronization of thread groups across devices.
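
A minimal sketch of such an edge partitioning over a CSR row-offset array is given below, assuming contiguous edge ranges of near-equal size $|E|/D$ and a graph with at least one edge. The EdgePartition struct and the partitionEdges function are hypothetical names, and the binary search for the first row of each range is one way to realize the row-index scan, not necessarily the one used in BPGraph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open edge range [edgeBegin, edgeEnd) owned by one device, plus the
// CSR row (source vertex) that contains the first edge of the range.
struct EdgePartition {
    std::size_t edgeBegin;
    std::size_t edgeEnd;
    std::size_t firstRow;
};

// Splits the |E| edges of a CSR graph into D disjoint, contiguous partitions
// of near-equal size. rowOffsets has |V| + 1 entries and ends with |E|.
std::vector<EdgePartition> partitionEdges(const std::vector<std::size_t>& rowOffsets,
                                          std::size_t numDevices) {
    assert(rowOffsets.size() >= 2 && numDevices > 0);
    const std::size_t numEdges = rowOffsets.back();
    assert(numEdges > 0);
    std::vector<EdgePartition> parts;
    for (std::size_t i = 0; i < numDevices; ++i) {
        const std::size_t begin = numEdges * i / numDevices;
        const std::size_t end = numEdges * (i + 1) / numDevices;
        // Last row whose offset is <= begin, i.e. the row that holds edge `begin`.
        const std::size_t row = static_cast<std::size_t>(
            std::upper_bound(rowOffsets.begin(), rowOffsets.end(), begin) -
            rowOffsets.begin()) - 1;
        parts.push_back(EdgePartition{begin, end, row});
    }
    return parts;
}
```

Because the ranges are consecutive and cover [0, |E|), the partitions are disjoint and their union is the full edge set, matching $E = \bigcup_{i=1}^{D} P_i$.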

Dynamic Workload Migration (❷ in Figure 6). Due to the irregularity and probabilistic nature of uncertain graphs, GPU-accelerated systems have a difficult time balancing workloads: 1) the path identification workload requires dynamically filtering redundant, unavailable vertices and edges; 2) the path sampling workload cannot be balanced across devices because path lengths differ. Such workloads, distributed across devices and SMXs, may result in skewed loads, with some devices being released earlier than others at runtime. To address these scalability issues, we develop a dynamic workload migration strategy.

Figure 6: Data management and inter-GPU communication over multi-GPU systems. ❶ Streaming Partition & Allocation: the vertexOffset and Edge arrays in host memory are split into partitions of |E|/D edges, one per GPU (GPU0 to GPUN). ❷ Dynamic Workload Migration between the per-GPU path sets and sampled edges. ❸ Coordinated Offloading of paths, reliability values, and results to host memory.

During the path identification stage (Figure 4(a)), we design a dual buffer to handle workload balancing, inspired by the strategy of transferring only active workloads [27]. In particular, each GPU thread block constructs two dynamically sized buffers, i.e., an in-buffer and an out-buffer, which are used for synchronization and migration. For example, if we need to traverse edges stored on other devices and migrate the redundant edges of paths from GPU-1 to GPU-D, the buffer of GPU-1 is filled with the path's current destination vertex ID after determining whether any of the warp lanes has a corresponding workload to transfer. Through ring-topology synchronization, the buffers of the devices are synchronized and the updated messages are sent along GPU-1, GPU-2, ..., GPU-D.

During the path sampling stage (Figure 4(c)), the host CPU estimates the sampling workload of each GPU by calculating the number of edges that it needs to sample ($|E|_{sample}$). Then, BPGraph balances the sampling workload by moving edges to be sampled from overloaded GPUs to under-utilized ones.
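
The balancing step can be sketched as a greedy plan computed on the host, assuming that every GPU should end up with the average number of edges to sample. The Migration struct and the planMigration function are hypothetical; the paper does not specify how BPGraph chooses the source and destination of each transfer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One transfer of `edges` to-be-sampled edges between two GPUs.
struct Migration {
    std::size_t from;
    std::size_t to;
    long long edges;
};

// Greedy plan that brings every GPU to the average sampling load. When the
// total is not divisible by the GPU count, the first GPUs take one extra edge.
std::vector<Migration> planMigration(const std::vector<long long>& load) {
    assert(!load.empty());
    const long long n = static_cast<long long>(load.size());
    const long long total = std::accumulate(load.begin(), load.end(), 0LL);
    std::vector<long long> surplus(load.size());
    for (std::size_t g = 0; g < load.size(); ++g) {
        const long long extra = (static_cast<long long>(g) < total % n) ? 1 : 0;
        surplus[g] = load[g] - (total / n + extra);
    }
    std::vector<Migration> plan;
    std::size_t src = 0, dst = 0;
    while (true) {
        while (src < surplus.size() && surplus[src] <= 0) ++src;  // next overloaded GPU
        while (dst < surplus.size() && surplus[dst] >= 0) ++dst;  // next under-utilized GPU
        if (src == surplus.size() || dst == surplus.size()) break;
        const long long moved = std::min(surplus[src], -surplus[dst]);
        plan.push_back(Migration{src, dst, moved});
        surplus[src] -= moved;
        surplus[dst] += moved;
    }
    return plan;
}
```

For estimated loads of 100, 20, 60 and 20 edges on four GPUs, the plan moves 30 edges from GPU 0 to GPU 1, 20 from GPU 0 to GPU 3, and 10 from GPU 2 to GPU 3, leaving 50 edges on each device.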

Coordinated Data Offloading (❸ in Figure 6). Finally, to maintain the identified paths and reliability results, we construct a shared zero-copy buffer that stores a separate path array for each GPU after the sampled edge sets are synchronized. The path evaluations are independent of each other. Thus, each GPU samples edges and filters paths asynchronously during sampling. BPGraph uses shared zero-copy buffers to hold the paths, the sampled edges, and the evaluated vertex values (distance and reliability values). At runtime, GPU threads directly update the $Edge_{id}$ in paths and filter the corresponding edges. At the end, the evaluation results of all vertices are collected from the individual GPUs and written to the CPU buffer.

5 Evaluation

Platform. Table 2 shows the detailed platform configuration. We perform our GPU evaluation on an NVIDIA DGX server with 8 NVIDIA V100 GPUs. The host system of the DGX server consists of two 20-core Intel(R) Xeon(R) E5-2698 v4 CPUs and 512GB of DDR4 main memory, running Ubuntu 18.04 (kernel 4.15.0) and CUDA 11.0.

Datasets. Table 1 lists the real-world graphs, which cover a broad range of sizes and features and come from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/), the Network Data Repository (http://networkrepository.com/index.php), etc. The soc-twitter and com-friendster [1] graphs are collected from real-world social networks.

Table 1: Real-world graph datasets used in this paper. $S_G$ represents the raw graph size in probabilistic edge list format.
Dataset              Vertices |V|   Edges |E|       Size S_G   Prob Pr(E)   Avg. Degree D
netHept [1]          15,233         62,774          921KB      0.04±0.04    4
gnutellaP2P [1]      62,586         147,892         1.8MB      0.23±0.20    2
coauthor-DBLP [50]   540,486        15,245,729      331MB      0.11±0.09    28
kron-logn21 [50]     1,544,088      91,042,012      2.3GB      0.33±0.28    58
soc-twitter [30]     28,504,110     531,000,244     14GB       0.46±0.28    18
uk-2005 [1]          39,454,748     936,364,284     20GB       0.32±0.25    24
com-friendster [1]   65,608,366     1,806,067,135   41GB       0.52±0.25    29

Table 2: Platform Specification.
NVIDIA DGX Server (8 x GPU V100)
Shading Unit      5120 @ 1530MHz
On-chip Storage   L1 Cache/Shared: 48KB x 80; L2 Cache: 6MB
Default Memory    HBM, 32GB, 320GB/s

We use 7 publicly available real-world graphs. The edge probabilities in the first 2 datasets come from real-world applications, using the same configuration as in [37, 62], while the probabilities in the latter 5 datasets are randomly assigned within the specified value range.

Parameter Setting. To generate fair s-t query pairs, we select 10 different source vertices uniformly at random from each dataset. The target vertices are then chosen uniformly at random among the vertices $n$ hops away from the source vertices, where $n$ is randomly selected between 2 and the graph diameter. The reported s-t query results are averaged over all pairs. The number of samples $K$ is initially 100 and increases in steps of 200 until the results converge.

Benchmarks. We adopt three benchmarks, i.e., source-to-target query, k-nearest neighbors, and any-pair shortest path, which are popular in previous uncertain graph studies. Given an uncertain graph $\mathcal{G}(V, E, P)$, a possible world $G \sqsubseteq \mathcal{G}$, and a distance function $d$, we formulate these problems as follows.

(1) Source-to-target query (ST) returns the distance between two given vertices $s$ and $t$ whose probability is greater than a reliability threshold. The ST query computes the distance between $s$ and $t$ based on the distance function.
(2) K-nearest neighbors (KNN) returns the top-$k$ vertices with minimum distances and reliable probability from a given vertex $s$. Given a node $s$ ($s \in V$), an integer $k > 0$, and a reliability threshold $\sigma$, the $k$-nearest neighbors query (k-NN) aims to find a set of nodes $C$ such that for any $r \in C$, the value of $d(s, r)$ (denoted $D$) is in the top-$k$ list w.r.t. the function $d$, and its probability satisfies $p_{d(s,r)}(D) = \sum_{G \mid d_G(s,r) = D} \Pr(G) > \sigma$ [49].
(3) Any-pair shortest path (APSP) returns the paths with the minimum distance and reliable probability. Given node sets $S$ and $T$ ($S, T \subset V$) and a reliability threshold $\sigma$, it aims to compute the set of shortest-distance paths between every $s \in S$ and every $t \in T$ based on the distance function $d$. Meanwhile, the probability of each path needs to exceed the reliability threshold, i.e., $p_{path(s,t)}(D) > \sigma$.
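
In a sampling-based setting, the probability $p_{d(s,r)}(D)$ is not summed over all possible worlds but estimated from the $K$ sampled ones. The sketch below assumes independently sampled worlds and estimates the probability of each distance as its relative frequency; kUnreachable, distanceDistribution, and isReliable are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Sentinel meaning "r is unreachable from s in this sampled world".
const int kUnreachable = -1;

// Estimates p_{d(s,r)}(D) for every observed distance D as the fraction of
// the K sampled worlds in which d_G(s, r) = D.
std::map<int, double> distanceDistribution(const std::vector<int>& sampledDistances) {
    assert(!sampledDistances.empty());
    std::map<int, double> dist;
    for (int d : sampledDistances) {
        if (d != kUnreachable) dist[d] += 1.0;
    }
    const double k = static_cast<double>(sampledDistances.size());
    for (auto& entry : dist) entry.second /= k;
    return dist;
}

// True if distance D is reliable, i.e. its estimated probability exceeds sigma.
bool isReliable(const std::map<int, double>& dist, int D, double sigma) {
    auto it = dist.find(D);
    return it != dist.end() && it->second > sigma;
}
```

For example, if 8 sampled worlds yield the distances 2, 3, 3, unreachable, 3, 2, 3, 4, distance 3 has estimated probability 0.5 and passes a threshold of $\sigma = 0.4$, whereas distance 4 (probability 0.125) does not.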

For the KNN problem, BPGraph first generates multiple sets of neighbors of node $s$ and then evaluates them by reliability and shortest distance until it finds a full set of top-$k$ neighbors. The APSP problem on uncertain graphs differs from the one on certain graphs. BPGraph counts all input source-to-target pairs. User-defined constraint criteria are used to filter active paths, e.g., in Figure 4(d), the third-column values of paths need to be greater than 0.5. The user-defined filtering criterion is configured as a distance constraint that limits the number of edges along paths.

Figure 7: GPU-accelerated performance comparison. Lower is better. Each panel plots time (s) against the number of sample instances (200 to 1000) for BitEdge, ProbTree, BPGraph w/o GPU, and BPGraph: (a) netHept, (b) gnutellaP2P, (c) coauthor-DBLP, (d) kron-logn21, (e) soc-twitter (without BitEdge).

State-of-the-arts. We evaluate our system by comparing it with four state-of-the-art methods, and report the number of samples required for convergence, the running time, and the memory usage of all systems. Our datasets and source code will be publicly available. The four state-of-the-art algorithms are listed in Table 3. For a comprehensive and fairer comparison, we also implement GPU-accelerated versions of MC sampling and BitEdge sampling, in which we generate and traverse the $K$ possible worlds residing in GPU memory. Meanwhile, the state-of-the-art CPU-based algorithms BitEdge, ProbTree, and DistR are enhanced with OpenMP to leverage the multi-core CPUs of our evaluation platform, which can execute 40 threads simultaneously.

Table 3: Methods in Evaluation.
Abbr.             Framework
MonteCarlo (MC)   Monte-Carlo sampling for uncertain graph processing; we report the basic entirety sampling based on this method [47].
BitEdge (BE)      Simultaneous sampling method that processes massive possible worlds at once with compact bit-aware marked edges [69].
ProbTree (PT)     ProbTree sampling method, which partitions graphs and generates a small uncertain subgraph for querying purposes [37].
DistR (DR)        Distributed reliability-aware uncertain graph sampling method (DistR) based on partition sampling [8].

5.1 Comparison with State-of-the-art

Table 4 reports the performance of BPGraph and the state-of-the-art methods for three applications: s-t reliability query, k-nearest neighbors (KNN), and any-pair shortest path. BPGraph runs on a single GPU in these experiments, while the others run on multi-core CPUs.

As Table 4 shows, BPGraph significantly outperforms the other methods in all applications. For the s-t query, BPGraph is on average 39× and 30× faster than the entirety sampling methods, i.e., MC sampling and BitEdge sampling, respectively. Compared to the partition sampling method (ProbTree), BPGraph is 26× faster. For KNN and shortest path, BPGraph presents even higher speedups, e.g., it achieves average speedups of 69× and 43× over MC sampling and ProbTree, respectively, because these two applications have much larger workloads due to recursive traversal and sampling. BPGraph's superior performance comes from three factors: 1) BPGraph traverses the graph only once and samples only useful edges; 2) the GPU offers much higher parallel computing capability than multi-core CPUs; and 3) the NVIDIA V100 GPU offers higher memory bandwidth than the CPU.

To further compare with ProbTree, we evaluate its index building overhead and break down its execution time. On netHept and gnutellaP2P, index building takes 79% and 84% of ProbTree's total execution time, respectively. Since ProbTree partitions raw graphs into fully-connected cliques, the index building overhead comes from both partitioning and the re-computation of reliability information; the simple path identification in BPGraph significantly reduces this overhead.

Moreover, to show the benefits of path sampling, we implement a CPU-based version of BPGraph that does not use GPUs. Figure 7 compares, for one source-target pair query, the performance of BPGraph on GPU, BPGraph on CPU, and BitEdge (BA) and ProbTree (PT), which perform best among the state-of-the-art methods. Even without GPU acceleration, BPGraph on CPU still achieves better performance than ProbTree, which indicates the effectiveness and efficiency of our proposed path sampling method. We can also see that BPGraph on GPU achieves a 6.15-23.5× improvement over BPGraph on CPU.

5.2 Evaluation on Memory Cost

Figure 8 studies the memory consumption of BPGraph, which has the lowest memory overhead among the compared methods. For instance, BPGraph consumes 16GB of memory on the twitter graph, which is 32% of the memory overhead of ProbTree. This is because BPGraph only stores useful edges and paths. Unlike the entirety and partition methods, BPGraph does not need to cache all the possible paths: it only stores the traversal paths of the breadth-first ordered tree through indexing and formats each path as a consecutive array of successive vertex IDs, which significantly reduces the caching space. For kron-logn21, path caching, consisting of the structural and probabilistic data, consumes 4.2GB of memory in total (13.12% of resident memory). For the largest of these graphs, uk-2005, BPGraph requires 30.4GB of GPU memory and therefore fits into a V100, while the other approaches fail to fit. These results also verify our theoretical memory consumption analysis in Section 3.2.

Table 4: End-to-end performance comparison between BPGraph and the state-of-the-art approaches. We report the execution time in seconds on 100 samples (K=100) and mark the speedup of BPGraph over the best competing algorithm in parentheses. The index building time of PT is included for a fair comparison of end-to-end performance.
Graph        source-target query                           k-nearest neighbors                           any-pair shortest path
             MC        BA        PT        BPGraph         MC        BA        PT        BPGraph         MC        BA        PT        BPGraph
netHept      24.3      25.1      13.2      2.4 (5.5)       49.8      45.2      6.4       2.8 (2.3)       35.1      16.8      13.1      1.2 (15.5)
gnutellaP2P  38.2      23.9      11.4      4.3 (2.6)       117.5     74.3      35.9      6.1 (5.9)       45.9      19.6      26.4      2.7 (7.2)
coauthor     368.6     268.2     289.4     10.8 (24.8)     1793.0    1288.6    1031.2    43.1 (23.9)     286.2     238.0     286.0     4.3 (55.3)
kron-logn21  2471.1    1085.4    1635.7    30.6 (35.5)     3303.2    2372.9    5293.5    138.0 (17.2)    2580.5    1294.3    1229.4    20.8 (59.0)
soc-twitter  38816.2   12835.5   9494.2    486.3 (19.5)    43291.6   17416.9   6486.9    1245.2 (5.2)    -         13985.4   9620.4    325.1 (29.6)
uk-2005      -         -         21081.0   607.6 (34.7)    -         -         12113.8   1501.2 (8.51)   -         -         23940.2   894.2 (26.7)

Figure 8: Memory cost comparison between DR, PT and BPGraph. Bars show the memory usage (GB) of Raw Size, DistR, ProbTree, and BPGraph on kron-logn21, soc-twitter, and uk-2005.

Figure 9: Execution breakdown of BPGraph over three stages of path sampling model (path identification, edge and path sampling, result computation). Bars show the runtime breakdown (%) on netHept, gnutellaP2P, coauthor, kron-logn21, soc-twitter, and uk-2005.

5.3 Execution Time Breakdown

Further, we evaluate the breakdown of computation time, as shown in Figure 9. Compared with the breakdown of MC sampling shown in Figure 3, the sampling phase takes a much smaller portion of the entire execution in BPGraph. Since we eliminate all unnecessary edge sampling, the sampling time shrinks, and the portions of the path identification and result computation phases in the overall time increase accordingly.

The path identification phase, which generates the whole path set of the given vertices as shown in Figure 4(b), is a key component of the overall execution time. Table 5 reports its overhead for various numbers of source-target pairs on four graphs. On the uncertain graph soc-twitter, BPGraph completes the traversal of the path identification phase in 8.38-85.20s for 10-300 source-target queries. Path identification takes 20.3-31.9% of the total execution time.

5.4 Evaluation on Accuracy

For all sampling-based uncertain graph processing approaches, the solution accuracy is affected by the number of samples. Here we study the relationship between the number of samples $K$ and accuracy, using the s-t query application on the twitter graph as an example.

We conduct the accuracy study for BPGraph, BitEdge sampling, and ProbTree. We define the accuracy error as the percentage difference between the approximate solution of a sampling method and the exact solution on the uncertain graph. The accuracy error of reliability is reported with respect to MC sampling and is computed as

$$AccuracyError(K) = \frac{1}{100} \sum_{i=1}^{100} \frac{|R(s_i, t_i, K) - R_{exact}(s_i, t_i)|}{R_{exact}(s_i, t_i)},$$

where $R_{exact}(s_i, t_i)$ denotes the reliability from $s_i$ to $t_i$ obtained with the exact solution, and $R(s_i, t_i, K)$ denotes the reliability estimated with $K$ samples.
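
The metric is straightforward to compute once the estimated and exact reliabilities are available. The sketch below generalizes the fixed 100 query pairs to any number of pairs; accuracyError is a hypothetical helper name, and the code assumes every exact reliability is positive so that the ratio is defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean relative error of the estimated reliabilities R(s_i, t_i, K) against
// the exact reliabilities R_exact(s_i, t_i), averaged over all query pairs.
double accuracyError(const std::vector<double>& estimated,
                     const std::vector<double>& exact) {
    assert(!exact.empty() && estimated.size() == exact.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        assert(exact[i] > 0.0);  // the ratio is undefined for zero reliability
        sum += std::fabs(estimated[i] - exact[i]) / exact[i];
    }
    return sum / static_cast<double>(exact.size());
}
```

For two pairs with estimates 0.45 and 0.22 against exact reliabilities 0.5 and 0.2, both relative errors are 0.1, so the accuracy error is 0.1, i.e., 10%.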

We evaluate the accuracy error for $K$ from 100 to 1000, in steps of 100. All three methods have fairly low errors at $K = 1000$. For BitEdge and ProbTree, the accuracy error improves from 10.6% to 1.5% and from 15.73% to 0.97%, respectively, when $K$ changes from 100 to 1000. In contrast, the accuracy error of BPGraph drops from 4.35% to 0.21%. Clearly, BPGraph achieves better accuracy with the same number of samples. The high accuracy of the path sampling model comes from the fact that the model only considers the dependency between vertices on the paths and directly applies the reliabilities to the target.

Note that Table 4 compares the performance of different approaches under the same number of samples. We have shown that BPGraph needs fewer samples than other methods to achieve the same accuracy. As a result, BPGraph would perform even better if all approaches were required to converge to the same accuracy.

Table 5: Performance of generating traversal paths.
Number of ST pairs |Q|   netHept   coauthor   kron-logn21   twitter
10                       0.27s     0.31s      1.38s         8.38s
50                       0.85s     1.23s      3.62s         26.20s
100                      1.58s     3.26s      11.6s         34.87s
300                      3.95s     12.42s     22.8s         85.20s

5.5 Evaluation on Impact of Distance Metrics

Further, to gure out the impact of distance metrics (illustrated

in Section 2.1), we exploit the 100-sample ST query application

over three typical variations of reliability denition, i.e., reachabil-

ity,distance-constraint reachability, and expected shortest distance.

These three metrics are commonly exploited in uncertain graph lit-

erature for distance computation [

26

][

51

][

25

][

49

]. From the results

in Figure 10, we can see that using three metrics, the overall running

time does not change signicantly (<4.6% performance variation).

The reason for this is that the distance calculation consumes less

Algo. w/ Reachability

Algo. w/ Distance-Constraint Reachability

Algo. w/ Expected Shortest Distance

Execution Time (s)

0

10

20

30

Dataset

netHept gnutellaP2P coauthor kron-logn21

Figure 10: Runtime of BPGraph over three dierent variations of s-t

reliability measurements (illustrated in Section 2.1). Execution time

over 100-sample source-to-target queries are reported.

1.0

1.0

1.0

1.7

1.5

1.8

3.1

2.8

3.2

5.5

4.1

5.2

BPGraph-1GPU

BPGraph-2GPU

BPGraph-4GPU

BPGraph-8GPU

Speedup

0

2

4

6

Dataset

soc-twitter uk-2005 com-friendster

Figure 11: Scale-up scalability evaluation of multi-GPUs on the

three large datasets (soc-twitter, uk-2005, com-friendster).

overhead than the random sampling operation, and the computation is always performed on the same path collections. Furthermore, the results of the distance-constraint reachability and expected shortest distance metrics are filtered from the reachability-based results using additive constraints based on all identified possible paths. Thus, the calculation of different distance metrics over paths is trivial compared to the sampling workload.
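To illustrate why the choice of metric adds little work once the samples exist, the following minimal sketch derives all three metrics from one shared list of per-sample shortest s-t distances. It is not BPGraph's GPU implementation: the function name `reliability_metrics`, the input layout (one distance per sample, `None` when t is unreachable), and the choice to average the expected shortest distance over reachable samples only are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def reliability_metrics(
    sample_distances: List[Optional[float]], max_dist: float
) -> Tuple[float, float, float]:
    """Derive three s-t metrics from the same set of samples.

    sample_distances[i] is the shortest s-t distance in sample i,
    or None if t is unreachable from s in that sample (assumed layout).
    """
    if not sample_distances:
        raise ValueError("at least one sample is required")
    n = len(sample_distances)
    reachable = [d for d in sample_distances if d is not None]

    # Reachability: fraction of samples in which t is reachable from s.
    reach = len(reachable) / n
    # Distance-constraint reachability: reachable within max_dist.
    dc_reach = sum(1 for d in reachable if d <= max_dist) / n
    # Expected shortest distance, averaged over reachable samples only
    # (an assumption; other conventions exist).
    exp_dist = sum(reachable) / len(reachable) if reachable else float("inf")
    return reach, dc_reach, exp_dist


if __name__ == "__main__":
    print(reliability_metrics([2, None, 3, 5], max_dist=3))
```

Each metric is a single linear pass over data that the sampling phase has already produced, which is in line with the small runtime variation reported in Figure 10.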

5.6 Scale-Up Scalability over Multiple GPUs

To observe the effect of scaling the uncertain graph processing procedure of our system BPGraph from one GPU to multiple GPUs, we evaluate BPGraph on the large graphs, i.e., soc-twitter, uk-2005 and com-friendster, with up to 8 GPUs and illustrate the speedups in Figure 11. The performance scales well as we add more GPUs. Specifically, multi-GPU execution on soc-twitter achieves a 1.7× speedup from 1 GPU to 2 GPUs, and a 5.5× speedup from 1 GPU to 8 GPUs. Although the performance does not scale linearly with the number of GPUs due to the limited PCI-E bandwidth, BPGraph still achieves good scalability when processing the larger datasets uk-2005 and com-friendster with more GPUs. We observe that when BPGraph processes larger uncertain graphs with more edges, adding more GPUs produces much greater speedup and larger reductions in processing time.

The speedup from adding more GPUs depends heavily on the dependencies among the partitions of the consecutive edge array. BPGraph reduces the communication overhead by using a vertex status array to mark the primary/secondary vertices for value synchronization, which minimizes unnecessary GPU communication traffic. On the uk-2005 dataset, BPGraph also scales well across multiple GPUs: it takes 1498.5s to execute the 100 s-t query tasks using 8 GPUs, a speedup of about 4.1× over using 1 GPU (6207s). For the larger dataset com-friendster, BPGraph again shows strong scalability, achieving 5.2×, 3.2×, and 1.8× speedups using 8 GPUs, 4 GPUs, and 2 GPUs over a single GPU, respectively. This is because peer-to-peer communication and cooperative device synchronization give us the benefit of good scalability on large uncertain graph datasets on multi-GPU servers.
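As a small worked example of the arithmetic behind these numbers, the sketch below recomputes the uk-2005 speedup from the two reported runtimes and derives the parallel efficiency (speedup divided by GPU count) for each configuration. The per-dataset speedup values are read from the bar labels of Figure 11; the assignment of labels to datasets and the helper names are assumptions made for illustration.

```python
# Speedups relative to one GPU, as read from the bar labels of Figure 11.
# Assumption: labels are grouped by GPU count, with datasets in the order
# soc-twitter, uk-2005, com-friendster.
SPEEDUPS = {
    "soc-twitter": {1: 1.0, 2: 1.7, 4: 3.1, 8: 5.5},
    "uk-2005": {1: 1.0, 2: 1.5, 4: 2.8, 8: 4.1},
    "com-friendster": {1: 1.0, 2: 1.8, 4: 3.2, 8: 5.2},
}


def speedup(t_single: float, t_multi: float) -> float:
    """Speedup of a multi-GPU run over the single-GPU run."""
    return t_single / t_multi


def parallel_efficiency(speedup_value: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that is actually achieved."""
    return speedup_value / num_gpus


def efficiency_table(speedups):
    """Parallel efficiency for every dataset and GPU count."""
    return {
        dataset: {n: parallel_efficiency(s, n) for n, s in by_gpu.items()}
        for dataset, by_gpu in speedups.items()
    }


if __name__ == "__main__":
    # uk-2005: 6207s on 1 GPU versus 1498.5s on 8 GPUs (from the text).
    print(round(speedup(6207.0, 1498.5), 2))
    for dataset, row in efficiency_table(SPEEDUPS).items():
        print(dataset, {n: round(e, 2) for n, e in row.items()})
```

Under these readings, the 8-GPU efficiency is between roughly 0.5 and 0.7 on all three datasets, which matches the observation that scaling is good but sub-linear.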

6 Related Work

We review the existing reliability query work for uncertain graphs

and also discuss several GPU-accelerated system designs. Our pro-

posed system

BPGraph

advances the state-of-the-art in the parallel

design and implementation of uncertain graph processing.

Uncertain Graph Processing. Recently, several efficient processing approaches have been proposed that use either the entirety sampling [13, 26, 48] or the partition sampling methodology [8, 17, 37, 62]. The entirety sampling techniques have been widely studied for queries, e.g., reachability [47], k-nearest neighbors [49, 63, 67], and (k, η)-core decomposition [36]. Although many optimizations, e.g., the Monte Carlo sampling method [26, 48], the recursive sampling method [35], and the representative selection method [47], have been developed, a general framework that efficiently supports processing large-scale uncertain graph data is still lacking. On the other hand, the partition sampling methods [8, 37] utilize compact and partitioned data structures, which generate a small uncertain subgraph for querying purposes and return reliability results with fewer samples. However, these state-of-the-art methods still suffer from sub-optimal performance because they sample every edge, which significantly degrades overall performance.

GPU-based Graph Processing. High-performance and scalable GPU-accelerated graph algorithm optimization [6, 9, 20, 22, 40, 60] and system design [14, 18, 45, 57, 58] have been hot topics that exploit the powerful computation ability of accelerators. Among these GPU-based systems, Medusa [65] proposes to simplify the programming API for GPU-based graph algorithms. CuSha [28] proposes G-Shard to address the inefficiency of warp execution on CSR-formatted graphs and concatenated windows to address the non-coalesced memory access problem. WS-VR [27] provides a warp segmentation method to enhance GPU device utilization when dealing with irregularly structured graphs, and also scales the system to multiple GPUs via vertex refinement, which reduces unnecessary data transfer between GPUs over the PCIe bus. Gunrock [59] proposes a new vertex-centric programming abstraction built upon parallel operations on a vertex or edge frontier, and it also supports scaling to multiple GPUs by optimizing the traversal direction and GPU memory allocation. Groute [4] proposes an asynchronous processing model for scheduling computation and communication over multiple devices on a single node. Groute captures irregular parallelism by pushing computation onto each individual vertex and improves the communication among GPUs. GraphReduce [54] is the first system to support out-of-core graph processing on the GPUs of a single node; it proposes streaming shard partitioning and a hybrid vertex-centric and edge-centric parallelism model to achieve iterative large graph processing on GPUs.

Distinct from the above system designs, BPGraph is based on the design paradigm of uncertain graph processing. It not only significantly reduces the main bottleneck of state-of-the-art uncertain graph processing frameworks, i.e., massive sampling operations, by providing a novel path sampling method, but also achieves high performance when scaling across GPU accelerators by exploiting several novel strategies to handle SIMT-aware parallel path generation, traversal, sampling combination and synchronization.

ICS ’22, June 28–30, 2022, Virtual Event, USA Heng, Lingda, Hang, Donglin, Rui, et al.

7 Conclusion and Future Work

In this work, we propose BPGraph, a novel multi-accelerator based framework for efficiently processing uncertain graph analytics. It tackles the challenges we have observed in the state-of-the-art techniques: low computation efficiency, large memory overhead, lack of support for modern accelerators with massive parallelism, and the difficulty for users of writing highly efficient uncertain graph analytics. At its core, BPGraph integrates a newly proposed runtime path sampling technique that identifies edges that are unnecessary to sample for a given problem, resulting in a drastic reduction in the overall computation. BPGraph provides general support for users to write a wide range of uncertain graph applications without dealing with the low-level complexity. Results on real-world uncertain graph applications show that BPGraph can achieve up to 43× (26× on average) speedup over the state-of-the-art frameworks, and scales well with an increasing number of GPUs.

Acknowledgment

We would like to thank our shepherd and all anonymous reviewers.

This work is supported by University of Sydney (USYD) faculty

startup funding, SOAR faculty fellowship and Australian Research

Council (ARC) DP210101984. It is also supported by the National

Science Foundation of China (NSFC) Grant No. 62002350, Key-Area

Research and Development Program of Guangdong Province Grant

No. 2019B010154004, and Tencent Youtu Lab. Hang Liu was in part

supported by the NSF CRII Award No. 2000722 and CAREER Award

No. 2046102.

References

[1] http://law.di.unimi.it/webdata/. LAW web dataset. (http://law.di.unimi.it/webdata/).

[2]

Eytan Adar and Christopher Re. 2007. Managing uncertainty in social networks.

IEEE Data Eng. Bull. 30, 2 (2007), 15–22.

[3]

Michael O Ball. 1986. Computational complexity of network reliability analysis:

An overview. IEEE Transactions on Reliability 35, 3 (1986), 230–239.

[4]

Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2017. Groute:

An asynchronous multi-GPU programming model for irregular computations. In

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of

Parallel Programming. ACM, 235–248.

[5]

Paolo Boldi, Francesco Bonchi, Aris Gionis, and Tamir Tassa. 2012. Injecting

uncertainty in graphs for identity obfuscation. arXiv preprint arXiv:1208.4145

(2012).

[6] Federico Busato and Nicola Bombieri. 2015. BFS-4K: an efficient implementation of BFS for Kepler GPU architectures. IEEE Transactions on Parallel and Distributed Systems 26, 7 (2015), 1826–1838.

[7]

Yurong Cheng, Ye Yuan, Lei Chen, and Guoren Wang. 2015. The reachability query

over distributed uncertain graphs. In 2015 IEEE 35th International Conference on

Distributed Computing Systems. IEEE, 786–787.

[8]

Yurong Cheng, Ye Yuan, Lei Chen, Guoren Wang, Christophe Giraud-Carrier, and

Yongjiao Sun. 2016. Distr: A distributed method for the reachability query over

large uncertain graphs. IEEE Transactions on Parallel and Distributed Systems 27,

11 (2016), 3172–3185.

[9] Hristo Djidjev, Sunil Thulasidasan, Guillaume Chapuis, Rumen Andonov, and Dominique Lavenier. 2014. Efficient multi-GPU computation of all-pairs shortest paths. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 360–369.

[10] Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri. 2018. Provable and Practical Approximations for the Degree Distribution Using Sublinear Graph Samples. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 449–458. https://doi.org/10.1145/3178876.3186111

[11]

Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri. 2015. Approximately Count-

ing Triangles in Sublinear Time. In 2015 IEEE 56th Annual Symposium on Founda-

tions of Computer Science. 614–633. https://doi.org/10.1109/FOCS.2015.44

[12]

Talya Eden, Dana Ron, and C. Seshadhri. 2020. Faster Sublinear Approximation of

the Number of k-Cliques in Low-Arboricity Graphs. In Proceedings of the Thirty-

First Annual ACM-SIAM Symposium on Discrete Algorithms (Salt Lake City, Utah)

(SODA’20). Society for Industrial and Applied Mathematics, USA, 1467–1478.

[13]

George S Fishman. 1986. A comparison of four Monte Carlo methods for esti-

mating the probability of st connectedness. IEEE Transactions on reliability 35, 2

(1986), 145–155.

[14]

Zhisong Fu, Michael Personick, and Bryan Thompson. 2014. Mapgraph: A high

level API for fast development of high performance graph analytics on GPUs. In

Proceedings of Workshop on GRAph Data management Experiences and Systems.

ACM, 1–6.

[15] Anil Gaihre, Zhenlin Wu, Fan Yao, and Hang Liu. 2019. XBFS: eXploring runtime optimizations for breadth-first search on GPUs. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 121–131.

[16]

Anil Gaihre, Da Zheng, Scott Weitze, Lingda Li, Shuaiwen Leon Song, Caiwen

Ding, Xiaoye S Li, and Hang Liu. 2021. Dr. Top-k: delegate-centric Top-k on GPUs.

In Proceedings of the International Conference for High Performance Computing,

Networking, Storage and Analysis. 1–14.

[17]

Xiulian Gao and Yuan Gao. 2013. Connectedness index of uncertain graph.

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 21,

01 (2013), 127–137.

[18] Tong Geng, Tianqi Wang, Chunshu Wu, Chen Yang, Shuaiwen Leon Song, Ang Li, and Martin Herbordt. 2019. LP-BNN: Ultra-low-latency BNN inference with layer parallelism. In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Vol. 2160. IEEE, 9–16.

[19] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. Powergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 17–30.

[20]

Pawan Harish and PJ Narayanan. 2007. Accelerating large graph algorithms on

the GPU using CUDA. In International Conference on High-Performance Comput-

ing. Springer, 197–208.

[21]

Mark Harris and Kyrylo Perelygin. 2017. Cooperative groups: Flexible CUDA

thread programming. NVIDIA Developer Blog (2017).

[22]

Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. 2011.

Accelerating CUDA graph algorithms at maximum warp. In ACM SIGPLAN

Notices, Vol. 46. ACM, 267–276.

[23] Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. 2011. Efficient parallel graph exploration on multi-core CPU and GPU. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on. IEEE, 78–88.

[24] Ming Hua and Jian Pei. 2010. Probabilistic path queries in road networks: traffic uncertainty aware path selection. In Proceedings of the 13th International Conference on Extending Database Technology. 347–358.

[25] Ruoming Jin, Lin Liu, Bolin Ding, and Haixun Wang. 2011. Distance-constraint

reachability computation in uncertain graphs. Proceedings of the VLDB Endow-

ment 4, 9 (2011), 551–562.

[26]

Arijit Khan and Lei Chen. 2015. On uncertain graphs modeling and queries.

Proceedings of the VLDB Endowment 8, 12 (2015), 2042–2043.

[27] Farzad Khorasani, Rajiv Gupta, and Laxmi N Bhuyan. 2015. Scalable SIMD-efficient graph processing on GPUs. In 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 39–50.

[28]

Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N Bhuyan. 2014. CuSha:

vertex-centric graph processing on GPUs. In Proceedings of the 23rd international

symposium on High-performance parallel and distributed computing. ACM, 239–

252.

[29]

Hyeongsik Kim, Abhisha Bhattacharyya, and Kemafor Anyanwu. 2019. Semantic

Query Transformations for Increased Parallelization in Distributed Knowledge

Graph Query Processing. In Proceedings of the International Conference for High

Performance Computing, Networking, Storage and Analysis (Denver, Colorado)

(SC ’19). Association for Computing Machinery, New York, NY, USA, Article 4,

14 pages. https://doi.org/10.1145/3295500.3356212

[30]

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is

Twitter, a social network or a news media?. In WWW ’10: Proceedings of the 19th

international conference on World wide web (Raleigh, North Carolina, USA). ACM,

New York, NY, USA, 591–600.

[31]

Dominique LaSalle and George Karypis. 2013. Multi-threaded graph partitioning.

In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

IEEE, 225–236.

[32]

Ang Li, Weifeng Liu, Linnan Wang, Kevin Barker, and Shuaiwen Leon Song. 2018.

Warp-consolidation: A novel execution model for gpus. In Proceedings of the 2018

International Conference on Supercomputing. 53–64.

[33]

Ang Li, Shuaiwen Leon Song, Eric Brugel, Akash Kumar, Daniel Chavarria-

Miranda, and Henk Corporaal. 2016. X: A comprehensive analytic model for

parallel machines. In 2016 IEEE International Parallel and Distributed Processing

Symposium (IPDPS). IEEE, 242–252.

[34]

Ang Li, Shuaiwen Leon Song, Akash Kumar, Eddy Z Zhang, Daniel Chavarría-

Miranda, and Henk Corporaal. 2016. Critical points based register-concurrency

autotuning for GPUs. In 2016 Design, Automation & Test in Europe Conference &

Exhibition (DATE). IEEE, 1273–1278.

[35] Rong-Hua Li, Jeffrey Xu Yu, Rui Mao, and Tan Jin. 2015. Recursive stratified sampling: A new framework for query evaluation on uncertain graphs. IEEE Transactions on Knowledge and Data Engineering 28, 2 (2015), 468–482.

[36]

Fragkiskos D Malliaros, Christos Giatsidis, Apostolos N Papadopoulos, and

Michalis Vazirgiannis. 2020. The core decomposition of networks: Theory, algo-

rithms and applications. The VLDB Journal 29, 1 (2020), 61–92.

[37]

Silviu Maniu, Reynold Cheng, and Pierre Senellart. 2017. An Indexing Framework

for Queries on Probabilistic Graphs. ACM Transactions on Database Systems 42, 2

(2017).

[38] Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R Alam, Thomas C Schulthess, and Torsten Hoefler. 2016. A PCIe congestion-aware performance model for densely populated accelerator servers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 63.

[39]

Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a

vertex: a survey of vertex-centric frameworks for large-scale distributed graph

processing. ACM Computing Surveys (CSUR) 48, 2 (2015), 25.

[40] Duane Merrill, Michael Garland, and Andrew Grimshaw. 2012. Scalable GPU graph traversal. In ACM SIGPLAN Notices, Vol. 47. ACM, 117–128.

[41]

Duane Merrill, Michael Garland, and Andrew Grimshaw. 2015. High-performance

and scalable GPU graph traversal. ACM Transactions on Parallel Computing 1, 2

(2015), 14.

[42]

Marco Minutoli, Prathyush Sambaturu, Mahantesh Halappanavar, Antonino

Tumeo, Ananth Kalyanaraman, and Anil Vullikanti. 2020. Preempt: Scalable

Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems.

In Proceedings of the International Conference for High Performance Computing,

Networking, Storage and Analysis (Atlanta, Georgia) (SC ’20). IEEE Press, Article

55, 15 pages.

[43]

Mark EJ Newman. 2003. The structure and function of complex networks. SIAM

review 45, 2 (2003), 167–256.

[44]

Evdokia Nikolova, Matthew Brand, and David R Karger. 2006. Optimal Route

Planning under Uncertainty.. In Icaps, Vol. 6. 131–141.

[45]

Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D Owens. 2015.

Multi-GPU graph analytics. arXiv preprint arXiv:1504.04804 (2015).

[46]

Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S Li, and Hang Liu. 2020. C-

SAW: A framework for graph sampling and random walk on GPUs. In SC20:

International Conference for High Performance Computing, Networking, Storage

and Analysis. IEEE, 1–15.

[47] Panos Parchas, Francesco Gullo, Dimitris Papadias, and Francesco Bonchi. 2014. The pursuit of a good possible world: extracting representative instances of uncertain graphs. In Proceedings of the 2014 ACM SIGMOD international conference on management of data. 967–978.

[48]

Panos Parchas, Francesco Gullo, Dimitris Papadias, and Francesco Bonchi. 2015.

Uncertain graph processing through representative instances. ACM Transactions

on Database Systems (TODS) 40, 3 (2015), 1–39.

[49]

Michalis Potamias, Francesco Bonchi, Aristides Gionis, and George Kollios. 2010.

K-nearest neighbors in uncertain graphs. Proceedings of the VLDB Endowment 3,

1-2 (2010), 997–1008.

[50] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI. http://networkrepository.com

[51]

Arkaprava Saha, Ruben Brokkelkamp, Yllka Velaj, Arijit Khan, and Francesco

Bonchi. 2021. Shortest paths and centrality in uncertain networks. Proceedings

of the VLDB Endowment 14, 7 (2021), 1188–1201.

[52]

Stephen B Seidman. 1983. Network structure and minimum degree. Social

networks 5, 3 (1983), 269–287.

[53] Dipanjan Sengupta and Shuaiwen Leon Song. 2017. EvoGraph: On-the-Fly Efficient Mining of Evolving Graphs on GPU. In High Performance Computing. Springer International Publishing, Cham, 97–119.

[54]

Dipanjan Sengupta, Shuaiwen Leon Song, Kapil Agarwal, and Karsten Schwan.

2015. GraphReduce: processing large-scale graphs on accelerator-based systems.

In Proceedings of the International Conference for High Performance Computing,

Networking, Storage and Analysis. ACM, 28.

[55] Philip Taffet and John Mellor-Crummey. 2019. Understanding Congestion in High Performance Interconnection Networks Using Sampling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 43, 24 pages. https://doi.org/10.1145/3295500.3356168

[56] Jingweijia Tan, Shuaiwen Leon Song, Kaige Yan, Xin Fu, Andres Marquez, and Darren Kerbyson. 2016. Combating the reliability challenge of GPU register file at low supply voltage. In 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 3–15.

[57]

Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and

John McPherson. 2013. From think like a vertex to think like a graph. Proceedings

of the VLDB Endowment 7, 3 (2013), 193–204.

[58]

Ha-Nguyen Tran, Jung-jae Kim, and Bingsheng He. 2015. Fast subgraph matching

on large graphs using graphics processors. In International Conference on Database

Systems for Advanced Applications. Springer, 299–315.

[59]

Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riel, and

John D Owens. 2016. Gunrock: A high-performance graph processing library on

the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and

Practice of Parallel Programming. ACM, 11.

[60]

Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, and John D Owens. 2015.

Performance characterization of high-level programming models for GPU graph

analytics. In Workload Characterization (IISWC), 2015 IEEE International Sympo-

sium on. IEEE, 66–75.

[61]

Reynold S Xin, Joseph E Gonzalez, Michael J Franklin, and Ion Stoica. 2013.

Graphx: A resilient distributed graph system on spark. In First international

workshop on graph data management experiences and systems. 1–6.

[62]

Bohua Yang, Dong Wen, Lu Qin, Ying Zhang, Lijun Chang, and Rong-Hua Li.

2019. Index-Based Optimal Algorithm for Computing K-Cores in Large Uncertain

Graphs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE).

IEEE, 64–75.

[63] Ye Yuan, Lei Chen, and Guoren Wang. 2010. Efficiently answering probability threshold-based shortest path queries over uncertain graphs. In International Conference on Database Systems for Advanced Applications. Springer, 155–170.

[64] Yu Zhang, Xiaofei Liao, Hai Jin, Bingsheng He, Haikun Liu, and Lin Gu. 2019. DiGraph: An Efficient Path-Based Iterative Directed Graph Processing System on Multiple GPUs. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS ’19). Association for Computing Machinery, New York, NY, USA, 601–614.

[65] Jianlong Zhong and Bingsheng He. 2014. Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1543–1552.

[66]

Jian Zhou, Fan Yang, and Ke Wang. 2014. An inverse shortest path problem on

an uncertain graph. Journal of Networks 9, 9 (2014), 2353.

[67] Rong Zhu, Zhaonian Zou, and Jianzhong Li. 2017. Towards efficient top-k reliability search on uncertain graphs. Knowledge and Information Systems 50, 3 (2017), 723–750.

[68]

Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph: Large-Scale

Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning..

In USENIX Annual Technical Conference. 375–386.

[69]

Zhaonian Zou, Faming Li, Jianzhong Li, and Yingshu Li. 2017. Scalable Processing

of Massive Uncertain Graph Data: A Simultaneous Processing Approach. In IEEE

International Conference on Data Engineering.

[70]

Zhaonian Zou, Jianzhong Li, Hong Gao, and Shuo Zhang. 2010. Finding top-k

maximal cliques in an uncertain graph. In 2010 IEEE 26th International Conference

on Data Engineering (ICDE 2010). IEEE, 649–652.


A Accuracy Analysis of Path-Sampling Methodology

This section provides a theoretical analysis of the accuracy of uncertain graph sampling methods.

Consider an uncertain graph $\mathcal{G} = (V, E, P)$ with $|V| = n$ and $|E| = m$, where $V$ and $E$ denote the sets of nodes and edges, respectively. $P$ is a set of probabilities representing the likelihoods of the existence of edges, i.e., $p_e$ denotes the existence probability of edge $e \in E$. The existence of each edge is independent of the others. Let $G = (V_G, E_G)$ be a possible graph obtained by sampling each edge $e$ in $\mathcal{G}$ according to its probability $p_e$. Obviously, $V_G = V$, $E_G \subseteq E$, and the probability of $G$ is given by

$$\Pr[G] = \prod_{e \in E_G} p_e \prod_{e \in E \setminus E_G} (1 - p_e)$$
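The possible-world definition above can be made concrete with a minimal sketch that draws one possible graph and evaluates its probability under the edge-independence assumption. The dictionary layout `edge_probs` (edge tuple mapped to its probability) and the function names are assumptions made for illustration.

```python
import random
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def sample_possible_world(edge_probs: Dict[Edge, float], rng: random.Random) -> Set[Edge]:
    """Keep each edge e independently with probability p_e."""
    return {e for e, p in edge_probs.items() if rng.random() < p}


def world_probability(edge_probs: Dict[Edge, float], world: Set[Edge]) -> float:
    """Pr[G]: product of p_e over kept edges and (1 - p_e) over dropped edges."""
    prob = 1.0
    for e, p in edge_probs.items():
        prob *= p if e in world else (1.0 - p)
    return prob


if __name__ == "__main__":
    probs = {("a", "b"): 0.5, ("b", "c"): 0.25}
    world = sample_possible_world(probs, random.Random(0))
    print(sorted(world), world_probability(probs, world))
```

Summing `world_probability` over all $2^m$ subsets of $E$ gives 1, which is why exact evaluation over every possible world becomes impractical as $m$ grows.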

Taking into account that the accuracy of the results depends on the number of sampled worlds, we further theoretically analyze the accuracy achieved by recent sampling methodologies: the entirety, partition, and our path sampling methodologies. There are two parts that cause accuracy loss: (1) the choice of sampling method; and (2) the choice of decomposition and preprocessing methods.

Referring to (1) the choice of sampling method, since sampling is a technique for approximate query processing, it loses information while accelerating the querying process, and less sampling leads to larger errors in the reliability results. To achieve a theoretical estimation accuracy, the Chernoff bound is widely applied to determine the number of possible worlds in the uncertain graph literature. Given an uncertain graph $\mathcal{G}$, a distance function $d$, and a pair of nodes $s$ and $t$, the accuracy of estimating the value of $d(s, t)$ by MC sampling can be well guaranteed. To achieve an error rate of $\epsilon > 0$ with a failure probability of $\delta > 0$, the number of samples needed is:

$$g(\mathcal{G}, s, t, \epsilon, \delta) = \max\left\{ \frac{3}{\epsilon^2 d'(s, t)}, \frac{\phi(\mathcal{G})^2}{2\epsilon^2} \right\} \cdot \ln\frac{2}{\delta},$$

where $d'(s, t)$ is the estimated value of $d(s, t)$, and the function $\phi(\mathcal{G}) = \max_{(s,t) \in V \times V} d(s, t)$ is the diameter of $\mathcal{G}$.

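In practice, one usually focuses on finding the pairs with a given threshold $\rho$. Note that in general $\rho$ is not too small, and thus we have $\frac{\phi(\mathcal{G})^2}{2\epsilon^2} \geq \frac{3}{\epsilon^2 \rho}$. Therefore, the number of needed samples is computed as:

$$g(\mathcal{G}, \epsilon, \delta) = \frac{\phi(\mathcal{G})^2}{2\epsilon^2} \ln\frac{2}{\delta}.$$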
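As a worked example of the sample-count bound, the following sketch evaluates it numerically. The function name `required_samples` and the rounding up to an integer are assumptions made for illustration; when an estimate `d_est` of $d(s, t)$ is supplied, the full two-term maximum is used, otherwise only the simplified diameter term.

```python
import math
from typing import Optional


def required_samples(
    eps: float, delta: float, diameter: float, d_est: Optional[float] = None
) -> int:
    """Number of sampled worlds for error eps and failure probability delta.

    diameter plays the role of phi(G); d_est is the estimated d(s, t).
    """
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    # Simplified bound: phi(G)^2 / (2 * eps^2).
    bound = diameter ** 2 / (2 * eps ** 2)
    if d_est is not None:
        # Full bound: also consider 3 / (eps^2 * d'(s, t)).
        bound = max(3 / (eps ** 2 * d_est), bound)
    # Round up, since the number of samples must be an integer.
    return math.ceil(bound * math.log(2 / delta))


if __name__ == "__main__":
    print(required_samples(eps=0.1, delta=0.05, diameter=2.0))
```

Because the bound grows with $1/\epsilon^2$, halving the target error quadruples the number of required samples, while tightening $\delta$ only adds a logarithmic factor.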

Referring to (2) the choice of decomposition and partition methods, the generated uncertain subgraphs can be inaccurate due to information loss during graph decomposition (e.g., deleting vertices or edges), leading to errors in the reliability results.

Entirety Sampling. Entirety sampling estimates the reliability results from the full uncertain graph, reporting source-to-target reachability as 1 or 0 in each sample. Because complete possible worlds are generated without deleting any structures, the entirety sampling methodology ensures lossless reliability results.
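A minimal CPU sketch of entirety (Monte Carlo) sampling for s-t reachability is given below: every edge of the graph is sampled in every world, and a breadth-first search reports 1 or 0 per sample. It assumes directed edges stored as a dictionary from edge tuple to probability; the function name `mc_reachability` is hypothetical and this is not the code of any of the cited systems.

```python
import random
from collections import deque
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def mc_reachability(
    edge_probs: Dict[Edge, float], s: Hashable, t: Hashable,
    num_samples: int, seed: int = 0,
) -> float:
    """Estimate Pr[t reachable from s] by sampling whole possible worlds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # Sample every edge of the graph to build one possible world.
        adj = {}
        for (u, v), p in edge_probs.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        # BFS from s; the sample counts as 1 if t is reached, else 0.
        seen = {s}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if u == t:
                break
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        hits += t in seen
    return hits / num_samples


if __name__ == "__main__":
    probs = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "t"): 0.3}
    print(mc_reachability(probs, "s", "t", num_samples=10000))
```

Note that the inner loop touches all $m$ edges in every sample, including edges that can never lie on an s-t path; this per-sample cost is the overhead that the path sampling methodology aims to avoid.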

Partition Sampling. Partition sampling methodologies use decomposition methods that lose information, such as tree index structures with $width > 2$ (illustrated in Section 2.2). The uncertain subgraph distilled from index searching becomes inaccurate, resulting in errors in the reliability results. When $width = 2$, the tree is a binary tree that ensures full connection between triplets without cutting any edges [37], resulting in lossless results. In general, partition sampling is therefore lossy in accuracy, although the loss can be bounded by building a fully connected index tree.

Path Sampling. Because the generated structures are non-redundant, path sampling is efficient in achieving reliable results. First, path sampling improves sampling efficiency by removing the large overhead of generating massive numbers of possible worlds. As a result, the sampling error is reduced, alleviating the impact of sampling an insufficient number of possible worlds. Second, path sampling is faster than entirety and partition sampling for executing uncertain graph applications because it eliminates redundant sampling overhead. Table 6 depicts the comparison of recent work on sampling methods.

Table 6: Sampling Method Comparison.

Sampling Method   Space       Time     Query Accuracy
Monte Carlo       Quadratic   Linear   Lossless
BitEdge           Quadratic   Linear   Lossless
ProbTree          Quadratic   Linear   Lossy (with bound)
DistR             Linear      Linear   Lossy (with bound)
BPGraph           Linear      Linear   Lossless
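The path-sampling idea discussed in this appendix can be illustrated with a simplified CPU sketch: enumerate the s-t paths once, then sample only the edges that lie on at least one of those paths. This is not BPGraph's actual algorithm, which identifies and filters paths incrementally at runtime on GPUs; the exhaustive path enumeration, the `max_hops` cutoff, and all function names here are assumptions made for illustration.

```python
import random
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def simple_paths(adj, s, t, max_hops: int) -> List[list]:
    """Enumerate simple s-t paths with at most max_hops edges (iterative DFS)."""
    paths, stack = [], [(s, [s])]
    while stack:
        node, path = stack.pop()
        if node == t:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for nxt in adj.get(node, []):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths


def path_sampling_reachability(
    edge_probs: Dict[Edge, float], s, t, num_samples: int,
    max_hops: int = 8, seed: int = 0,
) -> Tuple[float, int]:
    """Estimate s-t reachability while sampling only edges on s-t paths."""
    adj = {}
    for (u, v) in edge_probs:
        adj.setdefault(u, []).append(v)
    path_edges = [list(zip(p, p[1:])) for p in simple_paths(adj, s, t, max_hops)]
    # Edges on no s-t path cannot affect reachability, so they are skipped.
    relevant = sorted({e for edges in path_edges for e in edges}, key=repr)
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        present = {e for e in relevant if rng.random() < edge_probs[e]}
        # t is reachable iff every edge of at least one path is present.
        if any(all(e in present for e in edges) for edges in path_edges):
            hits += 1
    return hits / num_samples, len(relevant)


if __name__ == "__main__":
    probs = {("s", "a"): 0.9, ("a", "t"): 0.8, ("a", "x"): 0.5, ("x", "y"): 0.5}
    print(path_sampling_reachability(probs, "s", "t", num_samples=10000))
```

In this toy graph only two of the four edges are ever sampled. When `max_hops` is at least $n - 1$, all simple paths are covered and the estimate targets the same quantity as whole-world sampling; a smaller cutoff trades accuracy for enumeration cost.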