
OGAPI : Oblivious Graph Processing in Multicores

Masab Ahmad, Omer Khan
University of Connecticut, Storrs, CT, USA
{masab.ahmad, khan}
Abstract—Graph processing has become an important problem
for ultra-efficient embedded computing. However, malicious entities
snooping on graph access patterns, and on the information they leak,
can violate privacy. Examples of such exploitation include leakage
of important graph vertices and of crucial algorithmic phases,
which can lead to privacy setbacks in various applications. Prior
works focus on hardware schemes to mitigate the associated memory
leakage patterns, which come with high overheads in the
context of irregular graph workloads. In this work, we present
algorithm-level mechanisms to minimize or completely eliminate
graph access leakage across all possible channels. We employ
mechanisms centered around algorithmic redundancy to hide
graph access patterns, and we show how an effective parallelization
strategy can recover performance across these graph
workloads. Our approach shows a performance overhead of about 2%
relative to native execution of the shortest path graph workload.

Graph algorithms are ubiquitous in today’s world [8].
With the advent of parallel processors such as multicores,
exploitable parallelism allows these algorithms to be deployed
in many embedded applications. Most algorithms scale to
high thread counts, but can exhibit weak scalability depending
on how their data is accessed [1]. Graph
algorithms, such as those that compute shortest paths,
access data related to certain graph vertices more than data
related to other vertices [2]. This variation in accesses
creates data access patterns within the memory subsystem,
where certain vertices are accessed more than others.
Data access in graphs can be specified as vertex access,
which pertains to graph access for a given algorithm. Prior
works have already shown that patterns in data accesses are
exploitable, the idea being that an algorithm leaks more information
when working on crucial data. Similarly, in the case
of graph algorithms, vertex access patterns leak information
about important vertices [3], as shown in Figure 1. Taking the
example of an autonomous unmanned aerial drone that has
path planning as a primary objective, graph access patterns can
leak information pertaining to shortest paths. Malicious entities
snooping on the hardware that runs these applications
can infer where these drones are heading, and can thus
make them prone to sabotage. With many graph frameworks in use,
such as Galois [8], mechanisms are required that reduce
information leakage from the execution of such algorithms.
Fig. 1. How important vertices are accessed more in graph algorithms. (Plot of vertex accesses against vertex ID; the labeled vertices on the shortest path receive more accesses.)

Prior work on trusted execution environments, such as
Intel’s Software Guard Extensions (SGX) platform [6], has
generally been completely hardware oriented, using ISA extensions
to isolate program execution. However, these do not
address the problem of side channels, which can occur anywhere
in the architecture from interference on shared hardware
channels [4]. A number of prior works apply oblivious data
access schemes to algorithms using redundant data accesses,
and prove that they do indeed eliminate leakage [9]. Using
program redundancy, vertex access patterns can be made
constant, so a malicious entity viewing channel leakage cannot
distinguish between vertices, and thus cannot acquire any
meaningful information [3]. However, these works only show
complexity overheads for sequential versions, and do not
provide any performance analysis or discuss its implications. Therefore,
no prior work analyzes graph workloads in the context of
vertex leakage in a parallel multicore setting, a setting where
graph algorithms are ubiquitously applied today [8] [1].
Performance overheads behave differently in multicores
and other parallel paradigms, where the parallelization strategy
induces unpredictable synchronization and memory/data access
overheads. However, these paradigms also present an opportunity to
hide the performance overheads associated with redundant data
accesses in oblivious program execution. We present our
analysis and leakage elimination scheme in a parallel setting,
and show that a software approach is much simpler and incurs
minimal performance overheads. Moreover, our approach hides the
redundant work using an efficient parallelization strategy.
Shared hardware resources leak information in the form of
bits to a snooping adversary. The theory behind this leakage
is explained in [4], where the authors use randomization and
redundancy over time to bound bit leakage. We take the
example of the shortest path workload (SSSP), and show how
adding redundant work removes information leakage from
vertex/data accesses. With an insecure baseline, only those
vertices are accessed whose distances are minimal, and this
results in more writes to cache lines where the requested data
is located. Hence the accesses to such vertices can be
differentiated from accesses to other vertices, as shown in Fig. 2. In the
oblivious case, we add a uniform number of writes for each
vertex computation, resulting in a combination of the actual
algorithmic work and dummy computations. This causes
an adversary to see the same number of accesses for all vertex
computations, and thus it cannot infer any private information.

Fig. 2. Eliminating vertex leakage via algorithmic redundancy. (Plot of vertex accesses against vertex ID, with oblivious data access.)
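
The contrast between the insecure baseline and the oblivious case can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy graph, the fixed-round (Bellman-Ford style) schedule, and the use of weight-0 self-loops as dummy edges are all assumptions made for illustration. The sketch only equalizes the number of writes issued per vertex computation; it makes no claim about constant-time behavior of the Python interpreter itself.

```python
import math

# Hypothetical toy graph: vertex -> list of (neighbor, weight) pairs.
EXAMPLE_GRAPH = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}

def baseline_sssp(adj, source):
    """Insecure baseline: a write happens only when a distance improves."""
    dist = {u: math.inf for u in adj}
    dist[source] = 0
    writes = {u: 0 for u in adj}  # writes issued by each vertex computation
    for _ in range(len(adj) - 1):
        for u in adj:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    writes[u] += 1
    return dist, writes

def pad_adjacency(adj):
    """Pad every adjacency list to the maximum degree with dummy self-loops."""
    max_deg = max(len(edges) for edges in adj.values())
    padded = {}
    for u, edges in adj.items():
        # A weight-0 self-loop can never improve dist[u], so it is pure dummy work.
        padded[u] = list(edges) + [(u, 0)] * (max_deg - len(edges))
    return padded, max_deg

def oblivious_sssp(adj, source):
    """Every vertex relaxes max_deg edge slots and writes once per slot."""
    padded, max_deg = pad_adjacency(adj)
    dist = {u: math.inf for u in adj}
    dist[source] = 0
    writes = {u: 0 for u in adj}
    for _ in range(len(adj) - 1):  # fixed number of rounds, independent of the data
        for u in padded:
            for v, w in padded[u]:
                candidate = dist[u] + w
                # Unconditional write: either the improved value or the old one.
                dist[v] = candidate if candidate < dist[v] else dist[v]
                writes[u] += 1
    return dist, writes
```

In this sketch both variants compute the same distances, but only the oblivious variant issues an identical number of writes for every vertex, which is the property the text relies on.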
Once a user has chosen how much redundancy is required,
the oblivious graph algorithm can be parallelized efficiently
to hide its associated latency overheads. In a parallel scheme,
each thread performs some work on a chunk of vertices. With
leakage control, each thread is expected not to leak any
bits over time to an adversary across all possible channels.
Since the extra work is mainly compute and memory
access, parallelism can efficiently hide the additional
latencies by simply dividing the redundant work among
threads. However, the way redundant work is performed is
highly application dependent, and thus communication between
threads might become worse with parallelization. This
shifts the computation-to-communication ratio in a multicore
setting, which limits the scalability of applications and degrades
overall performance. These effects need to be studied in detail
in practical settings in order to properly quantify tradeoffs
between leakage and efficiency.
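
The division of redundant work among threads can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the CRONO implementation: the helper names, the dummy self-loop padding, and the round-synchronous schedule (each round reads a snapshot of the distances and merges proposals after all threads finish) are hypothetical choices. Python threads are used only to show how the work is partitioned; they do not reproduce the speedups measured on a simulated multicore.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def chunks(items, n_chunks):
    """Split items into contiguous, near-equal slices (one per thread)."""
    size = math.ceil(len(items) / n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

def relax_chunk(chunk, padded, snapshot):
    """Relax every padded edge slot of every vertex in this thread's chunk."""
    proposals = []
    for u in chunk:
        for v, w in padded[u]:
            proposals.append((v, snapshot[u] + w))
    return proposals

def parallel_oblivious_sssp(adj, source, n_threads=2):
    max_deg = max(len(e) for e in adj.values())
    # Dummy weight-0 self-loops pad each vertex to max_deg edge slots (assumption).
    padded = {u: list(e) + [(u, 0)] * (max_deg - len(e)) for u, e in adj.items()}
    dist = {u: math.inf for u in adj}
    dist[source] = 0
    work = chunks(list(adj), n_threads)
    slots_per_thread = [len(c) * max_deg for c in work]  # redundant work per thread
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in range(len(adj) - 1):  # fixed number of rounds
            snapshot = dict(dist)      # all threads read the same round state
            results = pool.map(relax_chunk, work, repeat(padded), repeat(snapshot))
            for proposals in results:  # merge after the round acts as a barrier
                for v, cand in proposals:
                    dist[v] = cand if cand < dist[v] else dist[v]
    return dist, slots_per_thread
```

Because every vertex carries the same number of edge slots, the per-thread work depends only on the chunk size, so equal chunks give equal work regardless of the real vertex degrees. The merge step after each round stands in for the synchronization cost discussed above.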
One popular class of graph algorithms finds
single-source shortest paths (SSSP). We take the SSSP
workload from the CRONO suite [1], which contains several
state-of-the-art parallel workloads interfaced with both real and
synthetic graphs. The input graph used in this paper is the
California (CA) road network graph [5]. The SSSP workload is
executed on a simulated 256-core multicore using the Graphite
simulator [7]. Each core is modeled as an in-order pipeline
with 32KB private L1 instruction and data caches, and a
256KB shared L2 cache. The 256-core processor also models
8 memory controllers to access the off-chip memory. All input
graphs for the SSSP problem use an adjacency list representation.
Fig. 3 shows the overall speedup or slowdown obtained for the
parallel SSSP workload relative to its sequential version. The
x-axis shows the number of threads used to compute SSSP
with and without oblivious (redundant) execution. For the
CA road network, the average degree of the graph is around
2.5, while the maximum degree is around 12. So each vertex
in the oblivious execution has to relax 12 edges to hide its
data access pattern. The baseline SSSP algorithm’s speedup
at 256 threads is 4.24×, which is consistent
with [1]. This weak scalability is attributed to synchronization
overhead, which is the major completion time component
at high thread counts [1]. With oblivious execution,
the overall speedup drops to 4.17×, which is a negligible
performance overhead (around 2%). The primary reason for
this is that the added redundant work is easily parallelized
across threads, and thus the side effects of synchronization are
smaller relative to the baseline scenario.

Fig. 3. Eliminating vertex leakage via algorithmic redundancy. (Speedup plotted against thread count, from 1 to 256 threads, with and without oblivious execution.)
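
The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (speedups of 4.24× and 4.17×, average degree of about 2.5, maximum degree of about 12); the variable names and the definition of overhead as relative speedup loss are assumptions for illustration.

```python
baseline_speedup = 4.24   # SSSP at 256 threads, without oblivious execution
oblivious_speedup = 4.17  # SSSP at 256 threads, with oblivious execution

# Relative loss in speedup caused by the redundant work.
overhead = (baseline_speedup - oblivious_speedup) / baseline_speedup

avg_degree, max_degree = 2.5, 12
# Padding every vertex to the maximum degree multiplies edge relaxations by this factor.
redundancy_factor = max_degree / avg_degree

print(f"overhead: {overhead:.2%}, redundancy factor: {redundancy_factor:.1f}x")
```

The overhead works out to roughly 1.65%, which rounds to the 2% quoted in the text, even though each vertex performs about 4.8 times as many edge relaxations on average.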
In this paper, we show how side-channel information leakage
can be minimized using software redundancy in state-of-the-art
graph algorithms. In addition, we use
parallelization of the workload to hide the overheads of the
redundant work. Our analysis of the shortest path workload
shows an overhead of about 2% at high thread counts. As future
work, we plan to extend the idea of algorithmic redundancy and
efficient parallelization to reduce the performance overheads
of oblivious graph algorithms.
[1] M. Ahmad et al., “CRONO: A benchmark suite for multithreaded
graph algorithms executing on futuristic multicores,” in Proceedings of
the 2015 Annual IEEE Int. Symposium on Workload Characterization,
ser. IISWC. Washington, DC, USA: IEEE, 2015.
[2] M. Ahmad, K. Lakshminarasimhan, and O. Khan, “Efficient parallelization
of path planning workload on single-chip shared-memory multicores,”
in Proc. of the IEEE High Performance Extreme Computing Conf.,
ser. HPEC ’15. IEEE, 2015.
[3] M. Blanton et al., “Data-oblivious graph algorithms for secure
computation and outsourcing,” in Proceedings of the 8th ACM SIGSAC
Symposium on Information, Computer and Communications Security, ser.
ASIA CCS ’13. NY, USA: ACM, 2013, pp. 207–218.
[4] C. Fletcher et al., “Suppressing the oblivious RAM timing channel
while making information leakage and program efficiency trade-offs,”
in High Performance Computer Architecture (HPCA), 2014 IEEE 20th
International Symposium on, Feb 2014, pp. 213–224.
[5] J. Leskovec et al., “Community structure in large networks: Natural
cluster sizes and the absence of large well-defined clusters,” 2008.
[6] F. McKeen et al., “Innovative instructions and software model for
isolated execution,” in Proceedings of the 2nd Int. Workshop on Hardware
and Architectural Support for Security and Privacy, ser. HASP ’13. NY,
USA: ACM, 2013, pp. 10:1–10:1.
[7] J. Miller et al., “Graphite: A distributed parallel simulator for
multicores,” in High Performance Computer Architecture (HPCA), 2010
IEEE 16th Int. Symposium on, Jan 2010, pp. 1–12.
[8] D. Nguyen et al., “Deterministic Galois: On-demand, portable and
parameterless,” in Proceedings of the 19th Int. Conference on Architectural
Support for Programming Languages and Operating Systems, ser.
ASPLOS ’14. NY, USA: ACM, 2014, pp. 499–512.
[9] X. S. Wang et al., “Oblivious data structures,” in Proceedings of
the 2014 ACM SIGSAC Conference on Computer and Communications
Security, ser. CCS ’14. NY, USA: ACM, 2014, pp. 215–226.