Chapter

Performance Comparison of Graph BFS Implemented in MapReduce and PGAS Programming Models

Abstract

Computations based on graphs are common in many problems, but their complexity, the increasing size of analyzed graphs, and the large amount of communication involved make this analysis a challenging task. In this paper, we present a comparison of two parallel BFS (Breadth-First Search) implementations: one using MapReduce run on the Hadoop infrastructure and one written in the PGAS (Partitioned Global Address Space) model. The latter implementation has been developed with the help of PCJ (Parallel Computations in Java), a library for parallel and distributed computations in Java. Both implementations realize the level-synchronous strategy: the Hadoop algorithm uses iterative MapReduce jobs, whereas PCJ uses explicit synchronization after each level. The scalability of both solutions is similar. However, the PCJ implementation is much faster (about 100 times) than the MapReduce Hadoop solution.
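
To make the shared strategy concrete, a level-synchronous BFS expands the search frontier one level at a time: level k+1 is built only from vertices discovered at level k, so both implementations need a global synchronization point between levels (a MapReduce job boundary in Hadoop, a barrier in PCJ). The following minimal sequential Java sketch shows this pattern; it is illustrative only and not code from either implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Level-synchronous BFS over an adjacency-list graph: the swap between
// currentFrontier and nextFrontier is the per-level synchronization point.
public class LevelSynchronousBfs {

    static int[] bfsLevels(List<List<Integer>> adjacency, int source) {
        int n = adjacency.size();
        int[] level = new int[n];
        Arrays.fill(level, -1);                      // -1 means "not visited yet"
        level[source] = 0;

        List<Integer> currentFrontier = new ArrayList<>(List.of(source));
        int depth = 0;
        while (!currentFrontier.isEmpty()) {
            List<Integer> nextFrontier = new ArrayList<>();
            for (int v : currentFrontier) {
                for (int w : adjacency.get(v)) {
                    if (level[w] == -1) {            // discovered for the first time
                        level[w] = depth + 1;
                        nextFrontier.add(w);
                    }
                }
            }
            currentFrontier = nextFrontier;          // level boundary: all of level k done
            depth++;
        }
        return level;
    }

    public static void main(String[] args) {
        // Small example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4
        List<List<Integer>> adjacency = List.of(
                List.of(1, 2), List.of(0, 3), List.of(0, 3), List.of(1, 2, 4), List.of(3));
        System.out.println(Arrays.toString(bfsLevels(adjacency, 0)));   // prints [0, 1, 1, 2, 3]
    }
}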

... There are papers [40,41] that show the PCJ implementation of some benchmarks scales very well and outperforms the Hadoop implementation, even by a factor of 100. One of the benchmarks calculates an approximation of the value of π using the quasi-Monte Carlo method. ... The TeraSort is one of the widely used benchmarks for Hadoop. ...
... Previous studies [40,41] show that the PCJ implementation of some benchmarks outperforms the Hadoop implementation, even by a factor of 100. Other studies that compare PCJ with APGAS and Apache Spark [56] show that the PCJ implementation has the same performance as Apache Spark, and for some benchmarks it can be almost two times more efficient. ...
... Moreover, the PGAS model is a general-purpose model that suits many problems, in contrast to the MapReduce model, where the problem has to be adapted for the model (cf. [9,40,57,58]). ...
Article
Sorting algorithms are among the most commonly used algorithms in computer science and modern software. Having an efficient implementation of sorting is necessary for a wide spectrum of scientific applications. This paper describes the sorting algorithm written using the partitioned global address space (PGAS) model, implemented using the Parallel Computing in Java (PCJ) library. The iterative implementation description is used to outline possible performance issues and provide means to resolve them. The key idea of the implementation is to have an efficient building block that can be easily integrated into many application codes. This paper also presents a performance comparison of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison serves to show that the performance of the implementation is good enough, as the PCJ implementation shows similar efficiency to the Hadoop implementation.
Article
Graph500 is a new benchmark to rank supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth-First Search) of a large graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and reference implementations on a large distributed memory supercomputer with a Fat-Tree-based InfiniBand network.
Article
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
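
The "simple vertex-based partitioning" mentioned above typically means a 1D block distribution: each of the p processes owns a contiguous block of vertices, expands only the frontier vertices it owns, and sends newly discovered vertices to their owners before the next level starts. A small Java helper sketching the block arithmetic (class and method names are illustrative, not taken from the paper):

// 1D block distribution of n vertices over p processes.
final class BlockDistribution {
    final long n;       // total number of vertices
    final int p;        // number of processes
    final long block;   // vertices per process (the last block may be shorter)

    BlockDistribution(long n, int p) {
        this.n = n;
        this.p = p;
        this.block = (n + p - 1) / p;                                     // ceil(n / p)
    }

    int owner(long vertex)       { return (int) (vertex / block); }      // process that owns the vertex
    long localIndex(long vertex) { return vertex - owner(vertex) * block; } // offset inside its block
    long firstVertex(int rank)   { return (long) rank * block; }         // first vertex of a rank's block
}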
Article
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Conference Paper
Graph processing is used in many fields of science such as sociology, risk prediction or biology. Although the analysis of graphs is important, it also poses numerous challenges, especially for large graphs which have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous BFS (Breadth-First Search) algorithm and its implementation written in Java. Java so far has not been used extensively in high performance computing, but because of its popularity, portability, and increasing capabilities it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computations in Java) library. In this paper, we present implementation details and compare the scalability and performance of our code with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with MPI code written in C. We present the challenges we faced and the optimizations used in our implementation that were necessary to obtain good performance.
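
The essential structure of the PGAS version is that every thread expands only the frontier vertices assigned to it and then waits on a global barrier before the next level begins. The sketch below reproduces that structure with plain Java threads and a CyclicBarrier; it is a simplified shared-memory analogue for illustration, not the PCJ code from the paper (class name and the two-thread setup are assumptions).

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Level-synchronous parallel BFS: each worker expands a slice of the current
// frontier, and the barrier between levels plays the role of the global
// synchronization used in the PGAS implementation.
public class ParallelLevelBfs {

    public static void main(String[] args) throws Exception {
        int[][] adj = { {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3} };   // edges 0-1, 0-2, 1-3, 2-3, 3-4
        int n = adj.length, threads = 2, source = 0;

        AtomicIntegerArray level = new AtomicIntegerArray(n);
        for (int v = 0; v < n; v++) level.set(v, -1);               // -1 means "not visited"
        level.set(source, 0);

        List<Integer> frontier = new ArrayList<>(List.of(source));
        List<List<Integer>> discovered = new ArrayList<>();
        for (int t = 0; t < threads; t++) discovered.add(new ArrayList<>());

        // Barrier action: merge per-thread discoveries into the next frontier.
        CyclicBarrier levelBarrier = new CyclicBarrier(threads, () -> {
            frontier.clear();
            for (List<Integer> part : discovered) { frontier.addAll(part); part.clear(); }
        });

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                int depth = 0;
                try {
                    while (true) {
                        for (int i = id; i < frontier.size(); i += threads) {  // this thread's slice
                            for (int w : adj[frontier.get(i)]) {
                                if (level.compareAndSet(w, -1, depth + 1)) {   // first discovery wins
                                    discovered.get(id).add(w);
                                }
                            }
                        }
                        levelBarrier.await();            // global per-level synchronization
                        depth++;
                        if (frontier.isEmpty()) break;   // no new vertices: search finished
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();

        for (int v = 0; v < n; v++) System.out.println("level[" + v + "] = " + level.get(v));
    }
}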
Chapter
Graph-based computations are used in many applications. The increasing size of analyzed data and its complexity make graph analysis a challenging task. In this paper, we present a performance evaluation of a Java implementation of the Graph500 benchmark. It has been developed with the help of the PCJ (Parallel Computations in Java) library for parallel and distributed computations in Java. PCJ is based on the PGAS (Partitioned Global Address Space) programming paradigm, where all communication details such as threads or network programming are hidden. In this paper, we present Java implementation details of the first and second kernels of the Graph500 benchmark. The results are compared with the existing MPI implementations of the Graph500 benchmark, showing good scalability of the PCJ library.
Conference Paper
In this paper we present PCJ, a new library for parallel computations in Java. The PCJ library implements the partitioned global address space approach. It hides communication details and is therefore easy to use, allowing for fast development of parallel programs. With PCJ, the user can focus on the implementation of the algorithm rather than on thread or network programming. The design details are described together with examples of usage for basic operations. We also present an evaluation of the performance of PCJ communication on state-of-the-art hardware such as a cluster with gigabit interconnect. The results show good performance and scalability when compared to native MPI implementations.
Article
Graph500 is a new benchmark for supercomputers based on large-scale graph analysis, which is becoming an important form of analysis in many real-world applications. Graph algorithms run well on supercomputers with shared memory. For the Linpack-based supercomputer rankings, TOP500 reports that heterogeneous and distributed-memory supercomputers with large numbers of GPGPUs are becoming dominant. However, the performance characteristics of large-scale graph analysis benchmarks such as Graph500 on distributed-memory supercomputers have so far received little study. This is the first report of a performance evaluation and analysis for Graph500 on a commodity-processor-based distributed-memory supercomputer. We found that the reference implementation “replicated-csr”, based on distributed level-synchronized breadth-first search, solves a large scale-free graph problem with 2^31 vertices and 2^35 edges (approximately 2.15 billion vertices and 34.3 billion edges) in 3.09 seconds with 128 nodes and 3,072 cores. This equates to 11 giga-edges traversed per second. We describe the algorithms and implementations of the reference implementations of Graph500, and analyze the performance characteristics with varying graph sizes, numbers of computer nodes, and different implementations. Our results will also contribute to the development of optimized algorithms for the coming exascale machines.
Chapter
In this paper we present PCJ, a new library for parallel computations in Java inspired by the partitioned global address space approach. We present design details together with examples of usage for basic operations such as point-to-point communication, synchronization, or broadcast. The PCJ library is easy to use and allows for fast development of parallel programs. It allows distributed applications to be developed in an easy way, hiding communication details so that the user can focus on the implementation of the algorithm rather than on network or thread programming. Objects and variables are local to the program threads; however, some variables can be marked as shared, which allows them to be accessed from different nodes. For shared variables, PCJ offers one-sided communication, which allows for easy access to data stored at different nodes. Parallel programs developed with PCJ can be run on distributed systems with different Java VMs running on the nodes. In the paper we present an evaluation of the performance of PCJ communication on state-of-the-art hardware. The results are compared with a native MPI implementation, showing good performance and scalability of PCJ.
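
As a rough illustration of the shared-variable style described above, here is a minimal sketch written against the enum-based API of recent PCJ releases (PCJ 5): each PCJ thread writes one value into a shared variable, and thread 0 reads all of them with one-sided get() calls after a barrier. The annotation layout, method signatures, and especially the launcher call differ between PCJ versions, so treat this as an assumption-laden sketch rather than code from the papers.

import org.pcj.PCJ;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

// Minimal PCJ-style sketch (assumes the PCJ 5 API): one shared long per thread,
// summed by thread 0 after a global barrier.
@RegisterStorage(SharedSum.Vars.class)
public class SharedSum implements StartPoint {

    @Storage(SharedSum.class)
    enum Vars { value }              // names of the shared variables of this class

    long value;                      // marked shared via the Vars enum above

    @Override
    public void main() {
        value = 100L * PCJ.myId();   // each thread writes its own copy
        PCJ.barrier();               // make sure every thread has written its value

        if (PCJ.myId() == 0) {
            long sum = 0;
            for (int id = 0; id < PCJ.threadCount(); id++) {
                sum += PCJ.<Long>get(id, Vars.value);   // one-sided read from thread `id`
            }
            System.out.println("sum = " + sum);
        }
    }

    public static void main(String[] args) {
        // Launch two PCJ threads on the local machine (PCJ 5 style; older releases
        // used PCJ.deploy()/PCJ.start() with a nodes description instead).
        PCJ.executionBuilder(SharedSum.class)
           .addNodes(new String[]{"localhost", "localhost"})
           .deploy();
    }
}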
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
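
One way to see how BFS fits this model, and why the Hadoop implementation in the paper is built from iterative jobs: each MapReduce job advances the search by a single level. The mapper re-emits the graph structure and proposes distance+1 for the neighbours of already-reached vertices, the reducer keeps the minimum distance per vertex, and a driver (not shown) re-submits the job until no distance changes. The sketch below uses the standard Hadoop API, but the record format and class names are illustrative assumptions, not the paper's code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One BFS level as a MapReduce job. Input and output records look like
//   "<vertexId>\t<comma-separated-adjacency>|<distance or INF>"
// so the reducer output can be fed directly into the next iteration.
public class BfsIteration {

    public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] keyValue = line.toString().split("\t");
            String vertex = keyValue[0];
            String[] parts = keyValue[1].split("\\|");
            String adjacency = parts[0];
            String distance = parts[1];

            // Pass the graph structure (adjacency list) through unchanged.
            context.write(new Text(vertex), new Text("STRUCT:" + adjacency + "|" + distance));

            // If this vertex is already reached, propose distance+1 to every neighbour.
            if (!"INF".equals(distance) && !adjacency.isEmpty()) {
                long next = Long.parseLong(distance) + 1;
                for (String neighbour : adjacency.split(",")) {
                    context.write(new Text(neighbour), new Text("DIST:" + next));
                }
            }
        }
    }

    public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text vertex, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String adjacency = "";
            long best = Long.MAX_VALUE;                   // MAX_VALUE stands for INF
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("STRUCT:")) {
                    String[] parts = v.substring("STRUCT:".length()).split("\\|");
                    adjacency = parts[0];
                    if (!"INF".equals(parts[1])) best = Math.min(best, Long.parseLong(parts[1]));
                } else {                                  // "DIST:<n>" proposal from a neighbour
                    best = Math.min(best, Long.parseLong(v.substring("DIST:".length())));
                }
            }
            String distance = (best == Long.MAX_VALUE) ? "INF" : Long.toString(best);
            context.write(vertex, new Text(adjacency + "|" + distance));
        }
    }
}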
Level-synchronous parallel breadth-first search algorithms for multicore and multiprocessor systems
  • R Berrendorf
  • M Makulla
R. Berrendorf, M. Makulla: Level-Synchronous Parallel Breadth-First Search Algorithms for Multicore and Multiprocessor Systems. FC 2014, pp. 26-31, 2014.
Parallel computations in Java with PCJ library
  • M Nowicki
  • P Bała
M. Nowicki, P. Bała: Parallel computations in Java with PCJ library. In: W. W. Smari and V. Zeljkovic (eds.), 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2012, pp. 381-387.
PCJ - new approach for parallel computations in Java
  • M Nowicki
  • P Bała
M. Nowicki, P. Bała: PCJ - new approach for parallel computations in Java. In: P. Manninen, P. Öster (eds.), Applied Parallel and Scientific Computing (LNCS 7782), Springer, Heidelberg (2013), pp. 115-125.
The performance evaluation of the Java implementation of Graph500
  • M Ryczkowska
  • M Nowicki
  • P Bała
Ryczkowska, M., Nowicki, M., Bała, P.: The performance evaluation of the Java implementation of Graph500. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 221-230. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32152-3_21
Pregel (with implementations built on top of Apache Hadoop) and GraphX (Apache Spark's API for graphs and graph-parallel computations).
Parallel breadth-first search on distributed memory systems
  • A Buluc
  • K Madduri
A. Buluc, K. Madduri: Parallel breadth-first search on distributed memory systems. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.