Conference Paper

Comparison of the HPC and Big Data Java Libraries Spark, PCJ and APGAS

... Developed by us, the PCJ library is a successful implementation providing good scalability and reasonable performance. Another promising implementation is APGAS, a library offering an X10-like programming solution for Java [25]. ...
... APGAS stat is the basic implementation; APGAS dyn is a version enhanced with dynamic load-balancing capabilities. The APGAS library, as well as its implementation of the WordCount code, is based on prior work [25]. The APGAS code was run using SLURM in Multiple Programs, Multiple Data (MPMD) mode, with different commands used to start the main computation and the remote APGAS places. ...
... The performance of the reduction phase is key for the overall performance [10], and the best results for PCJ are obtained using binary-tree communication. The APGAS solution uses the reduction as implemented in [25] (that work reports worse performance for PCJ, due to its use of a simpler and therefore less efficient reduction scheme). ...
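The binary-tree reduction mentioned in the snippet above can be sketched as follows. This is an illustrative single-process version over an array of per-thread partial results, not PCJ's actual implementation, which performs the same halving rounds over network-distributed places.

```java
// Illustrative binary-tree reduction sketch (not PCJ's code): in each round,
// every surviving "rank" t absorbs the partial result of rank t + stride,
// halving the number of active ranks until rank 0 holds the total.
public class TreeReduce {
    static long treeReduce(long[] partials) {
        int n = partials.length;
        for (int stride = 1; stride < n; stride *= 2) {
            for (int t = 0; t + stride < n; t += 2 * stride) {
                partials[t] += partials[t + stride]; // "receive" from the peer rank
            }
        }
        return partials[0]; // rank 0 ends up with the reduced value
    }

    public static void main(String[] args) {
        long[] perThreadSums = {1, 2, 3, 4, 5};
        System.out.println(treeReduce(perThreadSums)); // prints 15
    }
}
```

The tree scheme takes log2(n) communication rounds instead of the n-1 receives of a flat reduction at a single root, which is why it matters for the reduction-bound benchmarks discussed here.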
Article
Full-text available
With the development of peta- and exascale size computational systems there is growing interest in running Big Data and Artificial Intelligence (AI) applications on them. Big Data and AI applications are implemented in Java, Scala, Python and other languages that are not widely used in High-Performance Computing (HPC), which is still dominated by C and Fortran. Moreover, they are based on dedicated environments such as Hadoop or Spark, which are difficult to integrate with traditional HPC management systems. We have developed the Parallel Computing in Java (PCJ) library, a tool for scalable high-performance computing and Big Data processing in Java. In this paper, we present the basic functionality of the PCJ library with examples of highly scalable applications running on large resources. The performance results are presented for different classes of applications including traditional computationally intensive (HPC) workloads (e.g. stencil), as well as communication-intensive algorithms such as Fast Fourier Transform (FFT). We present implementation details and performance results for Big Data type processing running on petascale size systems. Examples of large scale AI workloads parallelized using PCJ are presented.
... The performance comparison with C/MPI-based codes has been presented in previous papers 4,5 . An extensive comparison with Java-based solutions, including APGAS (a Java implementation of the X10 language), has also been performed 6,3 . ...
Article
Large-scale computing and data processing with cloud resources is gaining popularity. However, the usage of the cloud differs from traditional high-performance computing (HPC) systems, and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (parallel computing in Java), a novel tool for scalable HPC and big data processing in Java. In this article, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on examples of highly scalable applications with different characteristics, focusing on CPU, communication or I/O. They run on a traditional HPC system as well as on the Amazon Web Services Cloud and the Linaro Developer Cloud. For the clouds, we have used Intel x86 and ARM processors for running Java codes without changing any line of the program code and without the need for time-consuming recompilation. The presented applications have been parallelized using the partitioned global address space programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and to run them on the cloud with a performance close to HPC systems.
... The performance comparison with C/MPI-based codes has been presented in previous papers [22,26]. An extensive comparison with Java-based solutions, including APGAS (a Java implementation of the X10 language), has also been performed [27,31]. ...
Chapter
Cloud resources are increasingly used for large scale computing and data processing. However, the usage of the cloud differs from traditional High-Performance Computing (HPC) systems, and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. In this paper, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on examples of highly scalable applications that run on a traditional HPC system and on the Amazon AWS Cloud. For the cloud, we have used Intel x86 and ARM processors running Java codes without changing any line of the program code and without the need for time-consuming recompilation. The presented applications have been parallelized using the PGAS programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and to run them on the cloud with performance similar to HPC systems.
... Previous studies [40,41] show that the PCJ implementation of some benchmarks outperforms the Hadoop implementation, even by a factor of 100. Other studies comparing PCJ with APGAS and Apache Spark [56] show that the PCJ implementation matches the performance of Apache Spark, and for some benchmarks can be almost two times more efficient. However, not all of the studies were perfectly suited for MapReduce processing. ...
Article
Full-text available
Sorting algorithms are among the most commonly used algorithms in computer science and modern software. Having an efficient implementation of sorting is necessary for a wide spectrum of scientific applications. This paper describes a sorting algorithm written using the partitioned global address space (PGAS) model, implemented using the Parallel Computing in Java (PCJ) library. The description of the iterative implementation is used to outline possible performance issues and provide means to resolve them. The key idea of the implementation is to have an efficient building block that can be easily integrated into many application codes. This paper also presents a performance comparison of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison serves to show that the performance of the implementation is good enough, as the PCJ implementation shows efficiency similar to the Hadoop implementation.
Conference Paper
Full-text available
In this paper, we present PCJ (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. PCJ is a Java library implementing the PGAS (Partitioned Global Address Space) programming paradigm. It allows for the easy and feasible development of computational applications as well as Big Data processing. The use of Java brings HPC and Big Data types of processing together and enables running on different types of hardware. In particular, the high scalability and good performance of PCJ applications have been demonstrated using Cray XC40 systems. We present the performance and scalability of the PCJ library measured on Cray XC40 systems with standard benchmarks such as ping-pong, broadcast, and random access. We describe the parallelization of example applications of different characteristics, including FFT and 2D stencil. Results for standard Big Data benchmarks such as word count are presented. In all cases, the measured performance and scalability confirm that PCJ is a good tool to develop parallel applications of different types.
Article
Full-text available
Since large parallel machines are typically clusters of multicore nodes, parallel programs should be able to deal with both shared memory and distributed memory. This paper proposes a hybrid work stealing scheme, which combines the lifeline-based variant of distributed task pools with the node-internal load balancing of Java’s Fork/Join framework. We implemented our scheme by extending the APGAS library for Java, which is a branch of the X10 project. APGAS programmers can now spawn locality-flexible tasks with a new asyncAny construct. These tasks are transparently mapped to any resource in the overall system, so that the load is balanced over both nodes and cores. Unprocessed asyncAny-tasks can also be cancelled. In performance measurements with up to 144 workers on up to 12 nodes, we observed near linear speedups for four benchmarks and a low overhead for cancellation-related bookkeeping.
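The node-internal half of the hybrid scheme described above, load balancing via Java's Fork/Join framework, can be illustrated with a standard recursive task. The Fibonacci workload and class name below are placeholders for demonstration, not code from the paper; the paper's asyncAny construct additionally spreads such tasks across nodes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative Fork/Join work-stealing sketch: forked subtasks are placed on
// the current worker's deque, and idle pool workers steal them, balancing
// load across cores within a single node.
public class FibTask extends RecursiveTask<Long> {
    private final int n;

    public FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;        // base case computed directly
        FibTask left = new FibTask(n - 1);
        left.fork();                       // schedule; an idle worker may steal it
        FibTask right = new FibTask(n - 2);
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        System.out.println(ForkJoinPool.commonPool().invoke(new FibTask(20)));
    }
}
```

Fork/Join only balances load within one JVM; the lifeline-based inter-node stealing the paper adds is what extends this behavior across a cluster.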
Conference Paper
Full-text available
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
Article
Full-text available
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Chapter
Full-text available
With the wide adoption of multicore and multiprocessor systems, parallel programming has become a very important element of computer science. The programming of multicore systems is still complicated and far from easy. The difficulties are caused, among other things, by parallel tools, libraries and programming models that are not easy to use, especially for an inexperienced programmer. In this paper, we present PCJ, a Java library for parallel programming of heterogeneous multicore systems. PCJ adopts the Partitioned Global Address Space paradigm, which makes programming easy. We present the basic functionality of the PCJ library and its usage for the parallelization of selected applications. The scalability of a genetic algorithm implementation is presented. The parallelization of an N-body algorithm implementation with PCJ is also described.
Article
Full-text available
One of the biggest challenges of the current big data landscape is our inability to process vast amounts of information in a reasonable time. In this work, we explore and compare two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf that is high-performance oriented and exploits multi-machine/multicore infrastructures, and Apache Spark on Hadoop which targets iterative algorithms through in-memory computing. We use the Google Cloud Platform service to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: KNN and Pegasos SVM. Results obtained from experiments with a particle physics data set show MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance. However, Spark shows better data management infrastructure and the possibility of dealing with other aspects such as node failure and data replication.
Conference Paper
Full-text available
APGAS (Asynchronous Partitioned Global Address Space) is a model for concurrent and distributed programming, known primarily as the foundation of the X10 programming language. In this paper, we present an implementation of this model as an embedded domain-specific language for Scala. We illustrate common usage patterns and contrast with alternative approaches available to Scala programmers. In particular, using two distributed algorithms as examples, we illustrate how APGAS-style programs compare to idiomatic Akka implementations. We demonstrate the use of APGAS places and tasks, distributed termination, and distributed objects.
Conference Paper
Full-text available
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
Article
Full-text available
We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intricate synchronizations (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from the users. GLB internally uses a version of the lifeline graph based work-stealing algorithm proposed by Saraswat et al. [25]. Users of GLB are simply required to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of the code correctly and efficiently at scale. We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be statically load-balanced whereas UTS cannot. In either case, GLB scales well, achieving nearly linear speedup on different computer architectures (Power, Blue Gene/Q, and K), up to 16K cores.
Conference Paper
Full-text available
This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult.
Article
The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce‐style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off‐load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the analysis of large‐scale data sets. Alchemist calls MPI‐based libraries from within Spark applications, and it has minimal coding, communication, and memory overheads. In particular, Alchemist allows users to retain the productivity benefits of working within the Spark software ecosystem without sacrificing performance efficiency in linear algebra, machine learning, and other related computations. In this paper, we discuss the motivation behind the development of Alchemist, and we provide a detailed overview of its design and usage. We also demonstrate the efficiency of our approach on medium‐to‐large data sets, using some standard linear algebra operations, namely, matrix multiplication and the truncated singular value decomposition of a dense matrix, and we compare the performance of Spark with that of Spark+Alchemist. These computations are run on the NERSC supercomputer Cori Phase 1, a Cray XC40.
Chapter
Since today’s clusters consist of nodes with multicore processors, modern parallel applications should be able to deal with shared and distributed memory simultaneously. In this paper, we present a novel hybrid work stealing scheme for the APGAS library for Java, which is a branch of the X10 project. Our scheme extends the library’s runtime system, which traditionally performs intra-node work stealing with the Java Fork/Join framework. We add an inter-node work stealing scheme that is inspired by lifeline-based global load balancing. The extended functionality can be accessed from the APGAS library with new constructs. Most important, locality-flexible tasks can be submitted with asyncAny, and are then automatically scheduled over both nodes and cores. In experiments with up to 144 workers on up to 12 nodes, our system achieved near linear speedups for three benchmarks.
Article
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
Conference Paper
Work stealing can be implemented in either a cooperative or a coordinated way. We compared the two approaches for lifeline-based global load balancing, which is the algorithm used by X10's Global Load Balancing framework GLB. We conducted our study with the APGAS library for Java, to which we ported GLB in a first step. Our cooperative variant resembles the original GLB framework, except that strict sequentialization is replaced by Java synchronization constructs such as critical sections. Our coordinated variant enables concurrent access to local task pools by using a split queue data structure. In experiments with modified versions of the UTS and BC benchmarks, the cooperative and coordinated APGAS variants had similar execution times, without a clear winner. Both variants outperformed the original GLB when compiled with Managed X10. Experiments were run on up to 128 nodes, to which we assigned up to 512 places.
Conference Paper
The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the "majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. 
Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.
Conference Paper
We propose the APGAS library for Java 8. Inspired by the core constructs and semantics of the Resilient X10 programming language, APGAS brings many benefits of the X10 programming model to the Java programmer as a pure, idiomatic Java library. APGAS supports the development of resilient distributed applications running on elastic clusters of JVMs. It provides asynchronous lightweight tasks (local and remote), resilient distributed termination detection, and global heap references. We compare and contrast the X10 and APGAS programming styles, review key design choices, and demonstrate that APGAS achieves performance comparable with X10.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
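The map and reduce functions described above can be sketched for the WordCount case, which this page's paper uses as a benchmark, with plain Java streams. This single-process sketch only illustrates the two phases; the distribution, fault handling, and scheduling are exactly what the MapReduce runtime (or Spark, PCJ, APGAS) provides on top. The word-splitting regex is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of WordCount's map and reduce phases in a single JVM:
// map = split the text into normalized words; reduce = count per word.
public class WordCount {
    static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}]+")) // map phase
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.counting()));                     // reduce phase
    }

    public static void main(String[] args) {
        System.out.println(count("War and peace and war"));
    }
}
```

In the distributed versions compared by the paper, each place counts its chunk of the input locally and the per-place maps are then merged in the reduction phase whose performance the citation snippets discuss.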
Article
L. Tolstoy, War and Peace. The University of Adelaide Library, eBooks @ Adelaide.
PCJ library repository
  • P Bala
  • M Nowicki
P. Bała and M. Nowicki, "PCJ library repository," 2018. [Online]. Available: https://github.com/hpdcj/PCJ
Extended APGAS library repository
  • J Posner
J. Posner, "Extended APGAS library repository," 2018. [Online]. Available: https://github.com/posnerj/PLM-APGAS
Google Java style guide
  • Google
Google, "Google Java style guide," 2018. [Online]. Available: https://google.github.io/styleguide/javaguide.html
HabaneroUPC++ library repository
  • Rice University
Rice University, "HabaneroUPC++ library repository," 2018. [Online]. Available: https://github.com/habanero-rice/habanero-upc
Comparison of HabaneroUPC++ and APGAS library
  • J Scherbaum
J. Scherbaum, "Comparison of HabaneroUPC++ and APGAS library," Bachelor's thesis, University of Kassel, Germany, 2016.
The leading open source in-memory data grid
  • Hazelcast
Hazelcast, "The leading open source in-memory data grid," 2018. [Online]. Available: http://hazelcast.org
Apache Spark
  • The Apache Software Foundation
The Apache Software Foundation, "Apache Spark," 2018. [Online]. Available: https://github.com/apache/spark

Apache Mesos
  • The Apache Software Foundation
The Apache Software Foundation, "Apache Mesos," 2018. [Online]. Available: http://mesos.apache.org
Akka library repository
  • Lightbend
Lightbend, "Akka library repository," 2018. [Online]. Available: https://github.com/akka/akka
Datasets
Both novels used by WordCount are available online:
  • Lev Tolstoy's War and Peace [28] (written in English, 3.3 MB): http://www.gutenberg.org/files/2600/2600-0.txt
  • Georges de Scudéry's Artamène ou le Grand Cyrus [29] (written in French, 10 MB): http://www.artamene.org/telecharger.php
Lst. 3: APGAS dyn: Code for Pi
long points = 1L << Integer.parseInt(args[0]);
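The listing fragment above suggests a Monte Carlo estimate of Pi with a power-of-two number of sample points taken from args[0]. A minimal sequential sketch under that assumption follows; the method name and fixed seed are illustrative, and the paper's APGAS dyn version presumably distributes this loop across places as load-balanced tasks.

```java
import java.util.Random;

// Illustrative sequential Monte Carlo Pi sketch (not the paper's code):
// sample random points in the unit square and count how many fall inside
// the quarter circle; the inside ratio approximates Pi/4.
public class MonteCarloPi {
    static double estimate(long points, long seed) {
        Random rnd = new Random(seed); // fixed seed is an assumption, for reproducibility
        long inside = 0;
        for (long i = 0; i < points; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return 4.0 * inside / points;
    }

    public static void main(String[] args) {
        long points = 1L << Integer.parseInt(args.length > 0 ? args[0] : "20");
        System.out.println(estimate(points, 42L));
    }
}
```

Because every sample is independent, the point loop is trivially decomposable into chunks, which is why Pi is a natural benchmark for comparing the static (APGAS stat) and dynamically load-balanced (APGAS dyn) task schemes.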
Artifact description:
  • Program: Java
  • Compilation: javac 1.8 for Spark, javac 10 for PCJ and APGAS
  • Run-time environment: Linux with Oracle Java
  • Hardware: Any hardware
  • Output: Calculated results, elapsed running time
  • Publicly available?: Yes