Conference Paper

Performance evaluation of parallel computing and Big Data processing with Java and PCJ library

Abstract and Figures

In this paper, we present PCJ (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. PCJ is a Java library implementing the PGAS (Partitioned Global Address Space) programming paradigm. It allows for the easy and feasible development of computational applications as well as Big Data processing. The use of Java brings HPC and Big Data processing together and enables running on different types of hardware. In particular, the high scalability and good performance of PCJ applications have been demonstrated on Cray XC40 systems. We present the performance and scalability of the PCJ library measured on Cray XC40 systems with standard benchmarks such as ping-pong, broadcast, and random access. We describe the parallelization of example applications with different characteristics, including FFT and 2D stencil. Results for standard Big Data benchmarks such as word count are presented. In all cases, the measured performance and scalability confirm that PCJ is a good tool for developing parallel applications of different types.
... In previous works, we have shown that the PCJ library allows for easy and feasible development of computational applications as well as Big Data and AI processing running on supercomputers or clusters [3]. The performance comparison with the C/MPI based codes has been presented in previous papers [4,5]. The extensive comparison with Java-based solutions including APGAS (Java implementation of X10 language) has also been performed [6,3]. ...
... Communication is localized in the all-to-all routine that is used for a global conversion of data layout, from block to cyclic and vice versa. Implementation details are described in [5]. Full source code is available on GitHub [52]. ...
... The FFT code runs faster on Cray XC40 and scales up to several PCJ threads. Scalability is better for larger arrays, as presented in our previous work [5]. ...
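The block-to-cyclic conversion mentioned in this snippet can be illustrated with a minimal single-process sketch. It does not use the PCJ library or its all-to-all routine; the class name `LayoutConversion`, the array-of-arrays representation (one row per thread), and the assumption that the element count divides evenly by the thread count are all illustrative choices, not details from the cited paper.

```java
import java.util.Arrays;

public class LayoutConversion {
    // Block layout: thread t owns global indices [t*b, (t+1)*b), where b = n / p.
    // Cyclic layout: thread t owns global indices t, t + p, t + 2p, ...
    // Assumption: every row has the same length b (n is divisible by p).
    public static int[][] blockToCyclic(int[][] block) {
        int p = block.length;       // number of (simulated) threads
        int b = block[0].length;    // elements held by each thread
        int[][] cyclic = new int[p][b];
        for (int src = 0; src < p; src++) {
            for (int i = 0; i < b; i++) {
                int global = src * b + i;   // global index of this element
                int dst = global % p;       // owner in the cyclic layout
                int pos = global / p;       // local position at that owner
                cyclic[dst][pos] = block[src][i];
            }
        }
        return cyclic;
    }

    // Inverse mapping, from cyclic back to block.
    public static int[][] cyclicToBlock(int[][] cyclic) {
        int p = cyclic.length;
        int b = cyclic[0].length;
        int[][] block = new int[p][b];
        for (int src = 0; src < p; src++) {
            for (int j = 0; j < b; j++) {
                int global = j * p + src;   // global index held at (src, j)
                block[global / b][global % b] = cyclic[src][j];
            }
        }
        return block;
    }

    public static void main(String[] args) {
        int[][] block = {{0, 1, 2, 3}, {4, 5, 6, 7}};
        int[][] cyclic = blockToCyclic(block);
        System.out.println(Arrays.deepToString(cyclic));
        System.out.println(Arrays.deepToString(cyclicToBlock(cyclic)));
    }
}
```

In general each source row contributes elements to every destination row, which is why a distributed version of this conversion needs an all-to-all exchange.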
Article
Large-scale computing and data processing with cloud resources is gaining popularity. However, the usage of the cloud differs from traditional high-performance computing (HPC) systems and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (parallel computing in Java), a novel tool for scalable HPC and big data processing in Java. In this article, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on the examples of highly scalable applications of different characteristics focusing on CPU, communication or I/O. They run on the traditional HPC system and the Amazon Web Services cloud as well as the Linaro Developer Cloud. For the clouds, we have used Intel x86 and ARM processors for running Java codes without changing any line of the program code and without the need for time-consuming recompilation. Presented applications have been parallelized using the partitioned global address space programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and for running them on the cloud with a performance close to HPC systems.
... In this paper, we focus on the comparison of PCJ with Java-based solutions. The performance comparison with the C/MPI based codes has been presented in previous papers [2,10]. ...
... There are also solutions based on various implementations of the MPI library [14,15], distributed Java Virtual Machine (JVM) [16] and solutions based on Remote Method Invocation (RMI) [17]. Such solutions rely on external communication libraries written in other languages, which causes many problems in terms of usability, portability, scalability, and performance. ...
... For a larger number of threads, the parallel efficiency decreases due to the small workload run on each processor compared to the communication time required for halo exchange. The scaling results obtained in the weak scaling mode (i.e. with a constant amount of work allocated to each thread regardless of the number of threads) show good scalability beyond the 100,000-thread limit [10]. The ideal scaling dashed line for PCJ is plotted for reference. ...
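The two scaling modes in this snippet use different efficiency definitions, which a short sketch can make explicit. These are the standard textbook formulas rather than anything taken from the cited paper; the class name `ScalingEfficiency` and the timings in `main` are made-up illustrative values, not measured results.

```java
public class ScalingEfficiency {
    // Strong scaling: the total problem size is fixed, so the ideal
    // run time on p threads is t1 / p. Efficiency = t1 / (p * tp).
    public static double strongEfficiency(double t1, double tp, int p) {
        return t1 / (p * tp);
    }

    // Weak scaling: the work per thread is fixed, so the ideal
    // run time stays equal to t1. Efficiency = t1 / tp.
    public static double weakEfficiency(double t1, double tp) {
        return t1 / tp;
    }

    public static void main(String[] args) {
        // Hypothetical timings in seconds, for illustration only.
        System.out.println(strongEfficiency(100.0, 2.0, 64));
        System.out.println(weakEfficiency(10.0, 12.5));
    }
}
```

With a fixed problem size, the per-thread workload shrinks as threads are added while the halo-exchange cost does not shrink at the same rate, so `tp` stops decreasing proportionally and the strong-scaling efficiency drops.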
Article
With the development of peta- and exascale size computational systems there is growing interest in running Big Data and Artificial Intelligence (AI) applications on them. Big Data and AI applications are implemented in Java, Scala, Python and other languages that are not widely used in High-Performance Computing (HPC), which is still dominated by C and Fortran. Moreover, they are based on dedicated environments such as Hadoop or Spark, which are difficult to integrate with the traditional HPC management systems. We have developed the Parallel Computing in Java (PCJ) library, a tool for scalable high-performance computing and Big Data processing in Java. In this paper, we present the basic functionality of the PCJ library with examples of highly scalable applications running on large resources. The performance results are presented for different classes of applications including traditional computationally intensive (HPC) workloads (e.g. stencil), as well as communication-intensive algorithms such as Fast Fourier Transform (FFT). We present implementation details and performance results for Big Data type processing running on petascale size systems. Examples of large-scale AI workloads parallelized using PCJ are presented.
... In the previous works, we have shown that the PCJ library allows for the easy and feasible development of computational applications as well as Big Data and AI processing running on supercomputers or clusters. The performance comparison with the C/MPI based codes has been presented in previous papers [22,26]. The extensive comparison with Java-based solutions including APGAS (Java implementation of X10 language) has also been performed [27,31]. ...
... The communication is localized in the all-to-all routine that is used for a global conversion of data layout, from block to cyclic and vice versa. The implementation details are described in [26]. Full source code is available on GitHub at [11]. ...
... The FFT code runs faster on Cray XC40 and scales up to several PCJ threads. The scalability is better for larger arrays, as presented in [26]. The AWS cloud also shows good scalability. ...
Chapter
Cloud resources are increasingly used for large-scale computing and data processing. However, the usage of the cloud is different from that of traditional High-Performance Computing (HPC) systems and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. In this paper, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on the examples of highly scalable applications that run on the traditional HPC system and Amazon AWS Cloud. For the cloud, we have used Intel x86 and ARM processors running Java codes without changing any line of the program code and without the need for time-consuming recompilation. Presented applications have been parallelized using the PGAS programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and for running them on the cloud with performance similar to that of HPC systems.
... Here we present results for the simple reduction algorithm based on the serial execution as well as an optimal algorithm based on the binary tree. A more detailed comparison of the different reduction algorithms can be found in [19], [20]. In the serial algorithm presented in Listing 11, thread #0 gets values from all nodes and adds partial results. ...
... Different implementations of the serial algorithm using the put() or waitFor() methods were also developed, but their performance and scalability are similar [19]. The difference between the serial and binary tree implementations is also small; it becomes visible only for a large number of threads, exceeding one thousand [20]. ...
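The difference between the serial and binary-tree reductions described in these snippets can be sketched in plain Java. This is a single-process simulation under stated assumptions: the array `partial` stands in for one partial result per thread, the class and method names are hypothetical, and no PCJ calls are used, so it is not the cited Listing 11.

```java
public class ReductionSketch {
    // Serial reduction: thread 0 fetches every other thread's partial
    // value in turn, which takes p - 1 sequential steps.
    // Assumption: partial holds at least one element.
    public static long serialReduce(long[] partial) {
        long sum = partial[0];
        for (int t = 1; t < partial.length; t++) {
            sum += partial[t];
        }
        return sum;
    }

    // Binary-tree reduction: in each round, thread t (a multiple of
    // 2 * stride) adds the value held by thread t + stride. All pairs
    // in one round are independent and could run concurrently.
    public static long treeReduce(long[] partial) {
        long[] v = partial.clone();   // keep the caller's data intact
        for (int stride = 1; stride < v.length; stride *= 2) {
            for (int t = 0; t + stride < v.length; t += 2 * stride) {
                v[t] += v[t + stride];
            }
        }
        return v[0];
    }

    // Number of rounds the tree needs for p threads: ceil(log2(p)).
    public static int treeRounds(int p) {
        int rounds = 0;
        for (int stride = 1; stride < p; stride *= 2) {
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        long[] partial = {1, 2, 3, 4, 5};
        System.out.println(serialReduce(partial));
        System.out.println(treeReduce(partial));
        System.out.println(treeRounds(1024));
    }
}
```

The serial variant needs p - 1 dependent steps while the tree needs about log2(p) rounds, which is consistent with the gap only becoming visible at large thread counts.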
... The performance of the PCJ library has been tested based on the standard microbenchmarks such as ping-pong, broadcast, and barrier showing good scalability and performance up to hundreds of thousands of cores [20]. ...
Conference Paper
PCJ is a Java library for scalable high-performance computing and Big Data processing. The library implements the partitioned global address space (PGAS) model. The PCJ application is run as a multi-threaded application with the threads distributed over multiple Java Virtual Machines. Each task has its own local memory to store and access variables locally. Selected variables can be shared between tasks and can be accessed, read and modified by other tasks. The library provides methods to perform basic operations like synchronization of tasks and getting and putting values in an asynchronous one-sided way. Additionally, PCJ offers methods for creating groups of tasks, broadcasting and monitoring variables. The library hides details of inter- and intra-node communication, making programming easy and feasible. The PCJ library allows for easy development of highly scalable (up to 200k cores) applications running on large resources. PCJ applications can also be run on systems designed for data analytics such as Hadoop clusters. In this case, performance is higher than for native applications. The PCJ library fully complies with Java standards; therefore, the programmer does not have to use additional libraries that are not part of the standard Java distribution. In this paper, we present details of the PCJ library, its API and example applications. The results show good performance and scalability. It is noteworthy that the PCJ library, due to its performance and ability to create simple code, has great promise to be successful for the parallelization of HPC and Big Data applications.
... The PCJ library for Java [5] won the HPC Challenge Class 2 Best Productivity Award on Supercomputing in 2014, and achieves a better performance than MPI with Java bindings [11]. In some situations, however, the performance of PCJ is up to three times below that of MPI with C bindings [11]. ...
... PCJ implements the Partitioned Global Address Space (PGAS) paradigm for running concurrent applications. The PCJ library allows running concurrent applications on systems comprising one or many multicore nodes, such as a standard workstation, nodes with hundreds of computational threads like Intel KNL processors [11], computing clusters or even supercomputers [12]. ...
Article
Sorting algorithms are among the most commonly used algorithms in computer science and modern software. Having an efficient implementation of sorting is necessary for a wide spectrum of scientific applications. This paper describes a sorting algorithm written using the partitioned global address space (PGAS) model, implemented using the Parallel Computing in Java (PCJ) library. The iterative implementation description is used to outline the possible performance issues and provide means to resolve them. The key idea of the implementation is to have an efficient building block that can be easily integrated into many application codes. This paper also presents a performance comparison of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison serves to show that the performance of the implementation is good enough, as the PCJ implementation shows similar efficiency to the Hadoop implementation.
Article
The detailed knowledge of the C. elegans connectome for 3 decades has not contributed dramatically to our understanding of the worm’s behavior. One of the main reasons for this situation has been the lack of data on the type of synaptic signaling between particular neurons in the worm’s connectome. The aim of this study was to determine synaptic polarities for each connection in a small pre-motor circuit controlling locomotion. Even in this compact network of just 7 neurons the space of all possible patterns of connection types (excitation vs. inhibition) is huge. To deal effectively with this combinatorial problem we devised a novel and relatively fast technique based on genetic algorithms and large-scale parallel computations, which we combined with detailed neurophysiological modeling of interneuron dynamics, and compared the theory to the available behavioral data. As a result of these massive computations, we found that the optimal connectivity pattern that best matches the locomotory data is the one in which all interneuron connections are inhibitory, even those terminating on motor neurons. This finding is consistent with recent experimental data on cholinergic signaling in C. elegans, and it suggests that the system controlling locomotion is designed to save metabolic energy. Moreover, this result provides a solid basis for a more realistic modeling of neural control in these worms, and our novel powerful computational technique can in principle be applied (possibly with some modifications) to other small-scale functional circuits in C. elegans.
Conference Paper
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
Conference Paper
Building high-performance virtual machines is a complex and expensive undertaking; many popular languages still have low-performance implementations. We describe a new approach to virtual machine (VM) construction that amortizes much of the effort in initial construction by allowing new languages to be implemented with modest additional effort. The approach relies on abstract syntax tree (AST) interpretation where a node can rewrite itself to a more specialized or more general node, together with an optimizing compiler that exploits the structure of the interpreter. The compiler uses speculative assumptions and deoptimization in order to produce efficient machine code. Our initial experience suggests that high performance is attainable while preserving a modular and layered architecture, and that new high-performance language implementations can be obtained by writing little more than a stylized interpreter.
Conference Paper
Graph processing is used in many fields of science such as sociology, risk prediction or biology. Although analysis of graphs is important, it also poses numerous challenges, especially for large graphs which have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous BFS (Breadth First Search) algorithm and its implementation written in Java. Java so far is not extensively used in high performance computing, but because of its popularity, portability, and increasing capabilities it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computations in Java) library. In this paper, we present implementation details and compare its scalability and performance with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with MPI code written in C. We present the challenges we faced and the optimizations we used in our implementation that were necessary to obtain good performance.
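As a rough illustration of what "level-synchronous" means here, the following sequential Java sketch expands the whole current frontier before moving to the next level. It is not the PGAS implementation from the abstract and uses no PCJ calls; the class name `LevelSyncBfs`, the adjacency-list representation, and the example graph are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LevelSyncBfs {
    // Returns the BFS level of every vertex, or -1 if it is unreachable.
    public static int[] bfsLevels(List<List<Integer>> adj, int root) {
        int[] level = new int[adj.size()];
        Arrays.fill(level, -1);
        level[root] = 0;
        List<Integer> frontier = new ArrayList<>();
        frontier.add(root);
        int depth = 0;
        while (!frontier.isEmpty()) {
            List<Integer> next = new ArrayList<>();
            // Expand every vertex of the current level before moving on.
            for (int u : frontier) {
                for (int v : adj.get(u)) {
                    if (level[v] == -1) {
                        level[v] = depth + 1;
                        next.add(v);
                    }
                }
            }
            // A parallel version would synchronize all threads here
            // before any of them starts on the next level.
            frontier = next;
            depth++;
        }
        return level;
    }

    public static void main(String[] args) {
        // Undirected example graph: 0-1, 0-2, 1-3; vertex 4 is isolated.
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(0, 3),
                Arrays.asList(0),
                Arrays.asList(1),
                Arrays.<Integer>asList());
        System.out.println(Arrays.toString(bfsLevels(adj, 0)));
    }
}
```

In a distributed version each thread would own part of the vertex set, and the per-level step becomes an exchange of newly discovered vertices followed by a barrier.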
Article
This paper describes the Java MPI bindings that have been included in the Open MPI distribution. Open MPI is one of the most popular implementations of MPI, the Message-Passing Interface, which is the predominant programming paradigm for parallel applications on distributed memory computers. We have added Java support to Open MPI, exposing MPI functionality to Java programmers. Our approach is based on the Java Native Interface, and has similarities with previous efforts, as well as important differences. This paper serves as a reference for the application program interface, and in addition we provide details of the internal implementation to justify some of the design decisions. We also show some results to assess the performance of the bindings.
Article
In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. We use the four-step or six-step FFT algorithms to implement the radix-2, 3 and 5 parallel 1-D complex FFT algorithms. In our parallel FFT algorithms, since we use cyclic distribution, all-to-all communication takes place only once. Moreover, the input data and output data are both in natural order. We also show that the suitability of a parallel FFT algorithm is machine-dependent because of the differences in the architecture of the processor elements in distributed-memory parallel computers. Experimental results of 2^p 3^q 5^r-point FFTs on distributed-memory parallel computers, HITACHI SR2201 and IBM SP2, are reported. We achieved performance of about 130 GFLOPS on a 1024PE HITACHI SR2201 and about 1.25 GFLOPS on a 32PE IBM SP2.