PCJ as a tool for massively parallel data processing

Abstract

In this report we present PCJ (Parallel Computing in Java), a novel tool for scalable data processing in Java. PCJ is a Java library based on the PGAS (Partitioned Global Address Space) programming paradigm; it allows for easy development of computational applications, including BigData processing.
Marek Nowicki2, Łukasz Górski1,2, Magdalena Ryczkowska1,2, Piotr Bała1
1 ICM, University of Warsaw, Poland
2 WMiI, Nicolaus Copernicus University, Toruń, Poland
emails: {faramir,lgorski,gdama,bala}@icm.edu.pl
Keywords: BigData, Java, PCJ, parallel computing, PGAS
1. Introduction
In this report we present PCJ (Parallel Computing in Java), a novel tool for scalable
data processing in Java. PCJ is a Java library based on the PGAS (Partitioned Global
Address Space) programming paradigm; it allows for easy development of computational
applications, including BigData processing.
The demand for increasingly fast data processing has resulted in dedicated tools
and algorithms. One of the most widely used is the MapReduce model [1], together with the
open-source Apache Hadoop platform [2]. The big advantages of this tool are fault tolerance
and the ability to keep thousands of terabytes of data in a distributed file system. However,
achieving high performance for some classes of problems can be a challenge. For this reason
Apache Spark has been developed [3]; it keeps data in memory and thereby speeds up the
analysis.
2. PCJ Library
PCJ [4] is a library that allows developing applications in pure Java. It does
not require any language extensions or a special compiler. The user only has to download a
single jar file [5] and can then develop and run parallel applications on any system with Java
installed. PCJ uses the PGAS (Partitioned Global Address Space) programming model, with
communication details such as thread administration and network programming hidden from
the user. Communication in this model is one-sided and asynchronous. The PCJ library
provides the necessary tools for thread numbering, data broadcast and thread
synchronization. All these features make programming simpler while preserving high
performance.
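The one-sided, asynchronous PGAS pattern described above can be illustrated in plain Java. Note that this sketch is only an analogy: the class and structure below are hypothetical and use JVM threads with a shared array, whereas PCJ's real API (with calls such as PCJ.myId(), PCJ.put() and PCJ.barrier() [4, 5]) works across distributed JVMs.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative sketch of the PGAS pattern: each thread owns one
// partition of a "global" array, writes into a neighbour's partition
// with a one-sided put, and then synchronizes on a barrier. All names
// here are hypothetical; PCJ's actual API differs and spans many JVMs.
public class PgasSketch {
    static final int THREADS = 4;
    // The "global address space": one slot per thread, partitioned by index.
    static final int[] global = new int[THREADS];
    static final CyclicBarrier barrier = new CyclicBarrier(THREADS);

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int id = 0; id < THREADS; id++) {
            final int myId = id;
            workers[id] = new Thread(() -> {
                try {
                    // One-sided put: write into the partition owned by the
                    // next thread, without that thread participating.
                    global[(myId + 1) % THREADS] = myId * 10;
                    barrier.await();  // synchronize all threads
                    // After the barrier, read what our neighbour put here.
                    System.out.println("thread " + myId + " received " + global[myId]);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[id].start();
        }
        for (Thread t : workers) t.join();
    }
}
```

The key point the sketch conveys is that the writer alone drives the data transfer; the owner of the target partition only needs to pass the barrier before reading.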
PCJ applications can run on traditional HPC systems such as x86 clusters,
multicore PCs and other systems, including those with recent Intel KNL processors. The
applications implemented with PCJ and Java scale up to hundreds of thousands of cores. A
good example is a 2D stencil code running on 196k cores of the Cray XC40 at HLRS.
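For reference, a 2D stencil code updates every interior grid point from its neighbours. The sketch below shows one sequential Jacobi-style iteration of the classic 4-point stencil; it is an illustration only, as the parallel PCJ benchmark additionally partitions rows across threads and exchanges boundary ("halo") rows each iteration.

```java
// One iteration of the 2D 4-point stencil: each interior point becomes
// the average of its four neighbours; boundary values stay fixed.
// Sequential sketch only - the PCJ benchmark distributes rows across
// threads and exchanges halo rows between iterations.
public class Stencil2D {
    public static double[][] step(double[][] grid) {
        int n = grid.length;
        double[][] next = new double[n][n];
        for (int i = 0; i < n; i++) {   // copy the fixed boundary
            next[i][0] = grid[i][0];
            next[i][n - 1] = grid[i][n - 1];
            next[0][i] = grid[0][i];
            next[n - 1][i] = grid[n - 1][i];
        }
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                next[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                   + grid[i][j - 1] + grid[i][j + 1]);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] g = new double[8][8];
        for (int j = 0; j < 8; j++) g[0][j] = 1.0;  // hot top edge
        for (int it = 0; it < 100; it++) g = step(g);
        System.out.printf("g[1][4] = %.4f%n", g[1][4]);
    }
}
```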
3. BigData processing with PCJ
The PCJ library has been compared with Apache Hadoop. The performance results
confirm that applications implemented with the PCJ library are much faster (5-500
times, depending on the problem) than their Hadoop counterparts, even for typical map-reduce
applications such as counting words in a file.
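The word-count benchmark mentioned above is the classic map-reduce example. The hedged sketch below shows only the sequential counting kernel in plain Java; a parallel PCJ version would split the input among threads and merge the partial maps, and the class and method names here are illustrative, not taken from the benchmark code.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal word-count kernel, the classic map-reduce benchmark referred
// to above: split the text into words (map) and sum per-word counts
// (reduce). A parallel version merges per-thread partial maps.
public class WordCount {
    public static Map<String, Long> count(String text) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);  // reduce step
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```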
There are also preliminary results for executing applications written with the PCJ library in
the Apache Spark ecosystem. The PCJ-based implementation of the π evaluation is more than 3
times faster. One should note that the application developed with PCJ was run on the
Hadoop cluster as a Spark application using Hadoop task management. Even with such a setup,
the performance of PCJ was significantly higher.
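The π evaluation used in this comparison is the standard Monte Carlo estimate: sample points uniformly in the unit square and count the fraction falling inside the quarter circle. A plain-Java sketch is given below, with the caveat that the parallelization shown (a parallel stream) is illustrative; the PCJ version instead gives each thread its own sample count and reduces the per-thread partial sums.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Monte Carlo estimate of pi: the fraction of random points in the
// unit square that land inside the quarter circle approaches pi/4.
// This is the embarrassingly parallel benchmark from the PCJ-vs-Spark
// comparison; here a parallel stream supplies the parallelism.
public class MonteCarloPi {
    public static double estimate(int samples) {
        long inside = IntStream.range(0, samples).parallel()
                .filter(i -> {
                    double x = ThreadLocalRandom.current().nextDouble();
                    double y = ThreadLocalRandom.current().nextDouble();
                    return x * x + y * y <= 1.0;  // inside quarter circle?
                })
                .count();
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("pi ~= " + estimate(2_000_000));
    }
}
```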
The PCJ library has also been used to parallelize DNA sequence search within a
large database containing more than 20 million records (a 52 GB file), which is a key
element of processing NGS (Next Generation Sequencing) results. The parallelization is based
on work distribution: the input sequences are partitioned and processed using NCBI-BLAST.
Load balancing is ensured by monitoring the execution of the BLAST instances. It was
demonstrated that this design allows the application to scale almost linearly (more than 90%
efficiency for 32 nodes) up to 1536 cores of an HPC cluster. The performance results for the
Cray XC40 are similar, showing at least 90% parallel efficiency for 32 nodes and 75%
parallel efficiency at 128 nodes (6144 cores) [6].
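The work-distribution scheme described above (partition the queries, run unmodified BLAST instances, balance load by monitoring them) can be sketched with a shared task queue from which idle workers pull the next chunk. All names below are illustrative, and processChunk() is a placeholder for launching an external NCBI-BLAST process.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Pull-based dynamic work distribution: input chunks sit in a shared
// queue and each worker takes the next chunk as soon as it finishes
// the previous one, so slow chunks never stall fast workers. This is
// the load-balancing idea behind the scheme described above;
// processChunk() stands in for running an NCBI-BLAST instance.
public class WorkQueue {
    static String processChunk(int chunk) {
        // Placeholder for launching BLAST on one partition of the input.
        return "result-" + chunk;
    }

    public static List<String> run(int chunks, int workers) throws InterruptedException {
        Queue<Integer> tasks = new ConcurrentLinkedQueue<>();
        for (int c = 0; c < chunks; c++) tasks.add(c);
        List<String> results = new CopyOnWriteArrayList<>();

        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                Integer chunk;
                while ((chunk = tasks.poll()) != null) {  // pull next task
                    results.add(processChunk(chunk));
                }
            });
            pool[w].start();
        }
        for (Thread t : pool) t.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10, 4).size() + " chunks processed");
    }
}
```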
4. Conclusions
The PCJ library is a highly scalable, easy-to-use tool for the development of parallel
applications, including BigData processing. The performance of applications implemented
with PCJ is higher than that of traditional data-processing tools such as Hadoop or
Spark. Moreover, development with PCJ is much easier than with other tools: it
requires fewer libraries and minimizes the number of language constructs used. The
resulting code is usually shorter and more readable.
PCJ applications can be developed and tested using a standard Java environment; the
time-consuming installation of infrastructure tools such as Hadoop is not required.
Compared to other tools, the PCJ library has no fault-tolerance mechanism; however, an
experimental version exists and will be integrated into the main release soon.
Acknowledgments
This work has been performed using the PL-Grid infrastructure. Partial support from
CHIST-ERA consortium is acknowledged through NCN grant 2014/14/Z/ST6/00007. MN
acknowledges EuroLab-4-HPC cross-site collaboration grant and PRACE for awarding
access to resource HazelHen at HLRS (Stuttgart, Germany).
References
1. J. Dean, S. Ghemawat: MapReduce: simplified data processing on large clusters.
Communications of the ACM, vol. 51, no. 1, pp. 107-113 (2008).
2. Apache Hadoop. http://hadoop.apache.org/. Accessed: 22 Sept. 2017.
3. Apache Spark. http://spark.apache.org/. Accessed: 22 Sept. 2017.
4. M. Nowicki, P. Bała: Parallel computations in Java with PCJ library. In: W. W.
Smari and V. Zeljkovic (Eds.) 2012 International Conference on High
Performance Computing and Simulation (HPCS), IEEE 2012, pp. 381-387.
5. PCJ library homepage. http://pcj.icm.edu.pl. Accessed: 22 Sept. 2017.
6. M. Nowicki, D. Bzhalava, P. Bała: Massively Parallel Sequence Alignment with
BLAST Through Work Distribution Implemented Using PCJ Library. In: S.
Ibrahim, Kim-Kwang R. Choo, Z. Yan, W. Pedrycz (Eds.) Algorithms and
Architectures for Parallel Processing. ICA3PP 2017. Lecture Notes in Computer
Science, vol. 10393. Springer, Cham, 2017, pp. 503-512.
