With the development of peta- and exascale computational systems, there is growing interest in running Big Data and Artificial Intelligence (AI) applications on them. Big Data and AI applications are implemented in Java, Scala, Python and other languages that are not widely used in High-Performance Computing (HPC), which is still dominated by C and Fortran. Moreover, they are based on dedicated environments such as Hadoop or Spark, which are difficult to integrate with traditional HPC management systems. We have developed the Parallel Computing in Java (PCJ) library, a tool for scalable high-performance computing and Big Data processing in Java. In this paper, we present the basic functionality of the PCJ library with examples of highly scalable applications running on large resources. Performance results are presented for different classes of applications, including traditional computationally intensive (HPC) workloads (e.g. stencil) as well as communication-intensive algorithms such as the Fast Fourier Transform (FFT). We present implementation details and performance results for Big Data type processing running on petascale systems. Examples of large-scale AI workloads parallelized using PCJ are also presented.
PCJ Java library asasolution tointegrate
HPC, Big Data andArticial Intelligence
Marek Nowicki1* , Łukasz Górski2 and Piotr Bała2
Keywords: Parallel computing, Java, Partitioned Global Address Space, PCJ, HPC, Big Data, Artificial Intelligence
Open Access
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Nowicki et al. J Big Data (2021) 8:62
1 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, ul. Chopina 12/18, 87-100 Toruń,
Full list of author information is available at the end of the
Artificial Intelligence (AI), also known as computational intelligence, is becoming more and more popular in a large number of disciplines. It helps to solve problems for which it is impossible, or at least very hard, to write a traditional algorithm. Currently, the deep learning approach is very popular and widely studied. Deep learning, or more broadly speaking, machine learning and neural network approaches, parse very large training data and learn from it by adjusting their internal state. The bigger the volume and variety of the training data, the better the neural network can learn the environment and the better the answers it gives for previously unobserved data. Processing a large amount of data and teaching neural networks with a large number of parameters requires significant computational effort that is not available on laptops or workstations. Therefore, it is important to have a tool that could integrate Big Data processing with Artificial Intelligence workloads on High-Performance Computing (HPC) systems.
There is an ongoing need to adapt existing systems and design new ones that would facilitate AI-based calculations. Research tries to push existing limits in areas such as the performance of heterogeneous systems that employ specialised hardware for accelerating AI-based computation, or I/O and networking performance (to enhance the throughput of training or inference data [1]). Whilst the deployment of new solutions is tied to the advent of new AI-based tools (with Python-based libraries like PyTorch or TensorFlow), their integration with existing HPC systems is not always easy. The Parallel Computing in Java (PCJ) library is presented herein as an HPC-based tool that can be used to bridge the various workloads that currently run on existing systems. In particular, we show that it can be used to distribute neural network training and that it performs well as far as I/O is concerned, especially in comparison with Hadoop/Spark. The former corroborates the idea that the library can be used in concert with existing cluster management tools (like Torque or SLURM) to distribute work across a cluster for neural network training, or to deploy a production-ready model in many copies for fast inference; the latter shows that training data can be handled efficiently.
Recently, as part of the various exascale initiatives, there has been a strong interest in run-
ning Big Data and AI applications on HPC systems. Because of the different tools used in
these areas as well as due to the different nature of the algorithms used, the achievement of
good performance is difficult. Big Data and AI applications are implemented in Java, Scala,
Python and other languages that are not widely used in HPC, which is still dominated by C
and Fortran. Moreover, Big Data and AI frameworks rely on dedicated environments such as
Hadoop or Spark which are difficult to integrate with the traditional HPC management systems. To solve this problem, vendors are putting a lot of effort into rewriting the most time-consuming parts in C/MPI, but this is a laborious task and successes are limited.
There is a lot of effort to adapt Big Data and AI software tools to HPC systems. However, this approach does not remove the limitations of existing software packages and libraries. Significant effort is also put into modifying existing HPC technologies to make them more flexible and easier to use, but success is limited. The popularity of traditional programming languages such as C and Fortran is decreasing. The Message-Passing Interface (MPI), which is the basic parallelization library, is also criticized because of its complicated Application Programming Interface (API) and difficult programming. Users are looking for easy to learn, yet feasible and scalable tools that are more aligned with popular programming languages such as Java or Python. They would like to develop applications on workstations or laptops and then easily move them to large systems, including peta- and exascale ones. Solutions developed by hardware providers move towards unifying operating systems and compilers and bringing them to workstations. Such an approach is not enough and new solutions are necessary.
Our approach presented in this paper is to use a well-established programming language (Java) to provide users with an easy to use, flexible and scalable programming framework that allows for the development of different types of workloads, including HPC, Big Data, AI and others. This opens the field to easy integration of HPC with Big Data and AI applications. Moreover, thanks to Java portability, users can develop a solution on a laptop or workstation and then move it, even without recompilation, to the cloud or to HPC infrastructure, including peta-scale systems.
For these purposes, we have developed the PCJ library [2]. PCJ implements the Partitioned Global Address Space (PGAS) programming paradigm [3], as languages adhering to it are very promising in the context of exascale. In the PGAS model, all variables are private to the owner thread. Nevertheless, some variables can be marked as shared. Shared variables are accessible to other threads of execution, which can address the remote variable and modify it or store it locally. The PGAS model provides simple and easy to use constructs for basic operations, which significantly reduces the programmer’s effort while preserving code performance and scalability. The PCJ library fully complies with Java standards; therefore, the programmer does not have to use additional libraries that are not part of the standard Java distribution.
The PCJ library won the HPC Challenge award in 2014 [4] and has already been successfully used for the parallelization of various applications. A good example is a communication-intensive graph search from the Graph500 test suite. The PCJ implementation scales well and outperforms the Hadoop implementation by a factor of 100 [5], although not all benchmarks were well suited for Hadoop processing. Paper [6] compares the PCJ library and Apache Hadoop using a conventional, widely used benchmark for measuring the performance of Hadoop clusters, and shows that the performance of applications developed with the PCJ library is similar to or even better than that of the Apache Hadoop solution. The PCJ library was also used to develop code for an evolutionary algorithm that was used to find the minimum of a simple function as defined in the CEC’14 Benchmark Suite [7]. Recent examples of PCJ usage include the parallelization of sequence alignment [8]. The PCJ library allowed for the easy implementation of dynamic load balancing for multiple NCBI-BLAST instances spanning multiple nodes, giving results at least 2 times faster than implementations based on static work distribution [9].
In previous works, we have shown that the PCJ library allows for the easy development of computational applications as well as Big Data and AI processing. In this paper, we focus on the comparison of PCJ with Java-based solutions. The performance comparison with C/MPI based codes has been presented in previous papers [2, 10].
The remainder of this paper is organized as follows. After remarks on emerging programming languages and programming paradigms ("Prospective languages and programming paradigms" section), we present the basic functionality of the PCJ library ("Methods" section). The "Results and discussion" section contains subsections with results and discussion of various types of applications. The "HPC workloads" section presents performance results for different classes of applications, including traditional computationally intensive (HPC) workloads (e.g. stencil) as well as communication-intensive algorithms such as the Fast Fourier Transform (FFT). In the "Data analytics" section we present implementation details and performance results for Big Data type processing running on petascale systems. Examples of large-scale AI workloads parallelized using PCJ are presented in the "Artificial Intelligence workloads" section. The section finishes with a description of ongoing work on the PCJ library. The paper concludes in the "Conclusion" section.
Prospective languages andprogramming paradigms
A growing interest in running machine learning and Big Data workloads is associated with new programming languages that have not traditionally been considered for use in high-performance computing. This includes Python, Julia, Java, and some others.
Python is now being viewed as acceptable for HPC applications, due to the 2016 Gordon Bell finalist application PyFR [11], which demonstrated that Python application performance can compete head-to-head against native language applications written in C/C++ and Fortran on the world’s largest supercomputers. However, the multiple versions available have limited backward compatibility, which requires significant administrative effort to handle them. A good example of the problems is the long startup time reported for Python applications [12]. For a large number of nodes it can take hours. Dedicated effort is required to reduce it to an acceptable value (see Fig. 1).
Python remains a single-threaded environment with the global interpreter lock as the main bottleneck. Threads must wait for other threads to complete before starting their assigned work. As a result, the production code is too slow to be useful for large simulations. There are some other implementations with better thread support, but their compatibility could be limited.
The hardware vendors provide tuned versions of Python to improve performance. This is done by using C functions that perform (when coded optimally) at machine-level speeds. These libraries can vectorize and parallelize the assigned workload and take the different hardware architectures into account.
Julia is a programming language that is still new and relatively unknown to many in the HPC community, but it is rapidly growing in popularity. For parallel execution, Julia provides Tasks and other modules that rely on the Julia runtime library. These modules allow computations to be suspended and resumed with full control of inter-task communication without having to manually interface with the operating system’s scheduler. A good example of an HPC application implemented in Julia is the Celeste project [13]. It was able to attain its performance using only Julia source code and the Julia threading model. As a result, it was possible to fully utilize the manycore Intel Xeon Phi processors.
The parallelization tools available for Java include threads and Java Concurrency, which was introduced in Java SE 5 and improved in Java SE 6. There are also solutions based on various implementations of the MPI library [14, 15], a distributed Java Virtual Machine (JVM) [16] and solutions based on Remote Method Invocation (RMI) [17]. Such solutions rely on external communication libraries written in other languages, which causes many problems in terms of usability, portability, scalability, and performance.
Fig. 1 PCJ startup time (from [10]) compared to the Python loading time for the original and modified Python installation (see [12]). The execution time of the hostname command run concurrently on the nodes is plotted for reference
We should also mention solutions motivated by the partitioned global address space approach, represented by Titanium, a scientific computing dialect of Java [18]. Titanium defines new language constructs and has to use a dedicated compiler, which makes it difficult to follow recent changes in the Java language.
Python, Julia and, to some extent, Java follow a well-known path. Parallelization is based on the independent task model with limited communication capabilities. This significantly reduces the classes of algorithms that can be implemented to trivially parallel ones. An alternative approach is based on interfacing with an MPI library, thus using a message-passing model.
Recently, programming models based on PGAS have been gaining popularity. It is expected that PGAS languages will be more important at exascale because of their distinct features and because the development effort is lower than for other approaches. The PGAS model can be supported by a library such as SHMEM [19], Global Arrays [20] or Charm++ [21], or by a language such as UPC [22], Fortran [23] or Chapel [24]. PGAS systems differ in the way the global namespace is organized. Some, such as SHMEM or Fortran, provide a local view of data while others provide a global view of data.
Until now, there was no successful realization of the PGAS programming model for Java. The PCJ library, developed by us, is a successful implementation providing good scalability and reasonable performance. Another prospective implementation is APGAS, a library offering an X10-like programming solution for Java [25].
The PCJ library
PCJ[2] is an OpenSource Java library available under the BSD license with the source code
hosted on GitHub. PCJ does not require any language extensions or special compiler. e
user has to download the single jar file and then he can develop and run parallel applications
on any system with Java installed. Alternatively, build automation tool like Maven or Gra-
dle can be used, as the library is deployed into Maven Central Repository (group:
icm.pcj, artifact: pcj). e programmers are provided with the PCJ class with a set of
methods to implement necessary parallel constructs. All technical details like threads admin-
istration, communication, and network programming are hidden from the programmers.
The PCJ library can be considered a simple extension to Java for writing parallel programs. It provides the necessary tools for easy implementation of the data and work partitioning best suited to the problem. PCJ does not provide automatic tools for data distribution or task parallelization, but once the parallel algorithm is given, it allows for its efficient implementation.
The PCJ library follows the common PGAS paradigm (see Fig. 2). The application is run as a collection of threads, called here PCJ threads. Each PCJ thread owns a local copy of variables, and each copy has a different location in physical memory. This also applies to threads run within the same JVM. The PCJ library provides methods to start PCJ threads in one JVM or in a parallel environment using multiple JVMs. PCJ threads are created at application launch and stopped when execution terminates. The library also provides basic methods to manage threads, such as starting execution, finding the total number of threads and the number of the current PCJ thread, as well as methods to manage groups of threads.
The PCJ library provides methods to synchronize execution (PCJ.asyncBarrier()) and to exchange data between threads. The communication is one-sided and asynchronous and is performed by calling the PCJ.asyncPut(), PCJ.asyncGet() and PCJ.asyncBroadcast() methods. Synchronous (blocking) versions of the communication methods are also available. Data exchange can be done only for specially marked variables. Local fields are exposed for remote addressing with the use of the @Storage and @RegisterStorage annotations.
PCJ provides mechanisms to control the state of a data transfer, in particular to assure the programmer that an asynchronous data transfer has finished. For example, a thread can get a shared variable and store it in a PcjFuture<double[]> object. Then, the received value is copied to a local variable. The whole process can be overlapped with other operations, e.g. calculations. The programmer can check the status of the data transfer using PcjFuture’s methods.
The PCJ API follows successful PGAS implementations such as Co-Array Fortran or X10; however, dedicated effort has been made to align it with the experience of Java programmers. The full API is presented in Table 1.
With version 5.1 of the PCJ library, we provide users with methods for collective operations. These methods implement the most efficient communication using a binary tree, which scales with the number of nodes n as $\log_2 n$. This reduction algorithm is faster than simple iteration over the available threads, especially for a large number of PCJ threads running on a node. Collective methods collect data within a physical node before sending it to other nodes, which reduces the number of communications performed between nodes, i.e. between different JVMs.
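Why a binary tree gives logarithmic scaling can be seen in a small sketch in plain Java. It is not the PCJ implementation: the tree layout, in which node i has parent (i - 1) / 2, is an assumption, and the class and method names are hypothetical.

```java
final class TreeReduceSketch {
    // Sums one value per node by passing partial sums up a binary tree
    // in which node i has parent (i - 1) / 2 (an assumed layout).
    // Requires at least one node.
    static long reduce(long[] valuePerNode) {
        long[] partial = valuePerNode.clone();
        // Children always have larger indices than their parent, so a
        // backward sweep sees every subtree complete before forwarding it.
        for (int i = partial.length - 1; i > 0; i--) {
            partial[(i - 1) / 2] += partial[i];
        }
        return partial[0];
    }

    // Number of tree levels a partial sum crosses on its way from the
    // deepest node to the root; this equals floor(log2(nodes)).
    static int depth(int nodes) {
        int d = 0;
        for (int i = nodes - 1; i > 0; i = (i - 1) / 2) {
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(reduce(new long[] {1, 2, 3, 4, 5}));
        System.out.println(depth(2048));
    }
}
```

A sequential sweep over n nodes needs n - 1 dependent communication steps, whereas in the tree the longest chain of dependent steps is only the depth, since siblings can forward their partial sums concurrently.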
Implementation details
The use of the Java language requires a specific implementation of the basic PGAS functionality, namely multi-threaded execution and communication between threads.
PCJ allows for different scenarios, such as running multiple threads in a single JVM or running multiple JVMs on a single physical node. Starting a JVM on a remote node relies on a Secure Shell (SSH) connection to the machine. It is necessary to set up passwordless login, e.g. by using authentication keys without a passphrase. As presented in Fig. 1, the startup time is lower than for Python. It grows with the number of nodes, but it should be noted that the PCJ startup time includes the initial communication and synchronization of threads, which is not included for the other presented solutions.
It is also possible to use the execution schema available on supercomputers or clusters (like aprun, srun, mpirun or mpiexec) that starts the selected application on all nodes allocated for the job. In this situation, the start() method should be used instead of deploy(). In that case, however, internet addresses instead of loopback addresses should be used to describe the nodes. The file with node descriptions has to be prepared by the user, e.g. by saving the output of the hostname command executed on the allocated nodes.
The architectural details of the communication between PCJ threads are presented in Fig. 3. The intranode communication is implemented using the Java Concurrency mechanism. Sending objects from one thread to another requires cloning the object’s data. Copying just the object reference could cause concurrency problems in accessing the object. The PCJ library makes sure that the object is deeply copied by serializing the object and then deserializing it on the other thread. This is done partially by the sending thread (serializing) and partially by local workers (deserializing). This way of cloning data is safe, as the data is deeply copied: the other thread has its own copy of the data and can use it independently.
Table 1 Summary of the elementary PCJ elements
Fig. 2 Diagram of PCJ computing model (from [48]). Arrows present possible communication using put(...) or get(...) methods acting on shared variables
The Object.clone() method available in Java is not sufficient, as it does not enforce the creation of a deep copy of the object. For example, it creates only a shallow copy of arrays; therefore the data stored in the arrays are not copied between threads. The same holds for the implementation of this method in standard classes like java.util.ArrayList. Moreover, it requires implementing the java.lang.Cloneable interface for all communicable classes and overriding the clone() method with a public modifier, and that method also has to copy all mutable objects into the clone. The serialization/deserialization mechanism is more general and requires only that all used classes be serializable, i.e. implement the java.io.Serializable interface, and in most cases does not require writing serialization handling methods. Additionally, serialization, i.e. changing objects into a byte stream (array), is also a requirement for sending data between nodes.
The communication between nodes uses standard network communication with sockets. The data is serialized by the sending thread and the transferred data is deserialized by remote workers. The network communication is performed using the Java New I/O classes (i.e. java.nio.*). The details of the algorithms used to implement PCJ communication are described in [26].
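The difference between a shallow clone and a copy made through serialization can be demonstrated with standard Java classes only. The sketch below is an illustration of the general technique, not PCJ code; the class name DeepCopySketch and the deepCopy() helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class DeepCopySketch {
    // Serializes the object to a byte array and reads it back,
    // which yields a copy that shares no mutable state with the original.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("copy failed", e);
        }
    }

    public static void main(String[] args) {
        int[][] original = {{1, 2}, {3, 4}};
        int[][] shallow = original.clone();  // inner rows are shared
        int[][] deep = deepCopy(original);   // inner rows are duplicated
        original[0][0] = 99;
        System.out.println(shallow[0][0]);   // sees the change: 99
        System.out.println(deep[0][0]);      // unaffected: 1
    }
}
```

The same byte array that makes the copy independent is also what would be written to a socket when the receiver lives in another JVM, which is why one mechanism can serve both the intranode and the internode case.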
An example parallel application which sums up n random numbers is presented in Listing 1. The PcjExample class implements the StartPoint interface, which provides the methods needed to start the application in parallel. PCJ.executionBuilder() is used to set up the execution environment: the class which is used as the main class for the parallel application and a list of nodes, provided here in the nodes.txt file.
Fig. 3 Diagram of PCJ communication model (from [6]). Arrows present local and remote communication
The work is distributed in a block manner: each PCJ thread sums up part of the data. The parallelization is performed by changing the length of the main loop defined in line 29. The length of the loop is adjusted automatically to the number of threads used for execution.
The partial sums are accumulated in the variable a, local to each PCJ thread. The variable a can be used in get/put/broadcast operations as it is defined in lines 12–14. Line 9 ensures that this set of variables can be used in the class PcjExample.
To ensure that all threads have finished calculating their partial sums, the PCJ.barrier() method is used in line 32. The partial sums are then accumulated at PCJ thread #0 using the PCJ.reduce() method (line 34) and printed out.
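The block distribution, barrier and final reduction described above can be mimicked in plain Java. The following is only a sketch under stated assumptions: ordinary java.lang.Thread objects stand in for PCJ threads, Thread.join() plays the role of the barrier, no PCJ calls are used, and the names BlockSumSketch, partialSum and parallelSum, as well as the seed parameter, are hypothetical.

```java
import java.util.Random;

final class BlockSumSketch {
    // Sums this thread's block of n pseudo-random numbers. The loop
    // length n / threadCount shrinks as more threads are used; the
    // remainder is ignored for brevity, so n is assumed to be
    // divisible by threadCount.
    static double partialSum(long n, int threadCount, int threadId, long seed) {
        Random random = new Random(seed + threadId);
        long localLength = n / threadCount;
        double a = 0.0;
        for (long i = 0; i < localLength; i++) {
            a += random.nextDouble();
        }
        return a;
    }

    static double parallelSum(long n, int threadCount, long seed) {
        double[] partial = new double[threadCount];
        Thread[] threads = new Thread[threadCount];
        for (int id = 0; id < threadCount; id++) {
            final int myId = id;
            threads[id] = new Thread(
                    () -> partial[myId] = partialSum(n, threadCount, myId, seed));
            threads[id].start();
        }
        try {
            // Waiting for all workers plays the role of the barrier.
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        // The reduction that a single designated thread would perform.
        double sum = 0.0;
        for (double p : partial) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1_000_000, 4, 42L));
    }
}
```

Unlike in the PGAS model, the partial results here live in one shared array inside a single JVM; the sketch shows only how the loop length and the reduction depend on the number of threads.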
The PCJ library handles node or thread failure with a fail-safe mechanism. Without that mechanism, the whole computation could be stuck in an unrecoverable state. When computations are executed on a cluster system, that situation could waste Central Processing Unit (CPU) hours without any useful work being done until the job time limit is reached.
Version 5.1 of the PCJ library adds a fail-safe mechanism that causes the whole computation to finish gracefully when a failure occurs. The fail-safe mechanism is based on alive and abort messages, i.e. a heartbeat mechanism.
The alive message is periodically sent by each node to its neighbour nodes, i.e. its parent and child nodes; e.g. the neighbours of node 1 are nodes 0, 3 and 4 (cf. Fig. 4). If a node does not receive an alive message from one of its neighbour nodes within a predetermined, configurable time, it assumes the failure of that node. Failure of a node is also assumed when an alive message cannot be sent to the node, or when one of the node’s PCJ threads exits with an uncaught exception.
When a failure occurs, the node that discovers the breakdown removes the failed node from its neighbours’ list, immediately sends abort messages to the rest of its neighbours, and interrupts the PCJ threads that are executing on the node. Each node that receives an abort message removes the node that sent the message from its neighbours’ list (to avoid sending a message back to an already notified node), sends an abort message to all remaining neighbours and then interrupts its own PCJ threads.
The fail-safe mechanism allows for quicker shutdown after a breakdown, so the cluster’s CPU-hours are not wasted. Users can disable the fail-safe mechanism by setting an appropriate flag of the PCJ execution.
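The neighbour relation and the propagation of abort messages can be sketched in plain Java. The sketch assumes a binary tree in which node i has parent (i - 1) / 2 and children 2i + 1 and 2i + 2; this layout is an assumption chosen because it reproduces the example given above (neighbours of node 1 are 0, 3 and 4). No real messages are sent, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FailSafeSketch {
    // Neighbours of a node: its parent (if any) followed by its
    // existing children, under the assumed binary tree layout.
    static List<Integer> neighbours(int node, int nodeCount) {
        List<Integer> result = new ArrayList<>();
        if (node > 0) {
            result.add((node - 1) / 2);
        }
        for (int child = 2 * node + 1; child <= 2 * node + 2; child++) {
            if (child < nodeCount) {
                result.add(child);
            }
        }
        return result;
    }

    // Order in which nodes learn about the failure when `detector`
    // notices that `failed` is down and abort messages spread from
    // neighbour to neighbour, never going back to a notified node.
    static List<Integer> abortOrder(int detector, int failed, int nodeCount) {
        List<Integer> notified = new ArrayList<>();
        boolean[] seen = new boolean[nodeCount];
        seen[failed] = true;   // nothing is sent to the failed node
        seen[detector] = true;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(detector);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            notified.add(current);
            for (int next : neighbours(current, nodeCount)) {
                if (!seen[next]) {
                    seen[next] = true;
                    queue.add(next);
                }
            }
        }
        return notified;
    }

    public static void main(String[] args) {
        System.out.println(neighbours(1, 7));
        System.out.println(abortOrder(0, 1, 7));
    }
}
```

In the second call, nodes 3 and 4 are not reached from node 0 because their only link to the rest of the tree is the failed node 1; in the mechanism described above they would detect the failure themselves when alive messages from node 1 stop arriving.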
Results anddiscussion
The performance results have been obtained using the Cray XC40 systems at ICM (University of Warsaw, Poland) and HLRS (University of Stuttgart, Germany). The computing nodes (boards) are equipped with two Intel Xeon E5-2690 v3 (ICM) or Intel Haswell E5-2680 (HLRS) processors; each processor contains 12 cores. In both cases, hyperthreading is available (2 threads per core). Both systems have the Cray Aries interconnect installed. The PCJ library has also been tested on other architectures such as Power8 or Intel KNL [27]. However, we decided to present here results obtained using Cray XC40 systems, since one of the first exascale systems will be a continuation of this architecture [28]. We have used Java 1.8.0_51 from Oracle for PCJ and Oracle JDK 10.0.2 for APGAS. For C/MPI we have used the Cray MPICH implementation in versions 8.3 and 8.4 for the ICM and HLRS machines, respectively. We have used OpenMPI version 4.0.0, which provides Java bindings for MPI, to collect data for the Java/MPI execution.
HPC workloads
2D stencil
As an example of a 2D stencil algorithm we have used the Game of Life, which can be seen as a typical 9-point 2D stencil (the 2D Moore neighborhood). The Game of Life is a cellular automaton devised by John Conway [29]. In our implementation [30] the board is not infinite: it has a maximum width and height. Each thread owns a subboard, a part of the board divided in a uniform way using block distribution. Although there are known fast algorithms and optimizations that can save computational time when generating the next universe state, like Hashlife or memorization of the changed cells, we have decided to use a straightforward implementation with a lookup of the state of each cell. However, to save memory, each cell is represented as a single bit, where 0 and 1 mean that the cell is dead and alive, respectively.
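A straightforward, bit-packed Game of Life step of the kind described above can be sketched as follows. This is an illustrative single-threaded sketch, not the implementation from [30]: the class name LifeSketch and its methods are hypothetical, the board is stored row-major in a long[] with one bit per cell, and cells outside the board are assumed to be dead.

```java
final class LifeSketch {
    final int width;
    final int height;
    final long[] bits;   // one bit per cell, row-major; 1 = alive

    LifeSketch(int width, int height) {
        this.width = width;
        this.height = height;
        this.bits = new long[(width * height + 63) / 64];
    }

    boolean alive(int x, int y) {
        // Assumption: cells beyond the board edge count as dead.
        if (x < 0 || y < 0 || x >= width || y >= height) {
            return false;
        }
        int index = y * width + x;
        return (bits[index >>> 6] & (1L << (index & 63))) != 0;
    }

    void set(int x, int y, boolean value) {
        int index = y * width + x;
        if (value) {
            bits[index >>> 6] |= 1L << (index & 63);
        } else {
            bits[index >>> 6] &= ~(1L << (index & 63));
        }
    }

    // Computes the next state by looking up the eight Moore
    // neighbours of every cell (a 9-point stencil with the cell itself).
    LifeSketch step() {
        LifeSketch next = new LifeSketch(width, height);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        if ((dx != 0 || dy != 0) && alive(x + dx, y + dy)) {
                            n++;
                        }
                    }
                }
                next.set(x, y, n == 3 || (n == 2 && alive(x, y)));
            }
        }
        return next;
    }
}
```

As a usage example, a horizontal row of three live cells (a "blinker") placed in the middle of a 5 × 5 board turns into a vertical row after one call to step(). In a distributed version, each thread would hold one such subboard and the out-of-range lookups at interior borders would read the halo cells received from neighbouring threads instead of returning false.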
After generating the new universe state, the border cells of subboards are exchanged asynchronously between the proper threads. The threads that have cells in the first and last columns and rows of the universe do not exchange cell states with the opposite threads. The state of neighbour cells that would lie beyond the universe edge is treated as dead.
We have measured the performance as the total number of cells processed per unit of time (cells/s). For each test, we performed 11 time steps. We warmed up the Java Virtual Machine to allow the JVM to use Just-in-Time (JIT) compilation to optimize the run instead of executing in interpreted mode. We also ensured that the Garbage Collector (GC) did not have much impact on the measured performance. To do so, we took the peak performance (the maximum of the per-step performance) for the whole simulation. We have used 48 working threads per node.
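The peak-performance metric described above amounts to a short calculation, sketched here in plain Java. The class and method names are hypothetical and the step durations in the usage example are made-up illustrative numbers, not measurements from the paper.

```java
final class PeakPerformanceSketch {
    // Cells processed per second during one time step.
    static double cellsPerSecond(long cells, long stepNanos) {
        return cells / (stepNanos / 1e9);
    }

    // Peak (maximum) per-step performance. Taking the maximum discounts
    // slow warm-up steps run in interpreted mode and steps disturbed by
    // garbage collection.
    static double peak(long cells, long[] stepNanos) {
        double best = 0.0;
        for (long nanos : stepNanos) {
            best = Math.max(best, cellsPerSecond(cells, nanos));
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative durations only: a slow first step, then faster ones.
        long[] durations = {2_000_000_000L, 1_000_000_000L, 500_000_000L};
        System.out.println(peak(1000, durations));
    }
}
```

Reporting the maximum rather than the mean makes the figure insensitive to how many warm-up steps happen to fall inside the measured window, at the cost of hiding run-to-run variation, which is why the per-step statistics are discussed separately.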
Figure 5 presents performance comparison of Game of Life applications for
604, 800
604, 800
cells universe. e performance for both implementations (PCJ and
Java/MPI) is very similar and results in almost ideal scalability. C/MPI version presents
3-times higher performance and similar scalability. e performance data shows scal-
ability up to 100,000 threads (on 2048 nodes). For a larger number of threads, the paral-
lel efficiency decreases due to the small workload run on each processor compared to
the communication time required for halo exchange. e scaling results obtained in the
weak scaling mode (i.e. with a constant amount of work allocated to each thread despite
the thread number) show good scalability beyond 100,000 thread limit[10]. e ideal
scaling dashed line for PCJ is plotted for reference. Presented results show ability of run-
ning large scale HPC applications using Java and the PCJ library.
Inset in Fig.5 presents the performance statistics calculated based on 11 time steps of
the Game of Life application executed on 256 nodes (12,288 threads). e ends of whisk-
ers are minimum and maximum values, a cross (
) represents an average value, a box
Fig. 4 Communication tree with selected neighbours of node-#1 for fail-safe mechanism. Green arrows
represent the node-#1 alive messages sent to its neighbour nodes. Blue arrows represent the node-#1 alive
messages received from the neighbour nodes
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 12 of 21
Nowickietal. J Big Data (2021) 8:62
represents values between 1st and 3rd quartiles, and a band inside the box is a median
value. In the case of C/MPI, the box and whiskers are not visible, as the execution shows
the same performance for all of the execution steps. In the case of JVM executions (PCJ
and Java/MPI), minimum values come from the very first steps of execution, when the
execution was made in interpreted mode. However, the JIT compilation quickly opti-
mized the run and the vast majority of steps were run with the highest performance. It is
clearly visible that Java applications, after JIT compilation, has very stable performance
results as the maximum, median and 1st and 3rd quartiles data are almost indistinguish-
able in the figure.
Fast Fourier Transform
The main difficulty in efficient parallelization of FFT comes from the global character
of the algorithm, which involves extensive all-to-all communication. One of the efficient
distributed FFT implementations available is based on the algorithm published by
Takahashi and Kanada[31]. It is used as a reference MPI implementation in the HPC
Challenge Benchmark[32], a well-known suite of tests for assessing the performance of HPC
systems. This implementation is treated as a baseline for the tests of the PCJ version
described herein (itself based on[33]), with the performance of the all-to-all exchange being
the key factor.
In the case of PCJ code [34] we have chosen, as a starting point, PGAS implemen-
tation developed for Coarray Fortran 2.0[33]. e original Fortran algorithm uses a
radix 2 binary exchange algorithm that aims to reduce interprocess communication and
is structured as follows: firstly, a local FFT calculation is performed based on the bit-
reversing permutation of input data; after this step all threads perform data transposi-
tion from block to cyclic layout, thus allowing for subsequent local FFT computations;
finally, a reverse transposition restores data to is original block layout[33]. Similarly to
Random Access implementation, inter-thread communication is therefore localized in
the all-to-all routine that is used for a global conversion of data layout, from block to
cyclic and vice verse. Such implementation allows one to limit the communication, yet
makes the implementation of all-to-all exchange once again central to the overall pro-
gram’s performance.
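The local building block mentioned above, a radix-2 FFT preceded by a bit-reversing permutation, can be sketched as follows. This is a single-threaded textbook illustration under our own assumptions (separate arrays for real and imaginary parts, a hypothetical class `LocalFft`); it is neither the Coarray Fortran algorithm nor the PCJ code, and it omits the block-to-cyclic transposition entirely.

```java
/** Illustrative single-threaded radix-2 FFT; not the distributed PCJ code. */
final class LocalFft {

    /** Reorders both arrays in place by the bit-reversal permutation.
     *  The length is assumed to be a power of two. */
    static void bitReverse(double[] re, double[] im) {
        int n = re.length;
        int bits = Integer.numberOfTrailingZeros(n);
        for (int i = 0; i < n; i++) {
            int j = Integer.reverse(i) >>> (32 - bits);
            if (i < j) { // swap each pair only once
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
    }

    /** In-place forward FFT of the complex sequence (re, im). */
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n != im.length || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        bitReverse(re, im);
        // Butterfly stages over sub-transforms of length 2, 4, ..., n.
        for (int len = 2; len > 0 && len <= n; len <<= 1) {
            double angle = -2.0 * Math.PI / len;
            for (int start = 0; start < n; start += len) {
                for (int k = 0; k < len / 2; k++) {
                    double wr = Math.cos(angle * k);
                    double wi = Math.sin(angle * k);
                    int a = start + k;
                    int b = a + len / 2;
                    double tr = re[b] * wr - im[b] * wi;
                    double ti = re[b] * wi + im[b] * wr;
                    re[b] = re[a] - tr;
                    im[b] = im[a] - ti;
                    re[a] += tr;
                    im[a] += ti;
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] re = {1, 1, 1, 1};
        double[] im = new double[4];
        fft(re, im);
        System.out.println(java.util.Arrays.toString(re));
    }
}
```

In the distributed setting each thread would apply such a routine to its local portion of the data, with the layout conversions between the local phases handled by the all-to-all routine.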
The results for the complex one-dimensional FFT of
elements (Fig. 6) show how the
three alternative PCJ all-to-all implementations compare in terms of scalability. The blocking
and non-blocking ones iterate through all other threads to read data from their shared
memory areas (PcjFutures are used in the non-blocking version). Hypercube-based
communication utilizes a series of pairwise exchanges to avoid network congestion.
While non-blocking communication achieved the best peak performance, the hypercube-based
solution exploited the available computational resources to the greatest extent,
reaching peak performance at 4096 threads compared to 1024 threads in the case
of non-blocking communication. The Java/MPI code uses the same algorithm as PCJ for
calculation and all-to-all exchange. It is implemented using the native MPI primitive. The
scalability of the PCJ implementation follows the results of the reference C/MPI code as well
as those of Java/MPI. The total execution time for Java is larger than for the all-native
implementation irrespective of the underlying communication library. Presented results
confirm that the performance and scalability of the PCJ and Java/MPI implementations are
similar. The PCJ library is easier to use, less error prone and does not require libraries
external to Java such as MPI. Therefore it is a good alternative to MPI. The Java implementations
are slower than HPCC, which is implemented in C. This comes from the different
ways of storing and accessing data.
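The idea behind the hypercube-based exchange discussed above, a series of pairwise exchanges instead of every thread contacting every other thread, can be illustrated with a small single-process simulation. It shows one common store-and-forward formulation under our own assumptions (the thread count is a power of two, and the class `HypercubeAllToAll` is hypothetical); the actual PCJ implementation may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulated hypercube all-to-all exchange; thread count must be a power of two. */
final class HypercubeAllToAll {

    /** A message travelling from thread src to thread dst. */
    static final class Message {
        final int src;
        final int dst;
        Message(int src, int dst) { this.src = src; this.dst = dst; }
    }

    /** Returns, for each thread, the messages it holds after log2(p) rounds. */
    static List<List<Message>> exchange(int p) {
        if (p < 1 || Integer.bitCount(p) != 1) {
            throw new IllegalArgumentException("thread count must be a power of two");
        }
        // Initially every thread holds one message for every destination.
        List<List<Message>> held = new ArrayList<>();
        for (int r = 0; r < p; r++) {
            List<Message> own = new ArrayList<>();
            for (int d = 0; d < p; d++) own.add(new Message(r, d));
            held.add(own);
        }
        // One round per hypercube dimension; in each round a thread talks to one partner only.
        for (int bit = 1; bit < p; bit <<= 1) {
            List<List<Message>> next = new ArrayList<>();
            for (int r = 0; r < p; r++) next.add(new ArrayList<>());
            for (int r = 0; r < p; r++) {
                int partner = r ^ bit;
                for (Message m : held.get(r)) {
                    // Forward the message if its destination lies on the partner's side.
                    boolean forward = ((m.dst ^ r) & bit) != 0;
                    next.get(forward ? partner : r).add(m);
                }
            }
            held = next;
        }
        return held;
    }

    public static void main(String[] args) {
        List<List<Message>> result = exchange(8);
        System.out.println("thread 0 holds " + result.get(0).size() + " messages");
    }
}
```

Each thread communicates with only log2(p) partners, at the cost of forwarding data on behalf of others, which is the trade-off against the direct blocking and non-blocking variants.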
Data analytics
WordCount is traditionally used for demonstrative purposes to showcase the basics of
the map-reduce programming paradigm. It works by reading an input file on a line-by-line
basis and counting individual word occurrences (map phase). The reduction
is performed by summing the partial results calculated by worker threads. The full source
code of the application is available on GitHub[35]. Herein the comparison between
the performance of PCJ and APGAS is presented, with the C++/MPI version shown as
a baseline.
is the basic implementation,
is a version enhanced
with dynamic load-balancing capabilities. The APGAS library, as well as its implementation
of the WordCount code, are based on the prior work[25]. The APGAS code was
run using SLURM in Multiple Programs, Multiple Data (MPMD) mode, with different commands
used to start the computations and the remote APGAS places. A range of
node counts was tested for each number of threads, and the best
results achieved are presented. Due to APGAS's requirements, Oracle JDK 10.0.2 was
used in all cases. The tests use the 3.3 MB UTF-8 encoded text of the English translation of
Tolstoy's War and Peace as the textual corpus for the word counting code. They were performed
in a strong scalability regime, with the input file being read 4096 times and
all threads reading the same file. The file content is not preloaded into the application
memory before the benchmark.
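The map and reduce phases of WordCount described above can be sketched in plain Java as follows. This is a minimal single-process illustration, not the benchmarked code from the repository: the class `WordCountSketch` is hypothetical, and the regular expression `\p{L}+` is our assumption about what counts as a word.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative map and reduce steps of WordCount; not the benchmarked code. */
final class WordCountSketch {

    // Assumed tokenization rule: a word is a maximal run of letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Map phase: counts the words in the lines handled by one worker. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = WORD.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Reduce step: adds the partial counts of 'from' into 'into' and returns 'into'. */
    static Map<String, Integer> merge(Map<String, Integer> into, Map<String, Integer> from) {
        from.forEach((word, count) -> into.merge(word, count, Integer::sum));
        return into;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = countWords(Arrays.asList("war and peace", "peace"));
        Map<String, Integer> b = countWords(Arrays.asList("war, war"));
        System.out.println(merge(a, b));
    }
}
```

In a parallel run each worker thread would call the map step on its share of the input, and the partial maps would then be combined pairwise with the reduce step.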
Fig. 5 Performance results of Game of Life implemented with PCJ, C/MPI and Java/MPI for a 604,800 × 604,800
cells universe. The strong scaling results were obtained on Cray XC40 at ICM (PCJ, C/MPI and Java/MPI) and
Cray XC40 at HLRS (PCJ). Ideal scaling for PCJ is drawn (dashed line). Inset plot presents the performance
statistics calculated based on 11 time steps on 256 nodes (12,288 threads). In the inset plot, ends of whiskers
are minimum and maximum values, a cross (×) represents an average value, a box represents values between
1st and 3rd quartiles, and a band inside the box is a median value
The performance of the reduction phase is key to the overall performance[10], and
the best results in the case of PCJ are obtained using binary tree communication. The APGAS
solution uses the reduction as implemented in[25] (that work reports worse
performance of PCJ, due to the use of a simpler and therefore less efficient reduction).
e results presented in Fig.7 show good scalability of the PCJ implementation.
PCJ’s performance was better when compared to APGAS, which can be tracked to the
PCJ’s reduction implementation. Regarding native code, C++ was chosen as a better-
suited language for this task than C, because of its built-in map primitives and higher
level string manipulation routines. While C++ code scales ideally, its poor perfor-
mance when measured in absolute time can be traced back to the implementation of
line-tokenizing. All the codes (PCJ, APGAS, C++), in line with our earlier works[5],
consistently use regular expressions for this task.
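The binary tree communication pattern used for the reduction can be illustrated with a single-process sketch in which list positions stand for worker threads. The class `TreeReduce` is hypothetical and no actual communication takes place; in the parallel code each combine step would involve one thread fetching the partial result of another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

/** Illustrative binary-tree reduction over per-worker partial results. */
final class TreeReduce {

    /**
     * Combines the partial results in rounds with stride 1, 2, 4, ...;
     * in each round worker i absorbs the result of worker i + stride.
     * The final result ends up at worker 0 and is returned.
     */
    static <T> T reduce(List<T> partials, BinaryOperator<T> combine) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partial results");
        }
        List<T> values = new ArrayList<>(partials);
        for (int stride = 1; stride < values.size(); stride *= 2) {
            // All pairs of one round are independent and could run concurrently.
            for (int i = 0; i + stride < values.size(); i += 2 * stride) {
                values.set(i, combine.apply(values.get(i), values.get(i + stride)));
            }
        }
        return values.get(0);
    }

    public static void main(String[] args) {
        System.out.println(reduce(Arrays.asList(1, 2, 3, 4, 5), Integer::sum));
    }
}
```

With p workers the tree needs about log2(p) rounds, whereas a reduction in which one thread collects all partial results in turn needs p - 1 sequential steps.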
One should note that a different set of results, obtained on a Hadoop cluster, shows
that the PCJ implementation is at least 3 times faster than the Hadoop one[5] and Spark
Articial Intelligence workloads
AI is currently a vibrant area of research, gaining a lot from advances in the processing
capabilities of modern hardware. The PCJ library was tested in the area of artificial
intelligence to ensure that it provides AI workloads with sufficient processing potential,
able to exploit the future exascale systems. In this respect, two types of workloads were
considered. Firstly, stemming from the traditional mode of AI research aimed at discovering
the inner workings of real physiological systems, the library was used to aid
researchers in the task of modeling the C. Elegans neuronal circuitry. Secondly, it was
used to power the training of a modern artificial neural network by distributing the gradient
descent calculations.
Fig. 6 Performance of the 1D FFT implemented using different algorithms. The data for the HPCC
implementation using C/MPI as well as Java/MPI is presented for reference. The array of
elements was
used. Ideal scaling for PCJ is drawn (dashed line). Inset presents communication time as a percentage of total
execution time. The benchmarks were run on Cray XC40 at ICM
Neural networks: modeling the connectome of C. Elegans
The nematode C. Elegans is a model organism whose neuronal development has been studied
extensively and remains the only organism with a fully known connectome. There
are currently experiments that aim to link its structure to the actual worm's behavior.
In one of those experiments, the worm's motor neurons were ablated using a laser,
which changed its movement patterns[36]. The results of those experiments
allowed a biophysics expert to create a mathematical model of the relevant connectome
fragment. The model was defined by a set of ordinary differential equations with 8 parameters.
The values of those parameters were key to the model's accuracy, yet they were impossible
to calculate using traditional numerical or analytical methods. Therefore a differential
evolution algorithm was used to explore the solution space and fit the model's
parameters so that its predictions are in line with the empirical data. The mathematical
model has been implemented in Java and parallelized with the use of the PCJ library[36,
37]. It should be noted that the library made it possible to rapidly (ca. 2 months) prototype the
connectome model and align it with the shifting requirements of the biophysics expert.
The implementation's performance is shown in Fig. 8, where it is
expressed as the number of tested configurations per second. The experimental dataset
amounted to a population of 5 candidate vectors affiliated with each thread that was
evaluated through 5 iterations in a weak scaling regime. Scaling close to the ideal was
achieved irrespective of the hyperthreading status, as its overhead in this scenario
is minimal. The outlier visible in the case of 192 threads is most probably due to the
stochastic nature of the differential evolution algorithm and disparities in model
evaluation time for concrete sets of parameters.
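For readers unfamiliar with differential evolution, a minimal single-threaded sketch of the classic rand/1/bin variant is given below. Everything in it is an assumption made for illustration: the class `DifferentialEvolutionSketch`, the constants `f` and `cr`, the search range and the toy sphere objective, which stands in for the connectome model (not reproduced here). It is not the code used in the experiments.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Illustrative differential evolution (rand/1/bin) minimizing a cost function. */
final class DifferentialEvolutionSketch {

    static double[] minimize(ToDoubleFunction<double[]> cost, int dim,
                             int popSize, int generations, long seed) {
        final double f = 0.8;   // differential weight (assumed value)
        final double cr = 0.9;  // crossover probability (assumed value)
        if (popSize < 4) {
            throw new IllegalArgumentException("population needs at least 4 members");
        }
        Random rnd = new Random(seed);
        double[][] pop = new double[popSize][dim];
        double[] fit = new double[popSize];
        for (int i = 0; i < popSize; i++) {
            for (int j = 0; j < dim; j++) {
                pop[i][j] = -5.0 + 10.0 * rnd.nextDouble(); // assumed range [-5, 5)
            }
            fit[i] = cost.applyAsDouble(pop[i]);
        }
        for (int g = 0; g < generations; g++) {
            for (int i = 0; i < popSize; i++) {
                // Pick three distinct members, all different from i.
                int a, b, c;
                do { a = rnd.nextInt(popSize); } while (a == i);
                do { b = rnd.nextInt(popSize); } while (b == i || b == a);
                do { c = rnd.nextInt(popSize); } while (c == i || c == a || c == b);
                int forced = rnd.nextInt(dim); // at least one coordinate is mutated
                double[] trial = pop[i].clone();
                for (int j = 0; j < dim; j++) {
                    if (j == forced || rnd.nextDouble() < cr) {
                        trial[j] = pop[a][j] + f * (pop[b][j] - pop[c][j]);
                    }
                }
                // Greedy selection: keep the trial only if it is not worse.
                double trialFit = cost.applyAsDouble(trial);
                if (trialFit <= fit[i]) {
                    pop[i] = trial;
                    fit[i] = trialFit;
                }
            }
        }
        int best = 0;
        for (int i = 1; i < popSize; i++) {
            if (fit[i] < fit[best]) best = i;
        }
        return pop[best];
    }

    /** Toy objective: sum of squares, with its minimum at the origin. */
    static double sphere(double[] x) {
        double s = 0.0;
        for (double v : x) s += v * v;
        return s;
    }

    public static void main(String[] args) {
        double[] best = minimize(DifferentialEvolutionSketch::sphere, 8, 20, 200, 42L);
        System.out.println(sphere(best));
    }
}
```

Because each candidate is evaluated independently, the cost evaluations dominate the run time and can be spread over threads, which is consistent with the weak scaling setup described above.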
Fig. 7 Strong scalability of WordCount application implemented in PCJ and APGAS.
denote original fork-join algorithm and algorithm using lifeline-based global load balancing. C++/MPI
results are plotted for reference. Ideal scaling for PCJ and C++/MPI are drawn with the dashed lines. The
results are presented for 3.3 MB file and were obtained using Cray XC40 at ICM. The APGAS results were not
calculated for a number of nodes greater than 1024 due to long startup time exceeding one hour (which can
be attributed to the library’s bootstrapper which—at the time of running the experiments—was not generic
enough to offer a fast startup on a range of systems)
Distributed neural network training
The PCJ library was also tested in workloads specific to modern machine learning applications.
It was successfully integrated with TensorFlow for the distribution of the gradient
descent operation for effective training of neural networks[38], performing very well
against the Python/C/MPI-based state-of-the-art solution, Horovod[39].
For presentation purposes, a simple network consisting of three fully connected layers
(sized 300, 100 and 10 neurons respectively[40]) was trained for handwritten digit
recognition for 20 epochs (i.e. for a fixed number of iterations) on the MNIST dataset[41]
(composed of 60,000 training images, of which 5000 were set aside for validation purposes
in this test), with mini-batches consisting of 50 images. Two algorithms were tested with PCJ.
The first one uses the same general idea for gradient descent calculations as Horovod
(i.e. data-parallel calculations are performed process-wise, and the gradients are subsequently
averaged after each mini-batch). The second one implements asynchronous parallel
gradient descent as described in[42].
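The synchronous, data-parallel scheme described above (each process computes gradients on its own mini-batch, and the gradients are then averaged) can be sketched as follows. The class `GradientAveraging` and the plain-array representation are hypothetical; the actual runner works with TensorFlow tensors and distributes the averaging over processes.

```java
import java.util.Arrays;

/** Illustrative synchronous gradient averaging followed by one SGD step. */
final class GradientAveraging {

    /** Element-wise mean of the gradients computed by the individual workers. */
    static double[] average(double[][] workerGradients) {
        if (workerGradients.length == 0) {
            throw new IllegalArgumentException("no gradients");
        }
        int n = workerGradients[0].length;
        double[] mean = new double[n];
        for (double[] g : workerGradients) {
            if (g.length != n) {
                throw new IllegalArgumentException("gradient sizes differ");
            }
            for (int j = 0; j < n; j++) mean[j] += g[j];
        }
        for (int j = 0; j < n; j++) mean[j] /= workerGradients.length;
        return mean;
    }

    /** Plain gradient descent update applied identically on every worker. */
    static void sgdStep(double[] weights, double[] gradient, double learningRate) {
        for (int j = 0; j < weights.length; j++) {
            weights[j] -= learningRate * gradient[j];
        }
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0};
        double[] mean = average(new double[][] {{1.0, 2.0}, {3.0, 6.0}});
        sgdStep(weights, mean, 0.5);
        System.out.println(Arrays.toString(weights));
    }
}
```

Since every worker applies the same averaged gradient, all model replicas stay identical after each mini-batch; the asynchronous variant relaxes exactly this property.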
Implementation-wise, Horovod works by supplying the user with a simple-to-use Python
package with wrappers and hooks that allow enhancing existing code with distributed
capabilities; MPI is used for interprocess communication. In the case of PCJ, a special
runner was coded in Java with the use of TensorFlow's Java API for the distribution
and instrumentation of training calculations. Relevant changes had to be implemented
in the Python code as well. Our code implements the reduction operation based on the
hypercube allreduce algorithm[43].
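The hypercube allreduce pattern can be illustrated with a single-process simulation in which array positions stand for processes. This is a textbook-style sketch under our own assumptions (a power-of-two process count, one scalar per process, a hypothetical class `HypercubeAllReduce`), not the code of the runner.

```java
import java.util.Arrays;

/** Simulated hypercube (recursive doubling) allreduce with summation. */
final class HypercubeAllReduce {

    /**
     * Takes one local value per process and returns the value each process
     * holds after log2(p) pairwise exchange rounds: the global sum.
     */
    static double[] allReduceSum(double[] local) {
        int p = local.length;
        if (p < 1 || Integer.bitCount(p) != 1) {
            throw new IllegalArgumentException("process count must be a power of two");
        }
        double[] values = local.clone();
        for (int bit = 1; bit < p; bit <<= 1) {
            double[] next = new double[p];
            for (int r = 0; r < p; r++) {
                // Process r and its partner r ^ bit swap partial sums and add them.
                next[r] = values[r] + values[r ^ bit];
            }
            values = next;
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allReduceSum(new double[] {1, 2, 3, 4})));
    }
}
```

Dividing the result by the number of processes yields the averaged gradient, and no separate broadcast is needed because every process already holds the full sum.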
The calculations were performed using the Cray XC40 system at ICM with Python
3.6.1 installed alongside TensorFlow v1.13.0-rc1. Horovod version 0.16.0 was installed with
Python's pip tool. SLURM was used to start distributed calculations, with
one TensorFlow process per node. We used 48 working threads per node.
Results in the strong scalability regime presented in Fig. 9 show that the PCJ implementation
that facilitates asynchronicity is on a par with MPI-based Horovod. In the case of
smaller training data sizes, when a larger number of nodes is used, our implementation
is at a disadvantage in terms of accuracy. This is because the overall calculation time
is small and the communication routines are not able to finish before the threads finish
local training. The data point for 3072 threads (64 nodes) was thus omitted for the asynchronous
case in Fig. 9. Achieving the full performance of Horovod on our cluster was only
possible after using a non-standard configuration for the available TensorFlow installation.
This in turn made it possible to fully exploit inter-node parallelism with the use of the Math Kernel
Library (MKL). TensorFlow for Java, available as a Maven package, did not exhibit the
need for this fine-tuning, as it does not use MKL for computation.
The presented results clearly show that PCJ can be efficiently used for parallelization of
AI workloads. Moreover, the use of the Java language allows for easy integration with existing
applications and frameworks. In this case PCJ allowed for easier deployment of the most
efficient configuration of TensorFlow on an HPC cluster.
Future work
From the very beginning, the PCJ library has been using sockets for transferring data
between nodes. This design was straightforward; however, it precludes the full utilization
of novel communication hardware such as Cray Aries or InfiniBand interconnects.
There is ongoing work to use novel technologies in PCJ. This is especially important for
network-intensive applications. However, we are looking for Java interfaces that can simplify
integration. DiSNI[44] or jVerbs[45] seem to be good choices; however, both
are based on a specific implementation of communication and their usage by the PCJ
library is not easy. There are also attempts to speed up data access in Java using Remote
Direct Memory Access (RDMA) technology[46, 47]. We are investigating how to use it
in the PCJ library.
Another reason for low communication performance is the problem of data copying
during the send and receive process. This cannot be avoided due to the Java design: technologies
based on zero-copy and direct access to the memory do not work in this
case. This is an important issue not only for the PCJ library but for Java in general.
As one of the main principles of the PCJ library is not to depend on any additional
library, PCJ uses the standard Java object serialization mechanism to make a complete
copy of an object. There is ongoing work that would allow using external
serialization or cloning libraries, like Kryo, that could speed up making a copy of data.
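How a complete copy of an object can be made with nothing but the standard Java serialization mechanism is sketched below. The class `SerializationCopy` is hypothetical and only illustrates the general technique; it is not taken from the PCJ sources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Illustrative deep copy through standard Java object serialization. */
final class SerializationCopy {

    /** Serializes the object to a byte buffer and reads an independent copy back. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        int[] copy = deepCopy(data);
        copy[0] = 99; // does not affect the original array
        System.out.println(data[0]);
    }
}
```

The round trip through a byte buffer is what makes the approach dependency-free but comparatively slow, which motivates the interest in faster serialization or cloning libraries.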
The current development of the PCJ library is focused on code execution on
multiple multicore processors. Whilst Cray XC40 is representative of most of the current
TOP500 systems, of which only 20% are equipped with Graphics Processing Units
(GPUs), the peta- and exascale systems are heterogeneous and, in addition to CPU
nodes, contain accelerators such as GPUs, Field-Programmable Gate Arrays (FPGAs),
and others. The PCJ library supports accelerators through JNI mechanisms. In particular,
one can use JCuda to run Compute Unified Device Architecture (CUDA) kernels on the
accelerators. This mechanism has been checked experimentally; the performance results
are in preparation. Similarly, nothing precludes the already existing PCJ-TensorFlow
code from using TensorFlow's GPU exploitation capabilities.
Fig. 8 Performance of the evolution algorithm to search parameters of neural network simulating
connectome of C. Elegans. The performance data for execution with and without hyperthreading is
presented. The benchmarks were run on Cray XC40 at HLRS. Ideal scaling is drawn with the dashed line
The near perspective of exascale systems and a growing number of petascale computers
create strong interest in new, more productive programming tools and paradigms capable
of developing codes for large systems. At the same time, we observe a change in the
type of workloads run on supercomputers. There is a strong interest in running Big Data
processing or Artificial Intelligence applications. Unfortunately, the majority of the new
workloads are not well suited for large computers. They are implemented in languages
like Java or Scala which, up to now, were of little interest to the HPC community.
In this paper, we performed a brief review of the programming languages and programming
paradigms getting attention in the context of HPC, Big Data and AI processing.
We focused on Java as the most widely used programming language and presented
its feasibility for implementing AI and Big Data applications for large scale computers.
As presented in the paper, the PCJ library allows for easy development of highly scalable
parallel applications. Moreover, PCJ shows great promise for the
parallelization of HPC workloads as well as AI and Big Data applications. Example
applications and their scalability and performance have been reported in this paper.
The results presented here, and in previous publications, clearly show the feasibility of
the Java language for implementing parallel applications with a large number of threads. The
PGAS programming model allows for easy implementation of various parallel schemas,
including traditional HPC as well as Big Data and AI, ready to run on peta- and exascale systems.
The proposed solution will open up new possibilities for applications. Java, as the most
popular programming language, is widely used in business applications. The PCJ library
makes it possible, with little effort, to extend an application to include computer simulations, data
analysis and artificial intelligence. The PCJ library makes it easy to develop applications
and run them on a variety of resources, from personal workstations to cloud
resources. The key element is the ease of extending existing applications and the integration
of various types of processing while maintaining the advantages offered by Java.
Fig. 9 Comparison of the distributed training time taken by Horovod and PCJ as measured on Cray XC40 at
ICM. Accuracy of
was achieved
This solution is very important for existing applications and allows for easy and quick
adaptation to the growing demand.
AI: Artificial Intelligence; API: Application Programming Interface; CPU: Central Processing Unit; CUDA: Compute Unified
Device Architecture; FFT: Fast Fourier Transform; FPGA: Field-Programmable Gate Array; GC: Garbage collector; GPU:
Graphics processing unit; HPC: High-performance computing; JIT: Just-in-Time; JVM: Java Virtual Machine; MKL: Math
Kernel Library; MPI: Message-Passing Interface; MPMD: Multiple Programs, Multiple Data; PCJ: Parallel Computing in Java;
PGAS: Partitioned Global Address Space; RDMA: Remote Direct Memory Access; RMI: Remote method invocation; SSH:
Secure shell.
The authors gratefully acknowledge the support of Dr. Alexey Cheptsov and the computer resources and technical
support provided by HLRS. We acknowledge PRACE for awarding us access to resource Hazel Hen based in Germany at
HLRS under a preparatory access project no. 2010PA4009 and HPC-Europa3 program under application no. HPC17J7Y4M.
We acknowledge Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) University of Warsaw for
providing computational resources under grants no. GB65-15, GA69-19.
Authors’ contributions
MN was responsible for the description of the PCJ library and preparing and running a 2D stencil (Game of Life) bench-
mark and interpreting the results as well as was a major contributor in writing the manuscript. ŁG was responsible for
preparing other presented applications, collecting and interpreting their performance results, and writing the descrip-
tion in the manuscript. PB contributions include the original idea, literature review, interpreting the results and writing
conclusions. All authors read and approved the final manuscript.
This research was carried out with the partial support of EuroLab-4-HPC as cross-site collaboration grant and HPC-
Europa3 program (application no. HPC17J7Y4M) for visiting HLRS, Germany.
Availability of data and materials
All the source codes are included on the websites listed in the References section.
Competing interests
All authors declare that they have no competing interests.
Author details
1 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, ul. Chopina 12/18,
87-100 Toruń, Poland. 2 Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, ul.
Tyniecka 15/17, 02-630 Warsaw, Poland.
Received: 29 January 2021 Accepted: 13 April 2021
1. Hadjidoukas P, Bartezzaghi A, Scheidegger F, Istrate R, Bekas C, Malossi A. torcpy: Supporting task parallelism in
Python. SoftwareX. 2020;12:100517.
2. Nowicki M, Bała P. Parallel computations in Java with PCJ library. In: 2012 International Conference on High Perfor-
mance Computing & Simulation (HPCS). IEEE; 2012. p. 381–387.
3. Almasi G. PGAS (Partitioned Global Address Space) Languages. In: Padua D, editor. Encyclopedia of Parallel Comput-
ing. Boston: Springer; 2011. p. 1539–45.
4. HPC Challenge Awards Competition. Awards: Class 2. 2014. http://www.hpcchallenge.org/custom/index.html?lid=103&slid=272.
Accessed 29 Jan 2021.
5. Nowicki M, Ryczkowska M, Górski Ł, Bała P. Big Data Analytics in Java with PCJ Library: Performance Comparison
with Hadoop. In: International Conference on Parallel Processing and Applied Mathematics. Springer; 2017. p.
6. Nowicki M. Comparison of sort algorithms in Hadoop and PCJ. J Big Data. 2020;7:1.
https://doi.org/10.1186%2Fs40537-020-00376-9
7. Liang J, Qu B, Suganthan P. Problem definitions and evaluation criteria for the CEC 2014 special session and
competition on single objective real-parameter numerical optimization. Computational Intelligence Laboratory,
Zhengzhou University, Zhengzhou China and Technical Report, Nanyang Technological University, Singapore. 2013.
8. Nowicki M, Bzhalava D, Bała P. Massively Parallel Sequence Alignment with BLAST Through Work Distribution Imple-
mented Using PCJ Library. In: International Conference on Algorithms and Architectures for Parallel Processing.
Springer; 2017. p. 503–512.
9. Nowicki M, Bzhalava D, Bała P. Massively parallel implementation of sequence alignment with basic local alignment
search tool using parallel computing in java library. J Comput Biol. 2018;25(8):871–81.
10. Nowicki M, Górski Ł, Bała P. Performance evaluation of parallel computing and Big Data processing with Java and
PCJ library. Cray Users Group. 2018.
11. Vincent P, Witherden F, Vermeire B, Park JS, Iyer A. Towards green aviation with python at petascale. In: Proceedings
of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press;
2016. p. 1.
12. Johnson N. Python import scaling; 2014. http://www.archer.ac.uk/documentation/white-papers/dynamic-import/ARCHER_wp_dynamic-import.pdf.
Accessed 29 Jan 2021.
13. Kincade K. Celeste: A New Model for Cataloging the Universe; 2015.
https://newscenter.lbl.gov/2015/09/09/celeste-a-new-model-for-cataloging-the-universe/. Accessed 29 Jan 2021.
14. Carpenter B, Getov V, Judd G, Skjellum A, Fox G. MPJ: MPI-like message passing for Java. Concurrency.
15. Vega-Gisbert O, Roman JE, Squyres JM. Design and implementation of Java bindings in Open MPI. Parallel Comput.
16. Bonér J, Kuleshov E. Clustering the Java virtual machine using aspect-oriented programming. In: AOSD’07: Proceed-
ings of the 6th International Conference on Aspect-Oriented Software Development; 2007.
17. Nester C, Philippsen M, Haumacher B. A more efficient RMI for Java. In: Java Grande. vol. 99; 1999. p. 152–159.
18. Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, et al. Titanium: a high-performance Java dialect.
Concurr Comput. 1998;10(11–13):825–36.
19. Feind K. Shared memory access (SHMEM) routines. Cray Research. 1995.
20. Nieplocha J, Harrison RJ, Littlefield RJ. Global arrays: A nonuniform memory access programming model for high-
performance computers. J Supercomput. 1996;10(2):169–89.
21. Kale LV, Zheng G. 13. In: Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects. Wiley; 2009. p.
265–282. https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470558027.ch13.
22. Carlson WW, Draper JM, Culler DE, Yelick K, Brooks E, Warren K. Introduction to UPC and language specification.
Technical Report CCS-TR-99-157, IDA Center for Computing Sciences; 1999.
23. Reid J. The new features of Fortran 2008. In: ACM SIGPLAN Fortran Forum. vol. 27. ACM; 2008. p. 8–21.
24. Chamberlain BL, Callahan D, Zima HP. Parallel Programmability and the Chapel Language. Int J High Perf Comput
Appl. 2007;21(3):291–312.
25. Posner J, Reitz L, Fohry C. Comparison of the HPC and Big Data Java Libraries Spark, PCJ and APGAS. In: 2018 IEEE/ACM
Parallel Applications Workshop, Alternatives To MPI (PAW-ATM). IEEE; 2018. p. 11–22.
26. Nowicki M, Górski Ł, Bała P. PCJ–Java Library for Highly Scalable HPC and Big Data Processing. In: 2018 International
Conference on High Performance Computing & Simulation (HPCS). IEEE; 2018. p. 12–20.
27. Nowicki M, Górski Ł, Bała P. Evaluation of the Parallel Performance of the Java and PCJ on the Intel KNL Based Sys-
tems. In: International Conference on Parallel Processing and Applied Mathematics. Springer; 2017. p. 288–297.
28. Trader T. It’s Official: Aurora on Track to Be First US Exascale Computer in 2021. HPC Wire. 2019;(March 18).
29. Gardener M. Mathematical Games: The fantastic combinations of John Conway’s new solitaire game “Life’’. Sci Am.
30. PCJ implementations of the Game of Life benchmark.
https://github.com/hpdcj/PCJ-examples/blob/3abf32f808fa05af2b7f1cfd0b21bd6c5efc1339/src/org/pcj/examples/GameOfLife.java. Accessed 29 Jan 2021.
31. Takahashi D, Kanada Y. High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-mem-
ory parallel computers. J Supercomput. 2000;15(2):207–28.
32. Luszczek PR, Bailey DH, Dongarra JJ, Kepner J, Lucas RF, Rabenseifner R, et al. The HPC Challenge (HPCC) Benchmark
Suite. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. SC ’06. New York, NY, USA: ACM; 2006.
33. Mellor-Crummey J, Adhianto L, Jin G, Krentel M, Murthy K, Scherer W, et al.. Class II submission to the HPC Challenge
award competition Coarray Fortran 2.0. Citeseer.
34. PCJ implementations of the FFT benchmark.
https://github.com/hpdcj/hpc-challenge-fft/tree/ebd557e40ad50f614a869000321ee822b67d2623. Accessed 29 Jan 2021.
35. PCJ implementations of the WordCount application.
https://github.com/hpdcj/wordcount/tree/6a265bc92147a89c37176692ccae8dcf8d97df72. Accessed 29 Jan 2021.
36. Rakowski F, Karbowski J. Optimal synaptic signaling connectome for locomotory behavior in Caenorhabditis
elegans: Design minimizing energy cost. PLoS Comput Biol. 2017;13(11):e1005834.
37. PCJ implementations of the modeling the connectome of C. Elegans application.
https://github.com/hpdcj/evolutionary-algorithm/tree/602467a7947fd3da946f70fd2fae646e2f1500da. Accessed 29 Jan 2021.
38. PCJ implementations of the distributed neural network training application.
https://github.com/hpdcj/mnist-tf/tree/77fa143e2aa3b83294a8fc607b382c518d4396d7/java-mnist. Accessed 29 Jan 2021.
39. Sergeev A, Balso MD. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint
arXiv:1802.05799. 2018.
40. Géron A. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build
intelligent systems. O’Reilly Media, Inc.; 2017.
41. LeCun Y, Cortes C, Burges CJC. The MNIST Database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Accessed 17 Mar 2021.
42. Keuper J, Pfreundt FJ. Asynchronous parallel stochastic gradient descent: A numeric core for scalable distributed
machine learning algorithms. In: Proceedings of the Workshop on Machine Learning in High-Performance Comput-
ing Environments. ACM; 2015. p. 1.
43. Grama A, Kumar V, Gupta A, Karypis G. Introduction to parallel computing. Pearson Education; 2003.
44. IBMCode. Direct Storage and Networking Interface (DiSNI); 2018.
https://developer.ibm.com/technologies/analytics/projects/direct-storage-and-networking-interface-disni/. Accessed 29 Jan 2021.
45. IBM. The jVerbs library; 2012.
https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.80.doc/docs/rdma_jverbs.html. Accessed 29 Jan 2021.
46. Biswas R, Lu X, Panda DK. Accelerating TensorFlow with Adaptive RDMA-Based gRPC. In: 2018 IEEE 25th International
Conference on High Performance Computing (HiPC). IEEE; 2018. p. 2–11.
47. Lu X, Shankar D, Panda DK. Scalable and distributed key-value store-based data management using RDMA-Mem-
cached. IEEE Data Eng Bull. 2017;40(1):50–61.
48. Nowicki M, Ryczkowska M, Górski Ł, Szynkiewicz M, Bała P. PCJ-a Java library for heterogenous parallel computing.
Recent Advances in Information Science (Recent Advances in Computer Engineering Series vol 36), WSEAS Press.
2016;p. 66–72.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.