PCJ - a Java library for heterogeneous parallel computing
MAREK NOWICKI
N. Copernicus University
Faculty of Mathematics and Computer Science
Chopina 12, 87-100 Toruń
POLAND
faramir@mat.umk.pl

MAGDALENA RYCZKOWSKA
N. Copernicus University
Faculty of Mathematics and Computer Science
Chopina 12, 87-100 Toruń
POLAND
gdama@mat.umk.pl

ŁUKASZ GÓRSKI
N. Copernicus University
Faculty of Mathematics and Computer Science
Chopina 12, 87-100 Toruń
POLAND
lgorski@mat.umk.pl

MICHAŁ SZYNKIEWICZ
N. Copernicus University
Faculty of Mathematics and Computer Science
Chopina 12, 87-100 Toruń
POLAND
szynkiewicz@mat.umk.pl

PIOTR BAŁA
University of Warsaw
Interdisciplinary Centre for Mathematical and Computational Modeling
Pawińskiego 5a, 02-106 Warszawa
POLAND
bala@icm.edu.pl
Abstract: With the wide adoption of multicore and multiprocessor systems, parallel programming has become a very important element of computer science. Programming multicore systems is still complicated and far from easy. The difficulties are caused, amongst others, by parallel tools, libraries and programming models which are not easy to use, especially for an inexperienced programmer. In this paper, we present PCJ - a Java library for parallel programming of heterogeneous multicore systems. PCJ adopts the Partitioned Global Address Space paradigm, which makes programming easy. We present the basic functionality of the PCJ library and its usage for the parallelization of selected applications. The scalability of a genetic algorithm implementation is presented. The parallelization of an N-body algorithm implementation with PCJ is also described.
Key-Words: Parallel computing, Java, PGAS
1 Introduction
Despite the wide adoption of multicore and multiprocessor systems, parallel programming is still not an easy task. The parallelization of a problem has to be performed at the algorithmic level, therefore the use of automatic tools is not possible. Parallel algorithms are not easy to develop and require computer science knowledge in addition to the domain expertise. Once a parallel algorithm is developed, it has to be implemented using suitable parallel programming tools. This task is also not trivial. The difficulties are caused, amongst others, by the parallel tools, libraries and programming models. The message passing model is difficult; the shared memory model is easier to learn, but writing codes which scale well is not easy. Others, like MapReduce, are suitable for only a certain class of problems. Finally, traditional languages such as FORTRAN and C/C++ are losing popularity compared to new ones such as Java, Scala, Python and many others.
There is also considerable potential in the PGAS languages [1], but they are not widely popularized. Most implementations are still based on C or FORTRAN, and there is a lack of widely adopted solutions for emerging languages such as Java. The PGAS programming model allows for efficient implementation of parallel algorithms.
2 PCJ Library
PCJ is a library [2, 3, 4, 5] for the Java language that helps to perform parallel and distributed calculations. It is able to work on multicore systems with typical interconnects such as Ethernet or InfiniBand, providing users with a uniform view across nodes. The library is open source (BSD license) and its source code is available at GitHub.
PCJ implements the partitioned global address space model and was inspired by languages like Co-Array Fortran [6], Unified Parallel C [7] and Titanium [11].
Figure 1: Schematic view of the PCJ computing
model. Arrows denote possible communication using
shared variables and put() and get() methods.
We put emphasis on compliance with Java standards. In contrast to the listed languages, PCJ does not extend or modify the language syntax. The programmer does not have to use additional libraries which are not part of the standard Java distribution.
In PCJ, as presented in Figure 1, each task (PCJ thread) has its own local memory and executes its own set of instructions. Variables and instructions are private to the task. Each task can access other tasks' variables that carry the special annotation @Shared. The library provides methods to perform basic operations like synchronization of tasks and getting and putting values in an asynchronous, one-sided way.
The library offers methods for creating groups of tasks, broadcasting, and monitoring variables. The PCJ library fully complies with Java standards. In particular, PCJ can use the Sockets Direct Protocol (SDP), implemented in Java SE 7, which increases network performance over InfiniBand connections.
An application using the PCJ library is run as a typical Java application using the Java Virtual Machine (JVM). In a multinode environment, one (or more) JVM has to be started on each node. The PCJ library takes care of this process and allows a user to start execution on multiple nodes, running multiple threads on each node. The number of nodes and threads can be easily configured.
One instance of the JVM is understood as a PCJ node. In principle, it can run on a single (physical) multicore node. One PCJ node can hold many tasks (PCJ threads). This design is aligned with novel computer architectures containing hundreds or thousands of nodes, each of them built of several or even more cores.
Since a PCJ application is not necessarily running within a single JVM, the communication between different threads has to be realized in different manners. If communicating threads run within the same JVM, the Java concurrency mechanisms are used to synchronize and exchange information. If data exchange has to be realized between different JVMs, network communication using, for example, sockets has to be used.
3 PCJ details
The basic primitives of the PGAS programming paradigm offered by the PCJ library are listed below; each may be executed over all the threads of execution or only over a subset forming a group (a minimal usage sketch follows the list):

get(int threadId, String name) - reads a shared variable (tagged by name) published by another thread identified by threadId; both synchronous and asynchronous reads (the latter with a FutureObject) are supported;

put(int threadId, String name, T newValue) - dual to get, writes to a shared variable (tagged by name) owned by the thread identified by threadId; the operation is non-blocking and may return before the target variable is updated;

barrier() - blocks the threads until all of them pass the synchronization point in the program; a two-point version of barrier that synchronizes only two selected threads is also supported;

broadcast(String name, T newValue) - broadcasts newValue and writes it to each thread's shared variable tagged by name;

waitFor(String name) - due to the asynchronicity of the communication primitives, a measure that allows one thread to block until another thread changes one of its shared variables (tagged with name).
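To illustrate these primitives, the following fragment is a minimal usage sketch (not code from the paper), assuming two PCJ threads and a shared variable a declared with the @Shared annotation as in Listing 1:

    // executed by every PCJ thread; thread 1 reads from thread 0
    if (PCJ.myId() == 1) {
        double value = PCJ.get(0, "a");  // synchronous read, blocks until the value arrives
        System.out.println("a@0 = " + value);
    }
    PCJ.barrier();  // all threads meet at the synchronization point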
The presented PCJ methods allow one to implement complicated parallel algorithms. The PCJ library does not provide constructs for automatic data distribution; this task has to be performed by the programmer. This makes it possible to design the data and work distribution aligned with the parallel algorithm, which is necessary to obtain an efficient and scalable implementation.
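As an illustration, the following fragment is a minimal sketch (not from the paper; n and process(i) are hypothetical) of the round-robin, owner-computes pattern that also appears later in Listing 5:

    int rank = PCJ.myId();
    int size = PCJ.threadCount();
    // each PCJ thread handles every size-th element of the iteration space
    for (int i = rank; i < n; i += size) {
        process(i);  // hypothetical per-element work
    }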
Below we present the most important implementation details of the basic PCJ functionality.
3.1 Node numbering
In PCJ, there is one node called the Manager. It is responsible for assigning unique identifiers to the tasks, sending messages to other tasks to start calculations,
creating groups and synchronizing all tasks in the calculations. The Manager node has its own tasks and can execute parallel programs.
The Manager is the master of the group of all tasks and has the group identifier equal to 0. Each node has its own identifier, unique for the whole calculation, called the physical id (or node id for short). All nodes are connected to each other, and these connections are established before starting a calculation. At this stage, nodes exchange their physical node ids.
At the beginning, a user who wants to start using PCJ for parallel execution has to execute the static method PCJ.start(), providing information about the requested StartPoint and Storage classes and the list of nodes. The list of nodes is used to number the PCJ nodes and PCJ threads. Every PCJ node processes the list to locate the items that contain its hostname; the item numbers will be used to number the PCJ threads.
There is a special node, called node0, that coordinates the other nodes during startup. Node0 is the node located as the first item on the list. After processing the list, each node connects to node0 and reports the item numbers from the list that contain its hostname. When node0 receives information about every node from the list, it numbers the nodes starting from 0, increasing the number by one for each distinct node; this number is called the physicalId. Node0 responds to all other nodes with their physicalId.
At this point, every node is connected with node0 and knows its physicalId. The next step is to exchange information between nodes and to connect every node with each other. To do that, node0 broadcasts information about each node. The broadcast is made using a balanced tree structure, in which each node has at most two children. At the beginning of the operation, the tree has only one vertex: node0, the root. The broadcast message contains information about the new node in the tree: its physicalId, parent physicalId, threadIds and hostname.
When a node receives that data, it sends it down the tree, saves the information about the new node and, if it is the parent of the new node, adds it as its own child. After that, the node connects to the new node and sends information about itself (physicalId and threadIds). At the end, when the new node has received information from all nodes with a physical id less than its own, it sends information to node0, which completes the initialization step.
When all nodes have reported completion of the initialization step, node0 sends a message to start the user application. Each node then starts the adequate number of PCJ threads using the provided StartPoint class.
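As a minimal sketch of this startup procedure from the application side (not code from the paper; the class name and hostnames are illustrative, and a repeated hostname stands for multiple PCJ threads on one node), the user provides a StartPoint implementation and a Storage subclass to PCJ.start():

    import org.pcj.PCJ;
    import org.pcj.StartPoint;
    import org.pcj.Storage;

    public class HelloPcj extends Storage implements StartPoint {

        @Override
        public void main() {
            // executed by every PCJ thread after the startup described above
            System.out.println("Hello from PCJ thread " + PCJ.myId()
                    + " of " + PCJ.threadCount());
        }

        public static void main(String[] args) {
            // the first entry becomes node0; duplicates add threads on a node
            String[] nodes = {"localhost", "localhost"};
            PCJ.start(HelloPcj.class, HelloPcj.class, nodes);
        }
    }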
3.2 Communication
The communication between different PCJ threads has to be realized in different manners. If communicating threads run within the same JVM, the Java concurrency mechanisms can be used to synchronize and exchange information. If data exchange has to be realized between different JVMs, network communication using, for example, sockets has to be used.
The PCJ library handles both situations, hiding the details from the user. It distinguishes between inter- and intranode communication and picks the proper data exchange mechanism. Moreover, the nodes are organized in a graph which allows global communication to be optimized.
The communication between tasks running on the same JVM is performed using Java methods for thread synchronization. One should note that from the PCJ user's point of view both mechanisms are transparent: which mechanism is used depends on the task ids involved in the communication.
PCJ uses the TCP/IP protocol for the connections. The TCP protocol was chosen because of its features: it gives a reliable and ordered way of transmitting data, with an error-checking mechanism, over an IP network. Of course, it has some drawbacks, especially associated with performance, because TCP is optimized for accurate rather than timely delivery. Usage of other protocols, like UDP, would require additional work to implement the required features: ordering out-of-order messages and retransmission of lost or corrupted messages.
The network communication takes place between nodes and is performed using the Java New IO classes (java.nio.*). There is one thread per node for receiving incoming data and another one for processing messages. The communication is nonblocking and uses a 256 KB buffer by default [3]. The buffer size can be changed using a dedicated JVM parameter.
PCJ threads can exchange data in an asynchronous way. Sending a value to another task's storage is performed using the put method, as presented in Listing 1. Since the data transfer is asynchronous, the put method is accompanied by a waitFor statement executed by the PCJ thread receiving the data. The get method is used for getting a value from another task's storage. In these two methods, the other task is not blocked while a value is put or got; only the task which initiated the exchange blocks. There is also the getFutureObject method that works in a fully nonblocking manner: the initiating task can check whether the response has been received and in the meantime do other calculations.
@Shared
double a;

double c = 10.0;
if (PCJ.myId() == i) {
    PCJ.put(j, "a", c);
}
if (PCJ.myId() == j) {
    PCJ.waitFor("a");
}
Listing 1: Example use of the PCJ put method. The value of the variable c from PCJ thread i is sent to thread j and stored in the shared variable a.
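For the fully nonblocking variant, the following sketch (not code from the paper) shows how getFutureObject could overlap communication with computation, assuming it mirrors the argument order of get and returns a FutureObject with isDone() and get() methods; doOtherWork() is a hypothetical local computation:

    FutureObject<Double> future = PCJ.getFutureObject(j, "a");
    while (!future.isDone()) {
        doOtherWork();  // overlap the transfer with local computation
    }
    double value = future.get();  // completes immediately once isDone() is true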
3.3 Broadcast
Broadcasting is very similar to the put operation. The source PCJ thread serializes the value to broadcast and sends it to node0. Node0 uses a tree structure to broadcast the message to all nodes. After a node receives the message, it is sent down the tree, deserialized and stored in the specified variable of all PCJ thread storages. An example use of the broadcast is presented in Listing 2. Please note that broadcast is asynchronous.
@Shared
double a;

double c = 10.0;
if (PCJ.myId() == 0) {
    PCJ.broadcast("a", c);
}

Listing 2: Example use of the PCJ broadcast. The value of the variable c from PCJ thread 0 is broadcast to all nodes and stored in the shared variable a.
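Because the broadcast is asynchronous, a receiving thread that needs the new value immediately can guard the access with waitFor; a minimal sketch (not from the paper) continuing Listing 2:

    PCJ.waitFor("a");                     // block until the broadcast value arrives
    double received = PCJ.getLocal("a");  // local read of the updated shared variable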
3.4 Synchronization
Synchronization is organized as follows: each task sends a proper message to the group master. When every task has sent its synchronization message, the group master sends an adequate message to all tasks, using the binary tree structure.
PCJ.barrier();
Listing 3: Example use of the PCJ synchronization of
the execution performed by all PCJ threads.
The synchronization of two PCJ threads is a slightly more advanced functionality. Two threads, on the same node or on different nodes, can synchronize their execution as follows: one PCJ thread sends a message to another and waits for the same message to come back. If the message arrives before the thread has even started to wait, the execution is not suspended at all.
if (PCJ.myId() == 0) {
    PCJ.barrier(5);
}
Listing 4: Example use of the PCJ two-point synchronization between PCJ threads 0 and 5.
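Since the two-point barrier synchronizes a pair of threads, the matching call presumably has to be issued by thread 5 as well; a sketch (not from the paper) of both sides:

    if (PCJ.myId() == 0) {
        PCJ.barrier(5);  // wait for thread 5
    } else if (PCJ.myId() == 5) {
        PCJ.barrier(0);  // wait for thread 0
    }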
3.5 Fault tolerance
The PCJ library also provides basic resilience mechanisms. The resilience extensions provide the programmer with basic functionality which allows node failures to be detected. For this purpose, the Java exception mechanism is used. It allows detection of execution problems associated with internode communication and presents them to the programmer. The programmer can take proper actions to continue program execution. The detailed solution (algorithm) for recovering from a failure has to be decided and implemented by the programmer.
The fault-tolerance implementation relies on the assumption that node 0 never dies, which is a reasonable compromise since node 0 is the place where execution control is performed. The probability of its failure is much smaller than the probability of failure of one of the other nodes and can be neglected here.
The support for fault tolerance introduces up to 10% overhead when threads are communicating heavily. When a node fails, node 0 waits for the heartbeat message from that node, and if it does not arrive, it assumes that the node is dead.
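As the text only states that the Java exception mechanism is used, the following is a purely hypothetical sketch of how a communication call could be guarded; the exception type and the recovery routine are illustrative, not part of the documented PCJ API:

    try {
        PCJ.put(peerId, "data", value);
    } catch (RuntimeException e) {
        // a node failure surfaced as an exception; the recovery strategy
        // is application-specific and left to the programmer
        recoverFrom(peerId, e);
    }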
4 Related work
There are several projects that aim to enhance Java's parallel processing capabilities. These include Parallel Java [8], the Java Grande project [9, 10] (though they have not gained wider adoption), Titanium [11] and ProActive [12]. Newer developments include the parallel stream implementation included in the new version of the Java distribution [13]. Most of the mentioned solutions introduce extensions to the language. This requires preprocessing of the code, which causes delays in adapting to changes in Java. Moreover, some of the solutions are restricted to a single JVM; therefore they can run only on a single physical node and do not scale to a large number of cores. ProActive, which allows an application to run on a relatively large number of cores, suffers from performance deficiencies due to inefficient serialization mechanisms.
Figure 2: The performance of the differential evolution code implemented using the PCJ library. The ideal scaling is presented as the dotted line.
An extensive description of the related solutions, together with a performance comparison, can be found elsewhere [14].
5 PCJ examples
The PCJ library has been successfully used to parallelize a number of applications, including typical HPC benchmarks [15], receiving an HPC Challenge Award at the recent Supercomputing Conference (SC 2014). Some examples can be viewed at [3].
Recently, PCJ has been used to parallelize the problem of traversing large graphs. In particular, we have implemented the Graph500 benchmark and evaluated its performance. The obtained results were compared to the standard MPI implementation of Graph500, showing similar scalability [16].
Another example is the parallelization of differential evolution, applied to example mathematical functions as well as to fine-tuning the parameters of a model of the nematode C. elegans connectome. The results have shown that good scalability and performance were achieved with relatively simple and easy to develop code. A simple parallelization based on equal job distribution amongst PCJ threads was not enough, since the execution time of the iterations performed by different threads varies. Therefore the code has been extended by work load equalization implemented using the PCJ library. As a result, scaling close to the ideal up to a thousand cores was achieved, reducing simulation time from days to minutes [17] (see Fig. 2).
In this paper, we also present the performance of the MolDyn benchmark from the Java Grande Benchmark Suite implemented using the PCJ library. It performs a simple N-body calculation which involves computing the motion of a number of particles (defined by a position, velocity, mass and possibly a shape). These particles move according to Newton's laws of motion and attract or repel each other according to a potential function.

Figure 3: The performance of the MolDyn benchmark implemented using the PCJ library. The ideal scaling is presented as the dotted line.
The force acting on each particle is calculated as the sum of the forces the other particles impart on it. The total force on each particle is computed, and then a suitable numerical integration method is applied to calculate the change in velocity and position of each particle over a discrete time step.
The All-Pairs method is the simplest algorithm for calculating the forces. It is an O(N²) algorithm, as for N particles the total acceleration on each particle requires O(N) calculations. This method is simple to implement, but it is limited by the quadratic computational complexity of the algorithm.
/* move the particles and update velocities */
for (i = 0; i < mdsize; i++) {
    one[i].domove(side);
}

/* compute forces */
rank = PCJ.myId();
nprocess = PCJ.threadCount();

for (i = rank; i < mdsize; i += nprocess) {
    one[i].force(side, rcoff, mdsize, i);
}

Listing 5: PCJ Java implementation of the MolDyn benchmark: the code for the movement of the particles and the computation of the forces.
In the Java Grande Benchmark implementation, the atoms' information is replicated on all threads, and almost all operations are performed by every thread. The only parallelized part of the code is the force calculation, as presented in Listing 5. Each PCJ thread computes the forces on a subset of the particles (every PCJ.threadCount()-th atom).
The calculated partial forces have to be summed up over all threads. This task is performed by sending the calculated forces to PCJ thread 0 and then summing them up. The communication is performed in an asynchronous way and is overlapped with the calculation of the forces. Then the result is broadcast to all PCJ threads (see Listing 6) and used to calculate the new positions. The broadcast statement is executed once all forces have been gathered at PCJ thread 0; therefore a synchronization statement can be omitted.
if (PCJ.myId() != 0) {
    PCJ.put(0, "r_xforce", tmp_xforce, PCJ.myId());
} else {
    PCJ.waitFor("r_xforce", PCJ.threadCount() - 1);

    double[][] r_xforce = PCJ.getLocal("r_xforce");

    for (int node = 1; node < PCJ.threadCount(); ++node) {
        for (i = 0; i < mdsize; ++i) {
            tmp_xforce[i] += r_xforce[node][i];
        }
    }
    PCJ.broadcast("tmp_xforce", tmp_xforce);
}

Listing 6: The code to gather the forces calculated on the different PCJ threads, sum them up and distribute the result to all PCJ threads. All instructions are repeated for all dimensions x, y, z (not shown here).
The simulation has been performed for N = 442 368 particles interacting with the Lennard-Jones potential. Periodic boundary conditions were applied and no cut-off was used. The experiments were run on a PC cluster consisting of 64 computing nodes based on the Intel Xeon E5-2697 v3 CPU (28 cores each) with an InfiniBand interconnect. Each processor is clocked at 2.6 GHz. Every processing node has at least 64 GB of memory. The nodes are connected with InfiniBand FDR and with 1 Gb Ethernet. PCJ was run using Oracle's JVM v. 1.8.0. The calculations were performed using double precision floating point arithmetic.
As presented in Fig. 3, the PCJ implementation scales well up to 32 cores; for a higher number of cores the communication cost starts to dominate. For a larger number of cores, the calculation of the forces takes less time, as the work is proportional to the number of atoms allocated to the particular PCJ thread, so the communication becomes relatively more significant. One should note that the scalability of the PCJ implementation is similar to that of the original code using MPI for the communication. The resulting code is simple and contains fewer lines of parallel primitives.
6 Conclusions and future work
The obtained results show good performance and good scalability of the benchmarks and applications implemented in Java with the PCJ library. The resulting code is simple and usually contains fewer lines of code than other solutions. This is obtained thanks to the PGAS programming model and the one-sided asynchronous communication implemented in PCJ. Therefore, the parallelization is easier than in other programming models. It also allows for easy and fast parallelization of any data-intensive processing. In this case, the parallelization can be obtained by developing simple code responsible for the data distribution. The data-intensive part can be performed using existing code or even existing applications.
The communication and synchronization cost is comparable to that of other implementations such as MPI, resulting in good performance and scalability.
The PCJ library provides additional features such as support for resilience. The support for GPUs through JCuda [18] is currently under test and will be available soon.
All these features make PCJ a very promising tool for the parallelization of large-scale applications on multicore heterogeneous systems.
Acknowledgments
This work has been performed using the PL-Grid infrastructure. Partial support from the CHIST-ERA consortium and the OCEAN project is acknowledged.
References:
[1] D. Mallón, G. Taboada, C. Teijeiro, J. Touriño, B. Fraguela, A. Gómez, R. Doallo, J. Mouriño. Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures. In: M. Ropo, J. Westerholm, J. Dongarra (Eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface (Lecture Notes in Computer Science 5759), Springer Berlin / Heidelberg 2009, pp. 174-184
[2] http://pcj.icm.edu.pl Accessed: 20.11.2015.
[3] M. Nowicki, P. Bała. Parallel computations in Java with PCJ library. In: W. W. Smari and V. Zeljkovic (Eds.) 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2012, pp. 381-387
[4] M. Nowicki, P. Bała. PCJ - new approach for parallel computations in Java. In: P. Manninen, P. Öster (Eds.) Applied Parallel and Scientific Computing (LNCS 7782), Springer, Heidelberg 2013, pp. 115-125
[5] M. Nowicki, Ł. Górski, P. Grabarczyk, P. Bała. PCJ - Java library for high performance computing in PGAS model. In: W. W. Smari and V. Zeljkovic (Eds.) 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014, pp. 202-209
[6] R. W. Numrich, J. Reid. Co-array Fortran for parallel programming. ACM SIGPLAN Fortran Forum 17(2), pp. 1-31, 1998
[7] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, K. Warren. Introduction to UPC and Language Specification. IDA Center for Computing Sciences 1999
[8] A. Kaminsky. Parallel Java: A unified API for shared memory and cluster parallel programming in 100% Java. In: Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007), IEEE International, pp. 1-8, IEEE 2007
[9] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty and R. A. Davey. A Benchmark Suite for High Performance Java. Concurrency: Practice and Experience, 12, 375-388, 2000
[10] Java Grande Project: benchmark suite. https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/java-grande-benchmark-suite Accessed: 19.11.2015.
[11] P. Hilfinger, D. Bonachea, K. Datta, D. Gay, S. Graham, B. Liblit, G. Pike, J. Su and K. Yelick. Titanium Language Reference Manual. U.C. Berkeley Tech Report UCB/EECS-2005-15, 2005. http://titanium.cs.berkeley.edu/papers/EECS-2005-15.pdf Accessed: 3.11.2015
[12] D. Caromel, Ch. Delbé, A. Di Costanzo, M. Leyton, et al. ProActive: an integrated platform for programming and running applications on grids and P2P systems. Computational Methods in Science and Technology, 12, 2006
[13] http://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html Accessed: 2.11.2015.
[14] M. Nowicki. Opracowanie nowych metod programowania równoległego w Javie w oparciu o paradygmat PGAS (Partitioned Global Address Space). PhD thesis, University of Warsaw 2015. http://ssdnm.mimuw.edu.pl/pliki/prace-studentow/st/pliki/marek-nowicki-d.pdf Accessed: 20.11.2015.
[15] P. Luszczek, D. Bailey, J. Dongarra, J. Kepner, R. Lucas, R. Rabenseifner, D. Takahashi. The HPC Challenge (HPCC) Benchmark Suite. SC06 Conference Tutorial, IEEE, Tampa, Florida, November 12, 2006
[16] M. Ryczkowska, M. Nowicki, P. Bała. The Performance Evaluation of the Java Implementation of Graph500. In: R. Wyrzykowski (Ed.) PPAM 2015, Lecture Notes in Computer Science (in press)
[17] Ł. Górski, F. Rakowski, P. Bała. Parallel differential evolution in the PGAS programming model implemented with the PCJ Java library. In: R. Wyrzykowski (Ed.) PPAM 2015, Lecture Notes in Computer Science (in press)
[18] Y. Yan, M. Grossman, V. Sarkar. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In: Euro-Par 2009 Parallel Processing, Springer Berlin Heidelberg 2009, pp. 887-899