Comparison ofsort algorithms inHadoop
andPCJ
Marek Nowicki*
Introduction
Sorting is one of the most fundamental algorithmic problems found in a wide range of fields. One of them is data science, an interdisciplinary field focused on extracting useful knowledge from huge amounts of data using methods and tools for data processing and analysis. The basic metrics for data analysis include minimum, maximum, median, and top-K values. It is easy to write simple O(n) algorithms that do not use sorting to calculate the first three of those metrics, but finding the median value and its variants requires more work. Of course, Hoare's quickselect algorithm [1] can be used for finding the median of unsorted data, but its worst-case time complexity is O(n^2). Moreover, some algorithms, like binary search, require data to be sorted before execution. Sequential implementations of sorting algorithms have been studied for decades. Nowadays, it becomes more and more important to use parallel computing, and this is where an efficient parallel implementation of sorting is necessary. Existing O(n) sorting algorithms, also adapted for parallel execution, like count sort [2, 3] or radix sort [4, 5], require a specific input data structure, which limits their application to more general cases.
Abstract
Sorting algorithms are among the most commonly used algorithms in computer science and modern software. Having an efficient implementation of sorting is necessary for a wide spectrum of scientific applications. This paper describes a sorting algorithm written using the partitioned global address space (PGAS) model, implemented using the Parallel Computing in Java (PCJ) library. The iterative implementation description is used to outline the possible performance issues and provide means to resolve them. The key idea of the implementation is to have an efficient building block that can be easily integrated into many application codes. This paper also presents a performance comparison of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison serves to show that the performance of the implementation is good enough, as the PCJ implementation shows similar efficiency to the Hadoop implementation.
Keywords: Parallel computing, MapReduce, Partitioned global address space, PGAS,
Java, PCJ, Sorting, TeraSort
Nowicki J Big Data (2020) 7:101
https://doi.org/10.1186/s40537-020-00376-9
*Correspondence: faramir@mat.umk.pl
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Chopina 12/18, 87-100 Toruń, Poland
Processing huge amounts of data, also called Big Data processing, is common in data science applications. Using Java, or more generally a solution based on the Java Virtual Machine (JVM), for Big Data processing is a well-known and established approach. MapReduce [6] and its implementation as a part of the Apache Hadoop framework is a good example. However, it has some limitations, like disk-based communication, which causes performance problems in iterative computations, although it works well with the one-pass jobs it was designed for. Another good example of a Java framework is Apache Spark [7], which overcomes the Hadoop iterative processing limitations and, instead of using disk for storing data between steps, uses in-memory caching to increase performance. There are other data processing engines, like Apache Flink [8], that are designed to process data streams in real time. These solutions outperform MapReduce in real-time streaming applications but have large requirements for memory and CPU speed [9].
The Parallel Computing in Java (PCJ) library [10], a novel approach to writing parallel applications in Java, is yet another example of a library that can be used for Big Data processing. PCJ implements the Partitioned Global Address Space (PGAS) paradigm for running concurrent applications. The PCJ library allows running concurrent applications on systems comprising one or many multicore nodes, like standard workstations, nodes with hundreds of computational threads like Intel KNL processors [11], computing clusters, or even supercomputers [12].
The PCJ library won the HPC Challenge award in 2014 and has already been successfully used for the parallelization of selected applications. One of the applications utilizing PCJ uses a differential evolution algorithm to calculate the parameters of a C. elegans connectome model [13-15]. Another example is the parallelization of querying DNA (nucleotide) and protein sequence databases to find the sequences most similar to a given sequence [16, 17]. This solution was used in the development of ViraMiner, a deep learning-based method for identifying viral genomes in human biospecimens [18]. The PCJ library was also used for the processing of large Kronecker graphs that imitate real-world networks like online social networks [19, 20]. One more example of PCJ library usage is an approach to solving the k-means clustering problem [21], which is a popular benchmark in the data analytics field.
There have been works that attempt to implement MapReduce functionality in PGAS languages like X10 [22] or UPC [23, 24]. However, because those approaches extend the base languages with map and reduce functions, their users are required to express parallel algorithms in terms of the MapReduce framework. In this paper, the PGAS approach is directly compared to the MapReduce solution.
This paper presents a sorting algorithm written using the PCJ library in the PGAS paradigm, and its performance comparison with the Hadoop TeraSort implementation [25]. Both codes implement the same algorithm, a variation of the sample sort algorithm [26]. The TeraSort implementation is a conventional and popular benchmark used in the data analytics domain. It is important to compare exactly the same algorithm, with the same time and space complexity, keeping in mind that the algorithm is written using different programming paradigms.
The remainder of this paper is organized as follows. Section 2 introduces the MapReduce and PGAS programming models. Section 3 describes in detail the implementation
of the sorting algorithm in both models. Section 4 contains a performance evaluation of the implementation. The last two sections, 5 and 6, conclude this paper.
Programming models
This section gives an overview of the MapReduce and PGAS programming models.
MapReduce
MapReduce [6] is a programming model for processing large data sets. Processing data in MapReduce, as the name states, consists of two stages: mapping (transforming) values and reducing (combining) them.
One of the best-known MapReduce frameworks is Apache Hadoop. Processing data using Apache Hadoop is composed of five steps: load, map, shuffle, reduce, and store. An example of MapReduce processing is presented in Fig. 1.
PGAS
PGAS [27] is a programming model for writing general-purpose parallel applications
that can run on multiple nodes with many Central Processing Units (CPUs).
The main concept of the PGAS model is a global view of memory [27]. The global view is irrespective of whether a machine has true shared memory or the memory is distributed.
Processors jointly execute a parallel algorithm and communicate via memory that is conceptually shared among all processes [28]. Underneath, the global view is realized by several memories that belong to different processors. In other words, the global address space is partitioned over the processors [29] (cf. Fig. 2).
There are many implementations of the PGAS model, like Chapel [30], Co-Array Fortran [31], Titanium [32], UPC [33], X10 [34], APGAS [35] or, presented in this paper, the PCJ library.
Fig. 1 Processing data using the MapReduce model. The schema represents a WordCount example with words shorter than 4 characters filtered out
Each PGAS implementation is based on three basic principles described in [27]. According to the first principle, each processor has its own local memory (storage), and part of it can be marked as private to make it not visible to other processors. The second principle is related to flagging part of the processor's storage as shared, i.e. available to other processors. Implementation of sharing can be done through the network with software support, directly by hardware shared memory, or by using Remote Direct Memory Access (RDMA). The affinity of every shared memory location to a processor is the third principle. Access time to the local processor's memory is short, whereas access to the memory of other processors, possibly through the network, can lead to high access latency. The information about memory affinity is available to the programmer to help produce efficient and scalable applications, as access to other processors' memory can be orders of magnitude slower.
The PGAS model implementations vary, i.a., in the way that remote memory can be accessed. For example, the way that remote memory can be accessed by the threads in the PCJ library is similar to the Co-Array Fortran and the UPC implementations, where each thread can directly access another thread's memory. However, the X10 and APGAS implementations require that memory can be accessed only at the current place; accessing a remote place requires starting an activity on the remote place.
Some researchers [28] place the PGAS model in between shared-memory models such as OpenMP [36] and message-passing models like MPI [37]. The idea that all processes of a parallel application operate on one single memory is inherited from the shared-memory model, whilst the certain communication cost of accessing data on other processes is inherited from the message-passing model.
The PCJ library
The PCJ library [10] is a novel approach to writing parallel applications in Java. The application can utilize both multiple cores of a node and multiple nodes of a cluster. The PCJ library works with Java 8 but can be used with the newest Java versions without any problem. This is due to the fact that the library complies with the Java standards, does not use any undocumented functionality, like the infamous sun.misc.Unsafe class, and does not require any additional library that is not a part of the standard Java distribution.
The PCJ library implements the PGAS programming model. It fulfils the basic principles described in the previous section. Implicitly, every variable is marked as private, i.e. local to the thread. Multiple PCJ threads, i.e. PCJ executable units (tasks), can be running on a single JVM, and the standard sharing of data between threads inside the JVM is available. A programmer can mark class fields as shareable.
Fig. 2 View of the memory in the PGAS model. Each computing node, possibly consisting of many CPUs, has its own memory. The computing nodes are connected through the network. The address space of all the nodes' memories is conceptually treated as a single global memory address space
A shareable variable's value can be accessed by PCJ threads through library method invocations. That makes the second principle fulfilled. The affinity principle is also fulfilled, as each shareable variable is placed on a specific PCJ thread. A diagram of memory affinity and its division into private and shared variables in the PCJ library is presented in Fig. 3.
The main construct of an application using PCJ is the PCJ class. This class contains fundamental static methods for implementing parallel constructs like thread numbering, thread synchronization and data exchange.
The communication details are hidden from the user's perspective, and the methods are the same when used for intra- and inter-node communication.
Most of the methods use one-sided asynchronous communication, which makes programming easy and allows overlapping communication and computation to a large extent. The asynchronousness is achieved by returning a future object implementing the PcjFuture<T> interface, which has methods for waiting for a specified maximum time or waiting without bound for the computation to complete. There exists a synchronous variant of each asynchronous method that is just a wrapper for the asynchronous method with an invocation of the unbounded waiting method.
Despite calling the PCJ executable units threads, an execution using PCJ uses a constant number of PCJ threads for the whole execution. In the current stable version of the PCJ library, the StartPoint interface, an entry point for execution, is the parameter of the PCJ.executionBuilder(-) method. The method returns a builder object for setting up the computing nodes, with methods for starting the execution: the start() or deploy() methods. The architectural details of the execution are presented in Fig. 4. Multiple PCJ threads are part of the JVM that is running on the physical node. Communication between PCJ threads within a JVM uses local workers. Communication between PCJ threads on different nodes uses sockets to transfer data through the network. The transferred data is handled by remote workers.
Fig. 3 Diagram of the PCJ computing model (from [38]). Arrows present possible communication using library methods acting on shareable variables
Each PCJ thread has its own set of shareable variables, i.e. the variables that can be used for exchanging data between PCJ threads. Each shareable variable is a field of a regular class. The class with shareable variables is called a storage. To have access to such a variable, an enum class has to be created with a @Storage annotation pointing to the class containing the variable, with the name of the variable as an enum constant. In one code base, there can be many storages, and the ones that will be used in the current execution have to be registered using the PCJ.registerStorage(-) method or, preferably, by annotating the StartPoint class with the @RegisterStorage annotation with the proper enum class name as a parameter. To access a shareable variable, a PCJ thread has to provide the id of a peer PCJ thread and the variable name as an enum constant name.
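As an illustration of these constructs, the minimal sketch below declares one shareable variable and reads its remote copy. The class, field and node names are illustrative; the annotations and methods follow the PCJ API described above and in [38, 39], and exact signatures may differ between PCJ versions.

```java
import org.pcj.PCJ;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

// Minimal sketch: one shareable variable ("value") with affinity to each PCJ thread.
@RegisterStorage(MinimalExample.Shared.class)
public class MinimalExample implements StartPoint {

    @Storage(MinimalExample.class)
    enum Shared { value }      // the enum constant names the shareable field below

    long value;                // shareable variable; every PCJ thread owns its copy

    @Override
    public void main() {
        value = PCJ.myId();    // each thread stores its own id in its copy
        PCJ.barrier();

        if (PCJ.myId() == 0) {
            // read the copy of "value" that has affinity to PCJ thread 1
            long remote = PCJ.<Long>get(1, Shared.value);
            System.out.println("value on thread 1 = " + remote);
        }
    }

    public static void main(String[] args) {
        // two node entries are listed here, giving two PCJ threads in total
        PCJ.executionBuilder(MinimalExample.class)
           .addNode("localhost")
           .addNode("localhost")
           .deploy();
    }
}
```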
More detailed information about the PCJ library can be found in [38, 39].
Methods
Apache Hadoop is the most widespread and well-known framework for processing huge amounts of data. It works well with non-iterative jobs, for which the intermediate step data does not need to be stored on disk.
There are papers [40, 41] that show that the PCJ implementation of some benchmarks scales very well and outperforms the Hadoop implementation, even by a factor of 100. One of the benchmarks calculates an approximation of the value of π applying the quasi-Monte Carlo method (employing a 2-dimensional Halton sequence), using the code included in the Apache Hadoop examples package. Another application processes large Kronecker graphs that imitate real-world networks with the Breadth-First Search (BFS) algorithm. Another was the WordCount benchmark, based on the code included in the Apache Hadoop examples package, that counts how often words occur in an input file. However, one could argue that these benchmarks, probably omitting the last one, presented in the aforementioned papers were not perfectly suited for Hadoop processing. For this reason, a conventional, widely used benchmark for measuring the performance of Hadoop clusters, the TeraSort benchmark, was selected and evaluated.
Fig. 4 Diagram of the PCJ communication model. Arrows present local and remote communication
The TeraSort is one of the widely used benchmarks for Hadoop. It measures the time to sort a different number of 100-byte records. The input file for the TeraSort benchmark can be created using the teragen application from the Apache Hadoop package. The application generates a file (or files) with random records. Each record is 100 bytes long and consists of a 10-byte key and a 90-byte value.
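For reference, the 100-byte record layout can be modelled as in the sketch below; the field sizes follow the description above, while the class itself is illustrative and not part of either implementation.

```java
import java.util.Arrays;

// Illustrative holder for one TeraSort record: a 10-byte key followed by a 90-byte value.
final class Record {
    static final int RECORD_LENGTH = 100;
    static final int KEY_LENGTH = 10;

    final byte[] key;
    final byte[] value;

    Record(byte[] raw) {
        if (raw.length != RECORD_LENGTH) {
            throw new IllegalArgumentException("expected a 100-byte record");
        }
        this.key = Arrays.copyOfRange(raw, 0, KEY_LENGTH);
        this.value = Arrays.copyOfRange(raw, KEY_LENGTH, RECORD_LENGTH);
    }
}
```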
Sample sort algorithm
The TeraSort is an implementation of a sample sort algorithm [26].
The sample sort (or Samplesort) algorithm is a divide-and-conquer algorithm. It is a generalization of the quicksort algorithm. It uses p - 1 pivots (or splitters), whereas quicksort uses only one pivot. The pivot elements are sampled from the input data and then sorted using another sorting algorithm. The input data is divided into p buckets according to the pivot values. Then the buckets are sorted. In the original sample sort algorithm, the buckets are sorted recursively using the sample sort algorithm, but if a bucket's size is below some threshold, another sorting algorithm is used. Eventually, the concatenation of the buckets produces the sorted output.
The algorithm is well suited for parallelization. The number of pivots is set according to the number of computational units (processors), p. Input data is split evenly among the processors. Proper selection of the pivots is a crucial step of the algorithm, as the bucket sizes are determined by the pivots. Ideally, the bucket sizes are approximately the same among processors, therefore each processor spends approximately the same time on sorting.
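The bucketing idea can be sketched as follows, with long keys standing in for the 100-byte records and a binary search over the sorted pivots locating the destination bucket; names and types are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BucketAssignment {
    // Distribute elements into p buckets delimited by p - 1 sorted pivots.
    static List<List<Long>> toBuckets(long[] elements, long[] sortedPivots) {
        int p = sortedPivots.length + 1;
        List<List<Long>> buckets = new ArrayList<>(p);
        for (int i = 0; i < p; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long element : elements) {
            int pos = Arrays.binarySearch(sortedPivots, element);
            // binarySearch returns (-insertionPoint - 1) when the element is not a pivot
            int bucket = pos >= 0 ? pos : -pos - 1;
            buckets.get(bucket).add(element);
        }
        return buckets;
    }
}
```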
The average-case time complexity of the parallel algorithm, where p - 1 is the number of pivots (and thus there are p processors) and n is the number of input elements, is as follows. Finding p - 1 pivots costs O(p), sorting the pivots is O(p log p), broadcasting the sorted pivots is O(p log p), reading the input data and placing elements into buckets by p processors is O((n/p) log p), scattering the buckets to the proper processors is O(n/p), sorting the buckets by p processors is O((n/p) log(n/p)), and concatenation of the buckets is O(log p). In total, the average-case time complexity of the algorithm is:
O((n/p) log(n/p) + p log p)
In the worst case, all but one bucket could have only 1 element, and the rest of the elements would belong to one bucket. The overall time complexity in the worst-case scenario is:
O(n log n + p log p)
In the previous calculations, it is assumed that the average-case and worst-case time complexity of the inner sorting algorithm is O(n log n) and of finding the proper bucket is O(log n).
Hadoop TeraSort implementation
The TeraSort, as mentioned before, is an implementation of the sample sort algorithm and is written using a standard map/reduce sort [42].
The implementation of TeraSort for Hadoop used here was the one included in the Apache Hadoop examples package. This code was used to win the annual general-purpose terabyte sort benchmark in 2008 [25].
In the Apache Hadoop examples package, there is also a trivial implementation of a Sort program that uses the framework to fragment and sort the input values. However, it requires the use of the TeraInputFormat and TeraOutputFormat classes from the TeraSort implementation to properly read and write the generated input data. Removing the partitioning code from TeraInputFormat and leaving just the code for storing records (key and value) resulted in generating the wrong output sequence; the validation of the output sequence failed.
The TeraSort implementation starts with record sampling. The input sampling uses the default number of 100,000 sampled records. The sampled records are sorted, evenly selected as split points, and written into a file in the Hadoop Distributed File System (HDFS). The sampling is done just before starting the mapper tasks.
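The even selection of split points from the sorted sample can be sketched as follows; long keys stand in for the 10-byte keys, names are illustrative, and duplicate handling is ignored.

```java
final class SplitPoints {
    // Pick numPartitions - 1 evenly spaced split points from the sorted sample.
    static long[] select(long[] sortedSamples, int numPartitions) {
        long[] splits = new long[numPartitions - 1];
        for (int i = 0; i < splits.length; i++) {
            // position of the (i + 1)-th partition boundary within the sample array
            int index = (int) ((long) (i + 1) * sortedSamples.length / numPartitions);
            splits[i] = sortedSamples[index];
        }
        return splits;
    }
}
```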
The benchmark uses a custom partitioner and the split points to ensure that all of the keys in reducer i are less than each key in reducer i + 1. The custom partitioner uses a trie data structure [43]. The trie is used for finding the correct partition quickly. The split file is read by the custom partitioner to fill the trie. In the implementation, the trie has a root with 256 children (intermediate nodes), one for each possible byte value, and each of the children has 256 children, the second level of intermediate nodes, again one for each possible byte value. The next level of the trie has leaf nodes. Each leaf node contains information about the possible index ranges of split points for a given key prefix. An example of the trie is presented in Fig. 5. Figuring out the right partition for a given key is done by looking at the first and then the second byte value, and then comparing the key with the associated split points.
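A simplified sketch of this idea is shown below. Instead of explicit trie nodes, it precomputes, for every two-byte key prefix, the range of split points that still has to be compared against the full key; it is only an approximation of the actual TeraSort partitioner, with illustrative names.

```java
// Two-level prefix lookup: split points outside [lower, upper) for a key's 2-byte
// prefix are already decided, so only the split points inside the range are compared.
final class PrefixPartitioner {
    private final byte[][] splitPoints;            // sorted split keys (at least 2 bytes each)
    private final int[] lower = new int[256 * 256];
    private final int[] upper = new int[256 * 256];

    PrefixPartitioner(byte[][] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints;
        int[] perPrefix = new int[256 * 256];
        for (byte[] split : sortedSplitPoints) {
            perPrefix[prefixOf(split)]++;
        }
        int smaller = 0;
        for (int prefix = 0; prefix < 256 * 256; prefix++) {
            lower[prefix] = smaller;               // split points with a smaller prefix
            smaller += perPrefix[prefix];
            upper[prefix] = smaller;               // split points up to and including this prefix
        }
    }

    // Returns the number of split points not greater than the key, i.e. the partition index.
    int partitionFor(byte[] key) {
        int prefix = prefixOf(key);
        int index = lower[prefix];
        while (index < upper[prefix] && compare(splitPoints[index], key) <= 0) {
            index++;
        }
        return index;
    }

    private static int prefixOf(byte[] key) {
        return ((key[0] & 0xFF) << 8) | (key[1] & 0xFF);
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```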
The mapping function is the identity function, as the records are not modified during sorting.
The key/value pairs are sorted before passing them to the reducer tasks. Records are sorted by comparing the keys' data using a standard byte-to-byte comparison technique in the shuffle step.
Fig. 5 Diagram of the trie data structure. The greyed circle is the root of the trie, brown circles are intermediate nodes, and green circles are leaf nodes; numbers in hexadecimal form represent the index and byte associated with a node. A red outline represents the path for finding the partition associated with the bytes: k, m, n
The reducer function is also an identity function. However, the reducer receives all values associated with a key as a list, thus it applies the identity function to each value in the input list, returning multiple pairs with the same key and various values.
In the end, the sorted data, i.e. the returned key/value pairs, is stored back to HDFS. The directory with results contains multiple output files, one file per reducer.
The full code of the benchmark is available at GitHub [44]. The directory contains many files, but the benchmark consists of 5 Java files: TeraSort.java, TeraSortConfigKeys.java, TeraInputFormat.java, TeraOutputFormat.java, and TeraScheduler.java. Those files in total contain 831 physical lines of code as reported by the cloc application [45] and 617 logical lines of code as reported by the lloc application [46].
PCJ implementation
The PCJ implementation of the TeraSort benchmark is a variation of the sample sort algorithm.
The algorithm used here is almost the same as the one used for the TeraSort algorithm. It samples 100,000 records and evenly selects one pivot per PCJ thread (thus the implementation name is OnePivot). There exists a simpler pivot-selecting algorithm, where instead of sampling 100,000 records, each PCJ thread takes a constant number (e.g. 3 or 30) of pivots, but it does not generate as good a data distribution (that implementation is named MultiplePivots). However, the split-calculating time in both algorithms is negligible compared to the total execution time. Moreover, the performance is not much worse, as presented in [47].
The execution is divided into 5 steps, similar to Edahiro's Mapsort described in [48]. Figure 6 presents an overview of the algorithm as a flow diagram. Table 1 contains a description of the algorithm steps. A detailed description of the basic algorithm, but in the multiple-pivots-per-PCJ-thread variant, with code listings, is available in [47].
The basic PCJ implementation uses a single file as the input file and writes the result to a single output file. The total number of records is derived from the input file size. Every thread reads its own portion of the input file. The number of records per thread is roughly equal for all threads. If the total number of records divides with a remainder, the threads with an id less than the remainder have one record more to process.
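The division of records among threads described above can be sketched as follows; the method names are illustrative, and in the actual implementation the total record count is derived from the input file size.

```java
final class InputSplit {
    // Number of records assigned to the given PCJ thread; threads with an id lower
    // than the remainder get one extra record.
    static long recordsForThread(long totalRecords, int threadCount, int threadId) {
        long base = totalRecords / threadCount;
        long remainder = totalRecords % threadCount;
        return base + (threadId < remainder ? 1 : 0);
    }

    // Index (0-based) of the first record handled by the given thread.
    static long firstRecordOfThread(long totalRecords, int threadCount, int threadId) {
        long base = totalRecords / threadCount;
        long remainder = totalRecords % threadCount;
        return threadId * base + Math.min(threadId, remainder);
    }
}
```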
Implementation variants
The basic implementation was optimized to obtain better performance, resulting in new variants of the implementation. The descriptions of the optimized implementations are presented in Sect. 4.
Fig. 6 Sort algorithm steps: reading pivots; reading input and placing records into proper buckets; exchanging bucket data between threads; sorting data stored in buckets; writing data from buckets
The full source codes of all benchmark implementations are available at GitHub [52]. Each PCJ implementation is just one file that contains, depending on the variant, 330-410 physical lines of code as reported by the cloc application [45] and 226-282 logical lines of code as reported by the lloc application [46].
Results
e performance results presented in the paper have been obtained using the Hasso-
Plattner Institute (HPI) Future SOC Lab infrastructure.
Hardware andsoftware
e performance results presented in the paper has been obtained using the 1000Core
Cluster of HPI Future SOC Lab. Table2 contains an overview of the used hardware.
Table3 contains information about the software used for benchmarks.
Apache Hadoop configuration. Previous Hadoop benchmarks were done using a dedicated Hadoop cluster. However, the HPI Future SOC Lab cluster, used to obtain the data for this paper, is a general-purpose cluster. To compare PCJ with Apache Hadoop it was necessary to properly set up and launch the Hadoop cluster on the Future SOC Lab cluster with the SLURM submission system. The standard mechanism of starting up the Hadoop cluster uses an ssh connection, which is unavailable between nodes of the cluster. A job that requests eight nodes was selected to work around the issue. The job master node was selected to be the Hadoop Master that starts the namenode, secondarynamenode and resourcemanager as daemon processes on that node, and datanode (and for some benchmarks nodemanager) daemons on all allocated nodes by executing the srun command in the background. Thanks to the cluster configuration, there was no time limit for a job, and thus the job could run indefinitely.
Table 1 Description of algorithm steps
Reading pivots: Pivots are read evenly from a specific portion of the input file by each thread. Then PCJ Thread-0 performs a reduce operation gathering the pivot data from the other threads. The list is sorted using the standard Java sort algorithm [49]. Possible duplicate records are removed from the list. Then the evenly placed pivots are taken from the list and broadcast to all the threads. A thread starts reading the input file when it receives the list.
Reading input: Pivots are the records that divide the input data into buckets. Each thread has its own set of buckets that will be used for exchanging data between threads. Each bucket is a list of records. While reading the input, the record's bucket is deduced from its possible insert position in the pivot list by using Java's built-in binary search method. The record is added to the right bucket.
Exchanging buckets: After reading the input file, it is necessary to send the data from the buckets to the threads that are responsible for them. The responsibility here means sorting and writing to the output file. After sending the bucket data to all other threads, it is necessary to wait to receive data from all of them.
Sorting: After receiving every bucket's data, it is time to sort. Each bucket is shredded into smaller arrays, one array per source thread. It is necessary to flatten the array and then sort the whole big array of records. The standard Java sort algorithm [49] for non-primitive types is used for sorting the array. The sort algorithm, called timsort, is a stable, adaptive, iterative mergesort, whose implementation is adapted from Tim Peters's list sort for Python [50], which uses techniques from [51].
Writing output: Writing the bucket data to a single output file in the correct order is the last step. This is the most sequential part of the application. Each thread has to wait for its turn to write data to the output file.
The Hortonworks documentation [53], as well as the IBM memory calculator worksheet [54], describe the way of calculating the memory settings for Apache Hadoop. However, it is not well suited for the Future SOC Lab cluster, as there is only 1 disk, and the calculated values cause an InvalidResourceRequestException to be thrown while submitting a job to the Hadoop cluster. Manipulating the calculated values can fix the issue, but then generating input with 10^7 records takes more than 8 minutes, while it can take less than 2 minutes with better configurations. Eventually, the memory configuration values were selected differently.
Almost all of the default values of the Apache Hadoop configuration were left unmodified. The most important changes to the configuration files are presented below.
In the yarn-site.xml file, the physical memory available for containers (yarn.nodemanager.resource.memory-mb) and the maximum allocation for every container (yarn.scheduler.maximum-allocation-mb) were set to 821600 MB, the minimum allocation for every container (yarn.scheduler.minimum-allocation-mb) was set to 128 MB, the enforcement of virtual memory limits was turned off (yarn.nodemanager.vmem-check-enabled), and the number of vcores that can be allocated for containers (yarn.nodemanager.resource.cpu-vcores) and the maximum allocation of vcores (yarn.scheduler.maximum-allocation-vcores) were set to 80.
Table 2 1000Core Cluster ofHPI Future SOC Lab Hardware overview
Hardware
Cluster 25 computational nodes
using at most 8 nodes (job resource request limit)
Nodes 4 Intel Xeon E7‑4870 processors
with maximum 2.40 GHz clock speed
Processor 10 cores (each 2 threads in hyper‑threading mode)
in total 80 threads per node
RAM 1 TB of RAM on each node more than 820 GB free, i.e. available for the user
Network 10‑Gigabit ethernet
Intel 82599ES 10‑Gigabit Ethernet Controller
Storage /home directory—is a NAS mounted drive with 4 TB capacity (about
1.5 TB free); available from all nodes
/tmp directory—1 TB SSD drive; exclusively mounted on each node
Table 3 1000Core Cluster ofHPI Future SOC Lab Software overview
Software
Operating system: Ubuntu Linux
18.04.4 LTS (Bionic
Beaver)
Job scheduler: SLURM, version 18.08.7
Java Virtual Machine: Oracle JDK 13
Apache Hadoop: Stable version: 3.2.1
PCJ: Stable version: 5.1.0
The value of the memory requested for all map tasks and reduce tasks is set to 40,000 MB (mapreduce.map.resource.memory-mb and mapreduce.reduce.resource.memory-mb in the mapred-site.xml file) and the application master memory is set to 128,000 MB (yarn.app.mapreduce.am.resource.memory-mb in the mapred-site.xml file). The arbitrarily selected value of 40,000 MB allows for full utilization of the memory, even if not all tasks assigned to cores use the whole accessible memory. The value also does not force the scheduler to use a selected number of tasks but allows it to dynamically set the proper number of mappers and reducers.
The dfs.replication value in the hdfs-site.xml file is set to 1, as the presented tests do not need a resilient solution and the benchmarks should measure the TeraSort implementations, not HDFS.
PCJ configuration. The PCJ runtime configuration options were left unmodified except for a -Xmx820g parameter, which means that the maximum Java heap size is set to 820 GB, a -Xlog:gc*:file=gc-{hostname}.log parameter that enables printing all messages of the garbage collectors into separate output files, and a -Dpcj.alive.timeout=0 parameter that disables the PCJ mechanism for actively checking node liveness.
Benchmarks
The results were obtained for various numbers of records (from 10^7 up to 10^10 records), i.e. the size of the input files is from about 1 GB up to about 1 TB. The input files were generated using the teragen application from the Apache Hadoop package. The application generates the official GraySort input data set [55]. Both the PCJ implementation and the Hadoop implementation were sorting exactly the same input sequences and produce single or multiple output files. Of course, the generated output files for both implementations produce the same output sequence.
The results are based on the total time needed for the benchmark execution, excluding the time needed to start the application. It is the easiest measurement to test, and it clearly presents the effectiveness of the implementation. The PCJ implementation outputs the total time, while for the Hadoop implementation the total time was calculated as the time elapsed between the terasort.TeraSort: starting and terasort.TeraSort: done log messages, written with millisecond precision.
As PCJ and Apache Hadoop both run on the JVM, to mitigate the influence of garbage collection, warming up and just-in-time compilation on the measurements, the benchmark applications were run several times (at least 5 times), and the shortest execution time was taken as the result.
OnePivot scaling
Figure7 presents the total time needed to execute the basic sort implementation in PCJ
depending on the total thread used. e benchmark was run on 1, 2, 4and 8nodes with
80threads per node (80, 160, 320and 640threads in total).
The small data sizes do not show scalability. It is visible for larger records count
109
. With the higher count of records to sort, the scalability of the execution is bet-
ter. Unfortunately, due to insufficient disk space, there was no possibility to check
the scalability for
1010
records. When sorting
107
and
108
records, the time needed
for execution is almost constant, irrespective to the number of used threads. It is an
It is an expected behaviour, as the bucket sizes are small and maintenance, such as reading pivots or exchanging data, consumes the time.
Moreover, a vast amount of time is spent on sequentially writing the output file, as presented in Fig. 8. The figure shows the time spent on each step for the execution on 8 nodes, 80 threads per node, with regard to PCJ Thread-0, as different steps can be executed on different PCJ threads at the same point in time.
Fig. 7 Strong scaling of the basic sort implementation in PCJ depending on the allocated thread number. Execution on 1, 2, 4 and 8 nodes (80 threads per node). The dashed line represents the ideal scaling for 10^9 records
Fig. 8 Time spent on each step depending on the number of records to process. Time with regard to PCJ Thread-0; using 640 threads (8 nodes, 80 threads per node)
Various writing modes
To reduce the negative performance impact of sequentially writing data into the output file, two further implementation variants were created.
The first variant, called ConcurrentWrite, uses the memory-mapped file technique to concurrently write data to a single output file without the need for self-managed sequential access to the file. At the beginning of the execution, the output file with the proper size is created by PCJ Thread-0 using the setLength(-) method of the RandomAccessFile class. This is an important step, as concurrently changing the size by many threads using the FileChannel class, which is used to map the file into memory, causes system error exceptions to be thrown. In the writing step, the adequate portion of the file is mapped into memory in read-write mode, modified, and finally saved. The size of the mapped region of the file in the benchmarks was set to contain at most 1,000,000 records. Using this technique, the operating system is responsible for synchronization and for writing the data to disk.
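A sketch of this writing scheme is shown below: the file is pre-sized once, and each thread then maps and fills only its own region. Paths, sizes and class names are illustrative; the actual variant maps regions of at most 1,000,000 records.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MappedWriteSketch {
    static final int RECORD_LENGTH = 100;

    // Done once (e.g. by PCJ Thread-0) before any region is mapped, so that later
    // mappings do not change the file size concurrently.
    static void preallocate(String path, long totalRecords) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.setLength(totalRecords * RECORD_LENGTH);
        }
    }

    // Each thread maps its own region of the shared file in read-write mode and fills it;
    // the operating system is responsible for flushing the mapped pages to disk.
    // The mapped region must stay below 2 GB, which holds for 1,000,000 records.
    static void writeRegion(String path, long firstRecord, byte[][] records) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE,
                    firstRecord * RECORD_LENGTH, (long) records.length * RECORD_LENGTH);
            for (byte[] record : records) {
                buffer.put(record);
            }
        }
    }
}
```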
In the second implementation variant, called MultipleFiles, each PCJ thread creates one output file that the thread is exclusively responsible for. In this scenario, no additional synchronization is required, as the filenames rely only on the threadId. This variant does not produce exactly the same output as OnePivot and ConcurrentWrite, as it consists of multiple output files that have to be concatenated to produce a single output file. The concatenation time, using the standard cat command, of 640 files, each of 156,250,000 bytes (in total about 100 GB, the size of 10^9 records), takes at least 9 m 53 s (median: 12 m 21 s, mean: 13 m 49 s, maximum: 23 m 38 s), based on 50 repeats. However, the Hadoop benchmark also produces multiple output files, so the concatenation time is not included in the total execution time in the presented results.
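The per-thread output of the MultipleFiles variant can be sketched as follows; the directory and file naming are illustrative, and only the dependence of the file name on the thread id matters.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class MultipleFilesSketch {
    // Each PCJ thread writes its sorted records to its own output file; because the
    // file name depends only on the thread id, no synchronization between threads is needed.
    static void writeOwnFile(String outputDir, int threadId, byte[][] sortedRecords) throws IOException {
        String path = outputDir + "/output-part-" + threadId;
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(path))) {
            for (byte[] record : sortedRecords) {
                out.write(record);
            }
        }
    }
}
```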
Figure9 presents the time spent on each step when sorting
109
records, in regard to
PCJ read-0, using 640 threads (8 nodes, 80 threads per node). e results demon-
strate, that when less synchronization is required, the time needed for writing is shorter.
However, adding concatenation time to MultipleFiles results in similar or slightly worse
Fig. 9 Time spent on each step depending on the writing output implementation variant. Time with regard to PCJ Thread-0; using 640 threads (8 nodes, 80 threads per node); sorting 10^9 records
However, adding the concatenation time to MultipleFiles results in a total time similar to or slightly worse than ConcurrentWrite. Moreover, concurrently writing data into the same output file is about 33% faster than each PCJ thread waiting for its own turn to write its own portion of the output data.
Figures 10, 11 and 12 present timelines of the execution of the OnePivot, ConcurrentWrite and MultipleFiles benchmark implementations, respectively, sorting 10^7 records using 640 threads (8 nodes, 80 threads per node). The bars represent the number of threads executing the adequate algorithm step, without displaying waiting threads. The mix step is when a thread executes more than one algorithm step in the same timestep. The resolution of the data is 100 milliseconds. Each figure's x-axis spans the same range for better comparison, and the vertical dashed line represents the finish of the execution.
Fig. 10 Timeline of the execution of the OnePivot implementation while sorting 10^7 records. Execution on 640 threads (8 nodes, 80 threads per node). The resolution of the data is 100 milliseconds. Waiting threads are not displayed
Fig. 11 Timeline of the execution of the ConcurrentWrite implementation while sorting 10^7 records. Execution on 640 threads (8 nodes, 80 threads per node). The resolution of the data is 100 milliseconds. Waiting threads are not displayed. The vertical dashed line represents the finish of the execution. The x-axis spans the same range as for the OnePivot implementation
The seemingly empty part of the plot in Fig. 10 starting at 20 s corresponds to sequentially writing the results into the output file. The time needed for writing data by a single PCJ thread in the ConcurrentWrite implementation (cf. Fig. 11) is much longer, as many PCJ threads are executing the writing step for several seconds. The best implementation is MultipleFiles (cf. Fig. 12), which also allows writing the output data into a non-shared file system.
One more thing worth noting is the overlapping execution of the steps among threads. Some threads may still wait for bucket data from all other threads, but at the same time, the threads that have received all buckets proceed with sorting and writing. This is thanks to the asynchronous communication that is one of the main principles and advantages of the PGAS programming model, and thus especially of the PCJ library.
Output drive
Up to now, the output data in the benchmarks was written into a shared folder on the NAS drive. However, thanks to the MultipleFiles implementation, each PCJ thread can use its own path, which does not have to be shared anymore. That led to the next performance measurement, presented in Fig. 13. The /home path indicates the directory on the NAS-mounted drive, whereas the /tmp directory is located on the only locally accessible SSD drive.
At first glance, there is a strange data point when processing 10^9 records on 80 threads (1 node) and writing to the local SSD drive, as the total execution time is very similar to the processing using 2 nodes. It is the result of the internal processing of the bucket data exchange inside a node, which can be done concurrently without the bottleneck of the network connection. The behaviour does not occur when writing to the NAS drive, as that writing cannot be done in a truly parallel way.
Figure 14 presents the total execution time of the MultipleFiles implementation, saving the output data into the /tmp directory, using in total 80 threads on 1, 2 and 4 nodes, as well as 8 nodes with 10, 20, 40 and 80 threads per node. The XnYt label means Y threads per node on X nodes.
Fig. 12 Timeline of the execution of the MultipleFiles implementation while sorting 10^7 records. Execution on 640 threads (8 nodes, 80 threads per node). The resolution of the data is 100 milliseconds. Waiting threads are not displayed. The vertical dashed line represents the finish of the execution. The x-axis spans the same range as for the OnePivot implementation
Concurrent send
The concurrent writing of multiple output files onto the local SSD drive gives about a tenfold performance gain compared to sequentially writing a single output file onto the NAS-mounted drive. This consideration resulted in the fourth implementation, called ConcurrentSend. The implementation is based on concurrently sending data from the buckets while reading the input data. Bucket data is sent when the bucket is full, i.e. when the bucket contains a predetermined number of items.
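The flush-on-full logic can be sketched as below; the actual transfer would use the asynchronous PCJ communication methods described earlier, so the send call here is only a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

final class ConcurrentSendSketch {
    static final int BUCKET_CAPACITY = 1000;     // bucket size used in the later benchmarks

    private final List<List<byte[]>> buckets;

    ConcurrentSendSketch(int threadCount) {
        buckets = new ArrayList<>(threadCount);
        for (int i = 0; i < threadCount; i++) {
            buckets.add(new ArrayList<>(BUCKET_CAPACITY));
        }
    }

    // Called for every record while reading the input file: put the record into the
    // bucket of its destination thread and send the bucket as soon as it becomes full.
    void addRecord(int destinationThread, byte[] record) {
        List<byte[]> bucket = buckets.get(destinationThread);
        bucket.add(record);
        if (bucket.size() == BUCKET_CAPACITY) {
            sendBucket(destinationThread, bucket);
            buckets.set(destinationThread, new ArrayList<>(BUCKET_CAPACITY));
        }
    }

    // Placeholder for the asynchronous transfer of the bucket to the destination PCJ thread.
    private void sendBucket(int destinationThread, List<byte[]> bucket) {
        // in the real implementation, the serialized bucket data is sent with PCJ methods
    }
}
```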
Fig. 13 Scaling of the MultipleFiles implementation depending on the output path and the allocated thread number. Execution on 1, 2, 4 and 8 nodes (80 threads per node). The /home path is for a directory on the NAS-mounted drive, and the /tmp directory is located on the only locally accessible SSD drive
Fig. 14 The total execution time of the MultipleFiles implementation using a constant number of threads (80) and a constant number of nodes (8). Execution on 1, 2 and 4 nodes (80 threads in total) and on 8 nodes (10, 20, 40, 80 threads per node). The XnYt label means Y threads per node on X nodes. The /tmp directory is used as the output directory
Figure 15 shows the total execution time of the ConcurrentSend implementation depending on the record count, using 640 threads (8 nodes, 80 threads per node). The MultipleFiles implementation execution time is plotted for reference.
There is no visible increase in performance over the MultipleFiles execution for smaller input data sizes. The visible gain is for input data with 10^10 records. In this scenario, the overlap of reading input data and sending filled buckets outweighs the time lost on checking the bucket size, preparing a message to send, and processing incoming data.
Selecting the proper bucket size is very important for the performance; for the following benchmarks, the bucket size was set to 1000 records.
Writing intoHDFS
All previous benchmarks were focused on the performance of the PCJ TeraSort algo-
rithm implementation that uses NAS or locally mounted drives to store input and out-
put data. However, one of the main advantages of Apache Hadoop is the usage of HDFS
that can store really big files in a resilient way across many datanodes. Having that in
mind, the fifth implementation, called HdfsConcurrentSend, has been created. e
implementation is based on the ConcurrentSend version, but instead of using standard
Java IO classes to read and write files, the HDFS mechanism was used. Each datanode
uses the only locally accessible /tmp directory on the SSD drive for storing data blocks.
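Reading and writing through HDFS instead of java.io can be sketched with the Hadoop FileSystem client API as below; the paths are illustrative and this is not the benchmark code itself.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class HdfsIoSketch {
    // Copy data from one HDFS path to another using the HDFS client streams.
    static void copyThroughHdfs(String inputPath, String outputPath) throws IOException {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path(inputPath));
             FSDataOutputStream out = fs.create(new Path(outputPath))) {
            byte[] buffer = new byte[100 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```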
Submitting the Hadoop TeraSort job to the Hadoop cluster was done with additional parameters that set up the default number of possible map tasks and reduce tasks (-Dmapreduce.job.maps=Ntasks and -Dmapreduce.job.reduces=Ntasks, where Ntasks was calculated as the product of the node count and 80). The value Ntasks means that the upper limit of mappers and reducers is Ntasks, so concurrently there can be 2·Ntasks threads used.
Fig. 15 The total execution time of the ConcurrentSend implementation depending on the bucket size (100, 1000, 10,000 and 100,000 records) and the record count. Execution using 640 threads (8 nodes, 80 threads per node). The MultipleFiles implementation execution time is plotted for reference
The limit works for the reduce tasks, as stated in the Job Counters at the end of the job executions, whilst the number of map tasks depends on the input file size: for 10^7 and 10^8 records it is 640 mappers; for 10^9 records it is 1280 mappers; for 10^10 records it is 7680 mappers.
External mode Fig. 16 presents the total execution time of sorting
107
,
108
and
109
records using Apache Hadoop and PCJ implementations of TeraSort algorithm. e
compute nodes were not placed on the nodes with datanodes (the external mode). In
that situation, for Apache Hadoop, the proper number of nodemanagers were run on
the nodes without datanodes before submitting sorting job to the Hadoop cluster. e
executions using the PCJ library were done as usual by starting PCJ threads on the nodes
without datanodes. is behaviour is natural in the SLURM submission system, as there
was one job with allocated nodes for master and datanodes and the new jobs allocate
different nodes.
The results show that the PCJ execution performance is better in comparison to the Hadoop execution. This mode is unnatural for Hadoop, and it cannot take advantage of data placement. For 10^9 records both solutions scale, but the scaling of PCJ is much better. The results obtained for smaller numbers of records give lower scaling for PCJ and even higher execution times for the Hadoop execution.
Internal mode. As aforementioned, the external mode is not natural for Hadoop. The following benchmark results were obtained in the internal mode. Internal means that the computing nodes were placed on the same nodes as the datanodes and the total number of nodemanagers was constant during all of the benchmarks. It means that the containers which execute the tasks could be placed on every node without limit. The executions using the PCJ library were done in a way unnatural for SLURM, by externally attaching the execution to the Hadoop cluster job.
Fig. 16 The total execution time for the Hadoop and PCJ implementations of the TeraSort algorithm in the external mode. Sorting 10^7, 10^8 and 10^9 records. The compute nodes were not placed on the nodes with datanodes. Dashed lines represent the ideal scaling for 10^9 records for the PCJ and Hadoop executions
The total execution times of sorting 10^7, 10^8, 10^9 and 10^10 records using the Apache Hadoop and PCJ implementations of the TeraSort algorithm are presented in Fig. 17. The compute nodes were placed on the same nodes as the datanodes (the internal mode).
The internal results of Hadoop sorting are approximately constant, which suggests that there is no limit on the nodes the containers are running on. The Hadoop cluster has chosen to run the map and reduce tasks on all of the nodes to minimize the data that has to be transferred from datanodes to computing nodes. In the PCJ case, for 10^7, 10^8 and 10^9 records, PCJ uses exactly 80 threads per node. However, PCJ could not process 10^10 records in that configuration on fewer than 4 nodes, as the Garbage Collector was constantly pausing the execution. The presented results for 10^10 records were collected using exactly 8 nodes with 10, 20, 40 and 80 threads per node. The PCJ execution also takes advantage of running on the same nodes as the datanodes, and eventually, using 640 threads, the performance is much better for PCJ.
Figure18 presents the total execution time of the PCJ ConcurrentSend implementa-
tion while writing results directly into the /tmp directory and through HDFS in the
internal mode. e HDFS is also using the /tmp directory to store data. e data in the
non-HDFS variant is read from the shared drive while the HDFS variant also read data
from HDFS.
Hadoop memory setting. The big gap in the total execution time between PCJ and Hadoop made it necessary to verify the proper value of the maximum memory for map tasks and reduce tasks. Up to now, the value was set to 40 GB. The maximum numbers of map and reduce tasks are set to 80 for each type of task, resulting in a total of 160 tasks. Taking that into account, benchmarks with 5 GB, 10 GB, 20 GB and 40 GB maximum memory values were executed. Figure 19 shows the total execution time using Ntasks = 640 on 8 nodes, depending on the maximum memory for the map and reduce tasks.
Fig. 17 The total execution time for the Hadoop and PCJ implementations of the TeraSort algorithm in the internal mode. Sorting 10^7, 10^8, 10^9 and 10^10 records. The compute nodes were placed on the same nodes as the datanodes
The sort execution time is shortest for the memory limit set to 10 GB and 20 GB. For the largest input, 10^10 records, the memory limit set to 10 GB gives the best performance. Having in mind that on each node there is at least 820 GB of RAM and there are 40 physical cores (80 logical cores with hyper-threading), setting the maximum memory allocation pool for each JVM to 10 GB utilizes the whole memory without oversubscription (80 concurrently running tasks with 10 GB each use 800 GB, which fits within the roughly 820 GB available). The processing of 10^10 records with the value of memory requested for all map tasks and reduce tasks set to 5 GB failed due to an OutOfMemoryError.
Fig. 18 The total execution time of the PCJ ConcurrentSend implementation while writing directly into the /tmp directory and through HDFS in the internal mode. Execution using 640 threads (8 nodes, 80 threads per node). The /tmp directory is located on the only locally accessible SSD drive. The compute nodes were placed on the same nodes as the datanodes and also use the /tmp directory to store data
Fig. 19 The total execution time of Apache Hadoop with regard to the value of memory requested for all map and reduce tasks (5 GB, 10 GB, 20 GB, 40 GB). Sorting 10^7, 10^8, 10^9 and 10^10 records. The compute nodes were placed on the same nodes as the datanodes. Execution with Ntasks = 640 on 8 nodes. There is no data point for 5 GB and 10^10 records
Figure20 presents total execution time in a form of statistical box plots for sorting
1010
records collected in internal mode using 640 threads (
Ntasks
=
640
for Hadoop
execution), based on 50 executions. e values for Hadoop are presented for memory
requested value for all mappers and reducers set to 10GB, 20GB and 40GB. e PCJ
execution has set the -Xmx820g JVM parameter on each node. e ends of whiskers are
minimum and maximum values, a cross (
×
) represents an average value, a box repre-
sents values between 1st and 3rd quartiles, and a band inside the box is a median value.
e performance of PCJ is similar to the Hadoop when the memory is set to 10GB
and 20GB. e averages are very similar, but the minimal execution time is lower
for the Hadoop execution with the 10GB value of maximum requested memory for
all map and reduce tasks. Still, the PCJ execution times are consistent, whereas the
Hadoop gives a broad range of total execution time.
Figure21 contains the plots with the strong scalability of the Apache Hadoop and
PCJ implementations of TeraSort benchmark for a various number of records (i.e.
107
,
108
,
109
and
1010
records). e Hadoop version was run in the internal mode with
the 10GB value of maximum requested memory for each mapper and reducer. e
Hadoop scheduler independently could allocate tasks to the nodes, as the 8 nodes
were constantly running nodemanagers. e PCJ results were gathered in two sce-
narios. In the first scenario PCJ used 1, 2, 4 and 8 nodes (80 threads per node; Xn80t)
like in Fig.17. In the second scenario PCJ used 8 nodes with 10, 20, 40 and 80threads
per node (8nXt).
The total execution time depends largely on the choice of the number of nodes and threads in relation to the size of the data to process. Generally, for a constant number of processing threads, the application is more efficient when using more nodes. The PCJ implementation outperforms Hadoop for smaller input data sizes (10^7 and 10^8 records); the performance is roughly the same for 10^9 records.
Fig. 20 Statistics for execution time for sorting 10^10 records. Ends of whiskers are minimum and maximum values, a cross (×) represents an average value, a box represents values between 1st and 3rd quartiles, and a band inside the box is a median value. Note that the y-axis does not start at 0. (Axes: time [seconds/minutes]; categories: Hadoop with 10GB, 20GB and 40GB memory limits, and PCJ.)
Hadoop is more efficient for processing large data sets (10^10 records) with a lower number of threads, but when looking at the total execution time for the largest number of threads (640), the difference is barely visible.
Selecting a proper number of mappers and reducers is the role of the Hadoop scheduler, and the user does not have to think about it. The PCJ solution requires the user to start the execution on the right number of nodes and with a proper number of threads per node. A good rule of thumb is to execute the application on all of the available nodes and to start as many threads as there are logical cores per node, keeping in mind the size of the input to be processed per thread. The performance results in that situation should not be worse than for other solutions.
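As an illustration of this rule of thumb, the sketch below shows how a PCJ application could be started on a fixed set of nodes with a chosen number of threads per node. It assumes the PCJ 5 execution-builder API (PCJ.executionBuilder, addNode, start); the host names and the start-point class are placeholders, not the benchmark driver used in this paper.

    import org.pcj.PCJ;
    import org.pcj.StartPoint;

    public class LaunchSketch implements StartPoint {
        @Override
        public void main() {
            // Each PCJ thread reports its global id; real code would run the sort steps here.
            System.out.println("Hello from thread " + PCJ.myId() + " of " + PCJ.threadCount());
        }

        public static void main(String[] args) {
            // Listing a host N times requests N PCJ threads on that node.
            // Here: 2 nodes x 4 threads per node = 8 threads (placeholder host names).
            PCJ.executionBuilder(LaunchSketch.class)
               .addNode("node1").addNode("node1").addNode("node1").addNode("node1")
               .addNode("node2").addNode("node2").addNode("node2").addNode("node2")
               .start();
        }
    }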
Discussion
In this study, the PCJ library was used to prepare a parallel implementation of the TeraSort algorithm in the PGAS paradigm. Preparing a parallel application using the PCJ library was straightforward and could be done iteratively. When the current performance was not satisfactory, it was relatively easy to change a part of the code to make the application run more efficiently. The user's expertise in programming in Java was the only potential limiting factor for the implemented optimizations. No special upfront knowledge about the underlying technology was needed to write parallel code. The resulting code is easy to read and maintain: each step of the algorithm is clearly visible, and each executed action is written explicitly.

The performance of the basic implementation was not satisfactory, as only a small fraction of the execution was done in parallel. The writing step consumed most of the run time.
Fig. 21 Strong scaling of the Hadoop and PCJ implementations of the TeraSort benchmark for various numbers of records. Each map and reduce task in Hadoop was requested with 10 GB of memory. The compute nodes were placed on the same nodes as datanodes. The results for PCJ were gathered using 1, 2, 4 and 8 nodes (with 80 threads per node; Xn80t) and using 8 nodes (with 10, 20, 40 and 80 threads per node; 8nXt). (Panels: 10^7, 10^8, 10^9 and 10^10 records; axes: time [minutes] vs. number of threads: 80, 160, 320, 640.)
That led to the next implementations, which wrote results concurrently, using either a single memory-mapped output file or multiple output files. The latter performed better, as no synchronization was involved in writing. Moreover, the possibility of writing data into multiple output files allowed for the use of the local drive. Writing onto the only locally accessible drive makes the implementation even more efficient. The presented results show that writing data onto the only locally accessible drive results in better performance. This holds regardless of the number of records to process; however, the more records there are to process, the higher the performance gain.
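As a rough illustration of the multiple-output-files approach, the sketch below lets each thread write its own partition to a separate file on a local drive, so no synchronization between writers is needed. It is a simplified, self-contained example: the directory, file naming and record layout are assumptions, not the exact code used for the benchmark.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class LocalWriterSketch {
        /** Writes one thread's sorted records to its own file, e.g. /tmp/terasort-part-7 (assumed naming). */
        static void writePartition(int threadId, byte[][] sortedRecords) throws IOException {
            Path out = Paths.get("/tmp", "terasort-part-" + threadId);
            try (FileChannel channel = FileChannel.open(out,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING)) {
                for (byte[] record : sortedRecords) {
                    channel.write(ByteBuffer.wrap(record));  // each writer owns its file: no locking needed
                }
            }
        }
    }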
Execution using the same number of threads on different numbers of nodes shows that the performance is highest on the largest number of nodes. This is due to the high contention in the copying mechanism when data is exchanged on a node with a larger number of threads. Moreover, doubling the number of threads on a node does not decrease the execution time by a factor of two. For the smaller input data sizes, the execution even takes more than two times longer when using 640 threads (more than 16 s) than when using 80 threads (less than 6 s).
Overlapping the reading and the exchange of the buckets' data resulted in the subsequent sort implementation. The possibility of sending and receiving data asynchronously, almost as if using local machine memory, is one of the main advantages of the PGAS programming model. This implementation outperformed the non-overlapping read-and-exchange version for large data sizes. However, selecting the proper bucket size is crucial for performance. Too small a bucket size causes a lot of small messages, and the latency consumes the eventual performance gain. On the other hand, too big a bucket size results in the bucket data being sent at the end of the data reading stage, as in the non-overlapping version, but with the additional work during every insert of checking whether the bucket is full.
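The send-when-full logic described above can be sketched as follows. The snippet is deliberately generic: the BucketSender interface stands in for the actual asynchronous PCJ transfer to the destination thread, and the bucket capacity is a tunable placeholder.

    import java.util.ArrayList;
    import java.util.List;

    public class OverlapSketch {
        /** Placeholder for the real asynchronous transfer of a full bucket to its destination thread. */
        interface BucketSender {
            void sendAsync(int destinationThread, List<byte[]> records);
        }

        private final int bucketCapacity;          // too small: many tiny messages; too big: no overlap
        private final List<List<byte[]>> buckets;  // one bucket per destination thread
        private final BucketSender sender;

        OverlapSketch(int threadCount, int bucketCapacity, BucketSender sender) {
            this.bucketCapacity = bucketCapacity;
            this.sender = sender;
            this.buckets = new ArrayList<>(threadCount);
            for (int i = 0; i < threadCount; i++) {
                buckets.add(new ArrayList<>(bucketCapacity));
            }
        }

        /** Called for every record while reading input; sends a bucket as soon as it fills up. */
        void insert(int destinationThread, byte[] record) {
            List<byte[]> bucket = buckets.get(destinationThread);
            bucket.add(record);
            if (bucket.size() >= bucketCapacity) {            // the per-insert check mentioned in the text
                sender.sendAsync(destinationThread, bucket);
                buckets.set(destinationThread, new ArrayList<>(bucketCapacity));
            }
        }

        /** Called once after the reading stage to flush the remaining, partially filled buckets. */
        void flush() {
            for (int dest = 0; dest < buckets.size(); dest++) {
                if (!buckets.get(dest).isEmpty()) {
                    sender.sendAsync(dest, buckets.get(dest));
                }
            }
        }
    }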
The last presented implementation used HDFS for reading input data and writing output data. This implementation was a little harder to write than the previous ones, as it required additional libraries, setup of the HDFS connection, and special treatment of input and output files in HDFS. Comparing the performance of reading data from the shared drive and writing directly onto the locally accessible drive with the performance of reading and writing data using HDFS shows the overhead of using HDFS. The overhead is large for small data sizes (it doubles the execution time for 10^7 elements) and decreases with increasing data size: execution takes about 3-4% longer for 10^9 and 10^10 elements.
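The extra setup mentioned above essentially means using the Hadoop file-system client instead of java.io/java.nio. A minimal sketch of reading and writing through HDFS is shown below; the namenode URI and the paths are placeholders, and the body only copies bytes instead of performing the actual read-sort-write logic.

    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIoSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder namenode address and paths; real values depend on the cluster setup.
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf)) {
                try (java.io.InputStream in = fs.open(new Path("/terasort/input/part-0"));
                     OutputStream out = fs.create(new Path("/terasort/output/part-0"))) {
                    in.transferTo(out);   // stand-in for the actual processing of the records
                }
            }
        }
    }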
In contrast to the PCJ implementations, the Hadoop implementation assumes upfront knowledge about the way Hadoop processes the data. The main part of the algorithm, dividing data into buckets and passing data between the map and reduce steps, is taken care of internally by Hadoop once a special user-defined partitioner is provided. Moreover, the sorting of records within a partition is also done internally by Hadoop before the key/value pairs are passed to the reducers.
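For illustration, a user-defined partitioner in Hadoop is a small class plugged into the job. The sketch below routes records to reducers by the first byte of the key, which is a simplified stand-in for the sampled, trie-based range partitioner that TeraSort actually uses.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    /**
     * Simplified example of a user-defined partitioner: records are assigned to
     * reducers based on the first byte of the key, so the reducers' outputs are
     * globally ordered. TeraSort itself uses a sampled, trie-based partitioner.
     */
    public class FirstBytePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int firstByte = key.getLength() > 0 ? (key.getBytes()[0] & 0xFF) : 0;
            return firstByte * numPartitions / 256;   // map the byte range 0..255 onto the partitions
        }
    }
    // In the job driver it would be registered with: job.setPartitionerClass(FirstBytePartitioner.class);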
The Hadoop cluster configuration plays a crucial role in the maximum achievable performance. The best-performing Hadoop configuration performs similarly to, or only slightly better than, the PCJ implementation of the TeraSort algorithm. In contrast, almost no configuration changes were needed for the PCJ execution.
Previous studies [40, 41] show that the PCJ implementations of some benchmarks outperform the Hadoop implementations, even by a factor of 100. Other studies comparing PCJ with APGAS and Apache Spark [56] show that the PCJ implementation has the same performance as Apache Spark, and for some benchmarks it can be almost twice as efficient. However, not all of those benchmarks were well suited for MapReduce processing. The present study shows similar performance for Hadoop and PCJ while running a conventional, widely used benchmark for measuring the performance of Hadoop clusters. Similar performance results can be obtained by problem-specific solutions that utilize HDFS, which may suggest that the benchmark is too highly dependent on I/O.
The presented benchmarking results for the PCJ and Apache Hadoop implementations of the sorting algorithm were based on the time spent on each processing step and the total execution time. This paper does not deal with the algorithms in the wider context of real applications, where sorting is only one of many steps needed to obtain the desired results. Moreover, the work does not try to determine whether using the PCJ library is easier or harder than using Apache Hadoop, as that involves a personal judgement based on the user's knowledge and experience.
Conclusion
This paper described the sorting algorithm implementation based on the PCJ library and compared its performance with the Apache Hadoop TeraSort implementation.

The implementation using the PCJ library was presented in an iterative way that shows the possible performance problems and the ways to overcome them. The reported results of the final implementation show very good performance of the PCJ implementation of the TeraSort algorithm. The comparison of the TeraSort implementations indicates that PCJ performance is similar to Hadoop for a properly configured cluster and even higher on clusters with drawbacks in the configuration. Additionally, the source code written using PCJ is shorter in terms of physical and logical lines of code and more straightforward: for example, shuffling the data between threads is written directly into the code. Understanding the code does not require deep knowledge of the actions taken underneath by the background technology, as is needed for the Hadoop MapReduce framework, where improper partitioning in the Hadoop input format class can produce incorrect results. Moreover, PCJ can also benefit from using HDFS as a standalone resilient filesystem; the advantages of HDFS do not necessarily force the use of Hadoop MapReduce as the processing framework.
The PGAS programming model, represented here by the PCJ library, can be very competitive with the MapReduce model, not only in terms of possible performance gains but also in productivity. It provides a simple abstraction for writing a parallel application in terms of a single global memory address space and gives control over the data layout and its placement among processors. Moreover, the PGAS model is a general-purpose model that suits many problems, in contrast to the MapReduce model, where the problem has to be adapted to the model (cf. [9, 40, 57, 58]).

The plan for a future study is to compare sort algorithms implemented using Apache Spark and various PGAS model implementations. Future work should also investigate the performance of other sort algorithms.
Abbreviations
BFS: Breadth‑first search; CPU: Central processing unit; HDFS: Hadoop distributed file system; JVM: Java virtual machine;
PCJ: Parallel Computing in Java; PGAS: Partitioned global address space; RDMA: Remote direct memory access.
Acknowledgements
The work presented in this paper is a result of the HPI grants [47, 59, 60]. The author would like to thank the Future SOC Lab, Hasso Plattner Institute for Digital Engineering, for awarding access to the 1000 Core Cluster and for the support provided.
Authors’ contributions
The sole author did all programming and writing. The author read and approved the final manuscript.
Funding
No funding has been received for the conduct of this work and preparation of this manuscript.
Availability of data and materials
All the source codes are available at the websites listed in the References section. The processed data were generated using the teragen application from the publicly available Apache Hadoop package.
Competing interests
The author declares that he has no competing interests.
Received: 30 July 2020 Accepted: 1 November 2020
References
1. Hoare CA. Algorithm 65: find. Commun ACM. 1961;4(7):321–2.
2. Sun W, Ma Z. Count sort for GPU computing. In: 2009 15th international conference on parallel and distributed
systems. IEEE; 2009. p. 919–924.
3. Kolonias V, Voyiatzis AG, Goulas G, Housos E. Design and implementation of an efficient integer count sort in CUDA
GPUs. Concurr Comput. 2011;23(18):2365–81.
4. Merrill D, Grimshaw A. High Performance and Scalable Radix Sorting: a Case Study of Implementing Dynamic Paral‑
lelism for GPU Computing. Parallel Processing Letters. 2011;21(02):245–72.
5. Gogolińska A, Mikulski Ł, Piątkowski M. GPU Computations and Memory Access Model Based on Petri Nets. In:
Transactions on Petri Nets and Other Models of Concurrency XIII. Springer; 2018: 136–157.
6. Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM.
2008;51(1):107–13.
7. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, et al. Resilient distributed datasets: A fault‑tolerant
abstraction for in‑memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation. USENIX Association; 2012:2.
8. Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache Flink: stream and batch processing in a
single engine. Bull IEEE Comput Soc Tech Committ Data Eng. 2015;36:4.
9. Mishra P, Mishra M, Somani AK. Applications of Hadoop Ecosystems Tools NoSQL. New York: Chapman and Hall;
2017. p. 173–90.
10. PCJ homepage. https://pcj.icm.edu.pl. Accessed 26 Nov 2019.
11. Nowicki M, Górski Ł, Bała P. Evaluation of the parallel performance of the Java and PCJ on the Intel KNL based
systems. In: International conference on parallel processing and applied mathematics. 2017; p. 288–97.
12. Nowicki M, Górski Ł, Bała P. Performance evaluation of parallel computing and Big Data processing with Java and
PCJ library. Cray User Group. 2018.
13. Rakowski F, Karbowski J. Optimal synaptic signaling connectome for locomotory behavior in Caenorhabditis
elegans: design minimizing energy cost. PLoS Comput Biol. 2017;13(11):e1005834.
14. Górski Ł, Rakowski F, Bała P. Parallel differential evolution in the PGAS programming model implemented with PCJ
Java library. In: International conference on parallel processing and applied mathematics. Springer; 2015. p. 448–58.
15. Górski Ł, Bała P, Rakowski F. A case study of software load balancing policies implemented with the PGAS program‑
ming model. In: 2016 International conference on high performance computing simulation (HPCS); 2016. p. 443–8.
16. Nowicki M, Bzhalava D, Bała P. Massively parallel sequence alignment with BLAST through work distribution imple‑
mented using PCJ library. In: Ibrahim S, Choo KK, Yan Z, Pedrycz W, editors. International conference on algorithms
and architectures for parallel processing. Cham: Springer; 2017. p. 503–12.
17. Nowicki M, Bzhalava D, Bała P. Massively Parallel Implementation of Sequence Alignment with Basic Local Alignment
Search Tool using Parallel Computing in Java library. J Comput Biol. 2018;25(8):871–81.
18. Tampuu A, Bzhalava Z, Dillner J, Vicente R. ViraMiner: deep learning on raw DNA sequences for identifying viral
genomes in human samples. BioRxiv. 2019:602656.
19. Ryczkowska M, Nowicki M, Bała P. The performance evaluation of the Java implementation of Graph500. In:
Wyrzykowski R, Deelman E, Dongarra J, Karczewski K, Kitowski J, Wiatr K, editors. Parallel processing and applied
mathematics. Cham: Springer; 2016. p. 221–30.
20. Ryczkowska M, Nowicki M, Bała P. Level‑synchronous BFS algorithm implemented in Java using PCJ library. In: 2016
International conference on computational science and computational intelligence (CSCI). IEEE; 2016. p. 596–601.
21. Istrate R, Barkoutsos PK, Dolfi M, Staar PWJ, Bekas C. Exploring graph analytics with the PCJ toolbox. In: Wyrzykowski
R, Dongarra J, Deelman E, Karczewski K, editors. Parallel processing and applied mathematics. Cham: Springer Inter‑
national Publishing; 2018. p. 308–317.
22. Dong H, Zhou S, Grove D. X10‑enabled MapReduce. In: Proceedings of the fourth conference on partitioned global
address space programming model; 2010. p. 1–6.
23. Teijeiro C, Taboada GL, Tourino J, Doallo R. Design and implementation of MapReduce using the PGAS program‑
ming model with UPC. In: 2011 IEEE 17th international conference on parallel and distributed systems. IEEE; 2011. p.
196–203.
24. Aday S, Darkhan AZ, Madina M. PGAS approach to implement mapreduce framework based on UPC language. In:
International conference on parallel computing technologies. Springer; 2017. p. 342–50.
25. O'Malley O. TeraByte Sort on Apache Hadoop. Yahoo. http://sortbenchmark.org/YahooHadoop.pdf. 2008. p. 1–3.
26. Frazer WD, McKellar AC. Samplesort: a sampling approach to minimal storage tree sorting. JACM.
1970;17(3):496–507.
27. Almasi G. PGAS (Partitioned Global Address Space) Languages. In: Padua D, editor. Encyclopedia of parallel comput‑
ing. Boston: Springer; 2011. p. 1539–1545.
28. De Wael M, Marr S, De Fraine B, Van Cutsem T, De Meuter W. Partitioned Global Address Space languages. ACM
Comput Surv. 2015;47(4):62.
29. Culler DE, Dusseau A, Goldstein SC, Krishnamurthy A, Lumetta S, Von Eicken T, et al. Parallel programming in Split‑C.
In: Supercomputing’93. Proceedings of the 1993 ACM/IEEE conference on supercomputing. IEEE; 1993. p. 262–73.
30. Deitz SJ, Chamberlain BL, Hribar MB. Chapel: Cascade High‑Productivity Language. An overview of the chapel paral‑
lel programming model. Cray User Group. 2006.
31. Numrich RW, Reid J. Co‑array Fortran for parallel programming. In: ACM SIGPLAN Fortran Forum, vol. 17. ACM;
1998:1–31.
32. Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, et al. Titanium: a high‑performance Java dialect.
Concurr Comput. 1998;10(11–13):825–36.
33. Consortium U, et al. UPC Language Specifications v1.2. Ernest Orlando Lawrence Berkeley National Laboratory,
Berkeley, CA (US); 2005.
34. Charles P, Grothoff C, Saraswat V, Donawa C, Kielstra A, Ebcioglu K, et al. X10: an Object‑oriented approach to non‑
uniform cluster computing. In: ACM SIGPLAN Notices, vol. 40. ACM; 2005. p. 519–38.
35. Tardieu O. The APGAS library: resilient parallel and distributed programming in Java 8. In: Proceedings of the ACM
SIGPLAN workshop on X10; 2015. p. 25–6.
36. Dagum L, Menon R. OpenMP: an industry‑standard API for shared‑memory programming. Comput Sci Eng.
1998;1:46–55.
37. Clarke L, Glendinning I, Hempel R. The MPI message passing interface standard. In: Programming environments for
massively parallel distributed systems. Springer; 1994. p. 213–18.
38. Nowicki M, Ryczkowska M, Górski Ł, Szynkiewicz M, Bała P. PCJ‑a Java library for heterogenous parallel computing.
Recent Adv Inf Sci. 2016;36:66–72.
39. Nowicki M, Górski Ł, Bała P. PCJ–Java Library for Highly Scalable HPC and Big Data Processing. In: 2018 international
conference on high performance computing and simulation (HPCS). IEEE; 2018. p. 12–20.
40. Ryczkowska M, Nowicki M. Performance comparison of graph BFS implemented in MapReduce and PGAS program‑
ming models. In: International conference on parallel processing and applied mathematics. Springer; 2017. p.
328–37.
41. Nowicki M, Ryczkowska M, Górski Ł, Bała P. Big Data analytics in Java with PCJ library: performance comparison
with Hadoop. In: Wyrzykowski R, Dongarra J, Deelman E, Karczewski K, editors. International conference on parallel
processing and applied mathematics. Cham: Springer; 2017. p. 318–27.
42. Apache Hadoop TeraSort package. https://hadoop.apache.org/docs/r3.2.1/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed 26 Nov 2019.
43. Sahni S. Tries. In: Mehta DP, Sahni S, editors. Handbook of data structures and applications. New York: CRC; 2004.
44. Hadoop implementation of the TeraSort benchmark. https://github.com/apache/hadoop/tree/780d4f416e3cac3b9e8188c658c6c8438c6a865b/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/terasort. Accessed 10 Jan 2020.
45. AlDanial/cloc: cloc counts blank lines, comment lines, and physical lines of source code in many programming languages. https://github.com/AlDanial/cloc. Accessed 28 Jan 2020.
46. Artur Bosch / lloc - Logical Lines of Code. https://gitlab.com/arturbosch/lloc/tree/7f5efaf797d33a5eebb338c21637807571022fab. Accessed 28 Jan 2020.
47. Nowicki M. Benchmarking the Sort Algorithm on Ethernet Cluster. Technical Report. In: HPI Future SOC Lab: Pro‑
ceedings 2019 (in press).
48. Pasetto D, Akhriev A. A comparative study of parallel sort algorithms. In: Proceedings of the ACM international
conference companion on Object oriented programming systems languages and applications companion. ACM;
2011:203–204.
49. Arrays (Java SE 13 & JDK 13). https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/Arrays.html#sort(java.lang.Object%5B%5D). Accessed 07 Jul 2020.
50. Python timsort. http://svn.python.org/projects/python/trunk/Objects/listsort.txt. Accessed 07 Jul 2020.
51. McIlroy P. Optimistic sorting and information theoretic complexity. In: Proceedings of the fourth annual ACM‑SIAM
symposium on discrete algorithms; 1993. p. 467–74.
52. PCJ implementations of the TeraSort benchmark. https://github.com/hpdcj/PCJ-TeraSort/tree/a1c2cb339511e9bcd3befb892f82c522c7fbd1c3/src/main/java/pl/umk/mat/faramir/terasort. Accessed 01 July 2020.
53. Hortonworks Documentation: 11. Determine YARN and MapReduce memory configuration settings. https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html. Accessed 5 Nov 2020.
54. IBM Knowledge Center: Memory calculator worksheet. https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.dev.doc/doc/biga_caching_worksheet.html. Accessed 5 Nov 2020.
55. GraySort Benchmark. Sort Benchmark Home Page. http://sortbenchmark.org. Accessed 6 Oct 2020.
56. Posner J, Reitz L, Fohry C. Comparison of the HPC and big data Java libraries spark, PCJ and APGAS. In: 2018 IEEE/
ACM parallel applications workshop, alternatives To MPI (PAW‑ATM). IEEE; 2018. p. 11–22.
57. Menon RK, Bhat GP, Schatz MC. Rapid Parallel Genome Indexing with MapReduce. In: Proceedings of the second
international workshop on MapReduce and its applications; 2011. p. 51–8.
58. Wodo O, Zola J, Pokuri BSS, Du P, Ganapathysubramanian B. Automated, high throughput exploration of process‑
structure‑property relationships using the MapReduce paradigm. Mater Disc. 2015;1:21–8.
59. Nowicki M. Benchmarking Java on Ethernet Cluster. Technical Report. In: HPI Future SOC Lab: Proceedings 2019 (in
press).
60. Nowicki M. Benchmarking the TeraSort algorithm on Ethernet Cluster. Technical Report. In: HPI Future SOC Lab:
Proceedings 2020 (in press).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.