
... A good example is a communication-intensive graph search from the Graph500 test suite. The PCJ implementation scales well and outperforms the Hadoop implementation by a factor of 100 [5], although not every benchmark is well suited to Hadoop processing. Paper [6] compares the PCJ library and Apache Hadoop using a conventional, widely used benchmark for measuring the performance of Hadoop clusters, and shows that the performance of applications developed with the PCJ library is similar to, or even better than, the Apache Hadoop solution. ...
... While the C++ code scales ideally, its poor performance in absolute terms can be traced back to the implementation of line tokenizing. All the codes (PCJ, APGAS, C++), in line with our earlier works [5], consistently use regular expressions for this task. ...
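As a rough illustration of the tokenizing in question, the sketch below contrasts regex-based splitting with a hand-rolled scan over a fixed delimiter. This is a generic Java sketch with illustrative names, not code from the compared implementations; precompiling the pattern already avoids per-line compilation, yet the regex engine typically remains the costlier path.

```java
import java.util.regex.Pattern;

public class LineTokenizing {
    // Precompiled once; java.util.regex still pays per-line engine overhead.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Regex-based tokenizing, as the compared codes are described to do.
    static String[] tokenizeWithRegex(String line) {
        return WHITESPACE.split(line.trim());
    }

    // A cheaper hand-rolled alternative for a single fixed delimiter.
    static String[] tokenizeByChar(String line, char sep) {
        java.util.List<String> parts = new java.util.ArrayList<>();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == sep) {
                if (i > start) parts.add(line.substring(start, i));
                start = i + 1;
            }
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String line = "42\tfoo bar baz";
        System.out.println(tokenizeWithRegex(line).length);   // 4 whitespace tokens
        System.out.println(tokenizeByChar(line, '\t').length); // 2 tab-separated fields
    }
}
```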
... One should note that a different set of results obtained on the Hadoop cluster shows that the PCJ implementation is at least 3 times faster than the Hadoop [5] and Spark [25] implementations. ...
Article
Full-text available
With the development of peta- and exascale computational systems there is growing interest in running Big Data and Artificial Intelligence (AI) applications on them. Big Data and AI applications are implemented in Java, Scala, Python and other languages that are not widely used in High-Performance Computing (HPC), which is still dominated by C and Fortran. Moreover, they are based on dedicated environments such as Hadoop or Spark which are difficult to integrate with the traditional HPC management systems. We have developed the Parallel Computing in Java (PCJ) library, a tool for scalable high-performance computing and Big Data processing in Java. In this paper, we present the basic functionality of the PCJ library with examples of highly scalable applications running on large-scale resources. The performance results are presented for different classes of applications, including traditional compute-intensive (HPC) workloads (e.g. stencil), as well as communication-intensive algorithms such as the Fast Fourier Transform (FFT). We present implementation details and performance results for Big Data type processing running on petascale-size systems. Examples of large-scale AI workloads parallelized using PCJ are presented.
... We have compared the scalability of the PCJ implementation with Hadoop. It turns out that the version based on the PCJ library scales well and outperforms the Hadoop implementation by a factor of 100 [37,38]. However, one could argue that the implemented problem was not well-suited for Hadoop processing. ...
... Results presented hereinafter use a simple serial reduction scheme. Performance can be enhanced with the use of a hypercube-based reduction, as presented in our other works [37,35]. Full source code and sample input data are available on GitHub [53]. ...
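To make the hypercube scheme concrete: it combines partial results pairwise along the dimensions of a hypercube, so a sum over n threads takes log2(n) rounds instead of the n-1 messages a serial reduction funnels into one thread. Below is a minimal sketch written against the open-source PCJ 5 API; the class, variable, and host names are our own, a power-of-two thread count is assumed, and the synchronous put is assumed to complete before the barrier. This is an illustration, not the authors' actual code.

```java
import org.pcj.PCJ;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

// Sketch only: hypercube sum reduction over a power-of-two number of PCJ
// threads. Each round halves the set of active threads; after log2(n)
// rounds thread 0 holds the global sum.
@RegisterStorage(HypercubeSum.Shared.class)
public class HypercubeSum implements StartPoint {

    @Storage(HypercubeSum.class)
    enum Shared { inbox }
    double[] inbox = new double[32];            // one slot per round, avoids races

    @Override
    public void main() {
        double value = PCJ.myId() + 1.0;        // stand-in for a local partial result
        for (int step = 1; step < PCJ.threadCount(); step <<= 1) {
            int round = Integer.numberOfTrailingZeros(step);
            boolean active = (PCJ.myId() & (step - 1)) == 0;
            if (active && (PCJ.myId() & step) != 0) {
                // upper half of each pair ships its value to the lower half
                PCJ.put(value, PCJ.myId() ^ step, Shared.inbox, round);
            }
            PCJ.barrier();                      // every thread closes the round
            if (active && (PCJ.myId() & step) == 0) {
                value += inbox[round];          // read the locally stored slot
            }
        }
        if (PCJ.myId() == 0) {
            System.out.println("sum = " + value);   // n(n+1)/2 for this stand-in
        }
    }

    public static void main(String[] args) {
        PCJ.executionBuilder(HypercubeSum.class)
           .addNodes(new String[]{"localhost", "localhost",
                                  "localhost", "localhost"})
           .start();
    }
}
```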
Article
Large-scale computing and data processing with cloud resources is gaining popularity. However, the usage of the cloud differs from traditional high-performance computing (HPC) systems, and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (Parallel Computing in Java), a novel tool for scalable HPC and big data processing in Java. In this article, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on examples of highly scalable applications of different characteristics, focusing on CPU, communication or I/O. They run on a traditional HPC system, the Amazon Web Services cloud, and the Linaro Developer Cloud. For the clouds, we have used Intel x86 and ARM processors, running the Java codes without changing a single line of program code and without the need for time-consuming recompilation. The presented applications have been parallelized using the partitioned global address space programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and for running them on the cloud with performance close to that of HPC systems.
... A good example is a communication-intensive graph search from the Graph500 test suite. The PCJ implementation scales well and outperforms the Hadoop implementation by a factor of 100 [28,34]. The PCJ library was also used to develop code for an evolutionary algorithm which has been used to find the minimum of a simple function as defined in the CEC'14 Benchmark Suite [17]. ...
... The results presented hereinafter use a simple serial reduction scheme. For better results, a hypercube-based reduction can be used, as presented in our other works [28]. Full source code and sample input data are available on GitHub [12]. ...
Chapter
Cloud resources are increasingly used for large-scale computing and data processing. However, the usage of the cloud differs from traditional High-Performance Computing (HPC) systems, and both algorithms and codes have to be adjusted. This work is often time-consuming and performance is not guaranteed. To address this problem we have developed the PCJ library (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. In this paper, we present a performance evaluation of parallel applications implemented in Java using the PCJ library. The performance evaluation is based on examples of highly scalable applications that run on a traditional HPC system and the Amazon AWS Cloud. For the cloud, we have used Intel x86 and ARM processors, running the Java codes without changing a single line of program code and without the need for time-consuming recompilation. The presented applications have been parallelized using the PGAS programming model and its realization in the PCJ library. Our results prove that the PCJ library, due to its performance and ability to create simple portable code, has great promise to be successful for the parallelization of various applications and for running them on the cloud with performance similar to that of HPC systems.
... There are papers [40,41] that show that the PCJ implementation of some benchmarks scales very well and outperforms the Hadoop implementation, even by a factor of 100. One of the benchmarks calculates an approximation of the value of π applying the quasi-Monte Carlo method. The TeraSort is one of the widely used benchmarks for Hadoop. ...
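For readers unfamiliar with the approach, the following self-contained sketch estimates π with a quasi-Monte Carlo quarter-circle test, using a 2-D Halton low-discrepancy sequence in place of pseudo-random points. It is a generic illustration of the method, not the cited benchmark's code.

```java
public class QuasiMonteCarloPi {

    // Van der Corput radical-inverse sequence in the given base: the classic
    // building block of a 2-D Halton low-discrepancy sequence.
    static double radicalInverse(long n, int base) {
        double result = 0.0, digitValue = 1.0 / base;
        while (n > 0) {
            result += (n % base) * digitValue;
            n /= base;
            digitValue /= base;
        }
        return result;
    }

    public static void main(String[] args) {
        long samples = 10_000_000, inside = 0;
        for (long i = 1; i <= samples; i++) {
            double x = radicalInverse(i, 2);    // Halton dimension 1
            double y = radicalInverse(i, 3);    // Halton dimension 2
            if (x * x + y * y <= 1.0) {
                inside++;                       // point falls in the quarter circle
            }
        }
        // The quarter circle covers pi/4 of the unit square.
        System.out.printf("pi ~ %.8f%n", 4.0 * inside / samples);
    }
}
```

In a PGAS parallelization, each thread would evaluate a disjoint slice of the sequence, and the per-thread hit counts would be combined with a reduction such as the hypercube scheme sketched earlier.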
... Previous studies [40,41] show that the PCJ implementation of some benchmarks outperforms the Hadoop implementation, even by a factor of 100. Other studies that compare PCJ with APGAS and Apache Spark [56] show that the PCJ implementation has the same performance as Apache Spark, and for some benchmarks it can be almost 2 times more efficient. ...
Article
Full-text available
Sorting algorithms are among the most commonly used algorithms in computer science and modern software. Having an efficient implementation of sorting is necessary for a wide spectrum of scientific applications. This paper describes a sorting algorithm written using the partitioned global address space (PGAS) model and implemented using the Parallel Computing in Java (PCJ) library. An iterative description of the implementation is used to outline possible performance issues and the means to resolve them. The key idea of the implementation is to provide an efficient building block that can be easily integrated into many application codes. This paper also presents a performance comparison of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison serves to show that the performance of the implementation is good enough: the PCJ implementation shows efficiency similar to that of the Hadoop implementation.
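The paper's actual algorithm is not reproduced in this abstract; as a generic illustration of the splitter-based bucketing idea behind TeraSort-style distributed sorts, the following single-process Java sketch samples the input, derives splitters, routes records to buckets, and sorts each bucket independently. In a distributed PGAS setting each bucket belongs to one thread and the routing step becomes communication; all names here are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative splitter-based (sample sort) bucketing; a single-process
// stand-in for the general idea, not the paper's code.
public class SampleSortSketch {
    public static void main(String[] args) {
        int buckets = 4;                       // stands in for the thread count
        long[] data = new Random(42).longs(1_000_000).toArray();

        // 1. Take a small sample (the data is random, so a prefix suffices
        //    here) and derive buckets-1 splitters from the sorted sample.
        long[] sample = Arrays.copyOf(data, 1024);
        Arrays.sort(sample);
        long[] splitters = new long[buckets - 1];
        for (int b = 1; b < buckets; b++) {
            splitters[b - 1] = sample[b * sample.length / buckets];
        }

        // 2. Route each record to its bucket (communication in a PGAS version).
        List<List<Long>> parts = new ArrayList<>();
        for (int b = 0; b < buckets; b++) parts.add(new ArrayList<>());
        for (long v : data) {
            int pos = Arrays.binarySearch(splitters, v);
            parts.get(pos >= 0 ? pos : -pos - 1).add(v);
        }

        // 3. Sort each bucket locally; bucket order gives the global order.
        for (int b = 0; b < buckets; b++) {
            parts.get(b).sort(null);
            System.out.println("bucket " + b + ": " + parts.get(b).size() + " records");
        }
    }
}
```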
... The PCJ library [5] implements the PGAS model in Java and is available as an open-source repository [19]. PCJ is targeted at large-scale HPC applications, but was also observed to be suitable for big data applications [20]. The library is shipped as a single jar file without dependencies on other libraries. ...
... Previous work by Bała, Nowicki et al. [20,11] compared PCJ to Apache Hadoop. These authors report that PCJ is easier to use than Hadoop and that PCJ programs are 5 to 500 times faster. ...
Article
Full-text available
Objectives: The radical growth of brain MRI data demands faster and more accurate processing. To meet these demands, it is necessary to develop a design on a cloud platform using distributed platforms. Methods/Analysis: In this paper, we introduce an architecture developed for the cloud using Apache Hadoop to segment brain MRI images. The scanned MRI images are uploaded to the system in the public cloud through either a web interface or a mobile app. The Parallel Genetic Algorithm (PGA) in the cloud system, enabled with Hadoop or Spark, is used to segment the given MRI images. Findings: Processing times are reported for data sizes varying from 2 GB to 10 GB on clusters of one to five nodes. The process has been implemented in both Apache Hadoop and Apache Spark. The time ranges from approximately 12 to 24 seconds in Hadoop, whereas it comes down to 4 to 7 seconds in Spark. First of all, the results show that cloud-platform applications outperform network-based applications for medical image processing. Novelty/Improvement: Distributed platforms have been used in a cloud environment for brain MRI segmentation using a Parallel Genetic Algorithm.
Article
Full-text available
Solid State Drives (SSDs) were initially developed as faster storage devices intended to replace conventional magnetic Hard Disk Drives (HDDs). However, high computational capabilities enable SSDs to be computing nodes, not just faster storage devices. Such capability is generally called "In-Storage Computing" (ISC). Today's Hadoop MapReduce framework has become a de facto standard for big data processing. This paper explores In-Storage Computing challenges and opportunities for the Hadoop MapReduce framework. For this, we integrate a Hadoop MapReduce system with ISC SSD devices that implement the Hadoop Mapper inside real SSD firmware. This offloads Map tasks from the host MapReduce system to the ISC SSDs. We additionally optimize the host Hadoop system to make the best use of our proposed ISC Hadoop system. Experimental results demonstrate that our ISC Hadoop MapReduce system achieves a remarkable performance gain (2.3× faster) as well as significant energy savings (11.5× lower) compared to a typical Hadoop MapReduce system. Further, the experiments suggest such ISC-augmented systems can provide a very promising computing model in terms of system scalability.
Article
Full-text available
The age of big data is now upon us, but traditional data analytics may not be able to handle such large quantities of data. The question that arises now is how to develop a high-performance platform to efficiently analyze big data, and how to design an appropriate mining algorithm to find useful things in big data. To discuss this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented for the next steps of big data analytics.
Article
Full-text available
Graph500 is a new benchmark that ranks supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other techniques such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth-First Search) of a large graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and the reference implementations on a large distributed-memory supercomputer with a Fat-Tree-based InfiniBand network.
Conference Paper
Full-text available
MapReduce is emerging as an important programming model for large-scale parallel applications. Meanwhile, Hadoop is an open-source implementation of MapReduce enjoying wide popularity for developing data-intensive applications in the cloud. As the computing unit in the cloud is virtual machine (VM) based, it is feasible to demonstrate the applicability of MapReduce on a virtualized data center. Although the potential for poor performance and heavy load no doubt exists, virtual machines can instead be used to fully utilize system resources, ease the management of such systems, improve reliability, and save power. In this paper, a series of experiments is conducted to measure and analyze the performance of Hadoop on VMs. Our experiments are used as a basis for outlining several issues that will need to be considered when implementing MapReduce to fit completely in the cloud.
Article
Full-text available
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed-memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse-matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
Conference Paper
Graph processing is used in many fields of science such as sociology, risk prediction or biology. Although the analysis of graphs is important, it also poses numerous challenges, especially for large graphs which have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous BFS (Breadth-First Search) algorithm and its implementation written in Java. Java so far is not extensively used in high-performance computing, but because of its popularity, portability, and increasing capabilities it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computing in Java) library. In this paper, we present implementation details and compare the scalability and performance with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with the MPI code written in C. We present the challenges we faced and the optimizations used in our implementation that were necessary to obtain good performance.
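The paper's distributed code is not shown here, but the level-synchronous structure itself fits in a few lines of plain Java. In the PGAS version each thread owns a slice of the vertices, expands its part of the frontier, sends newly discovered remote vertices to their owners, and synchronizes with the other threads before the next level; the sketch below is the single-threaded skeleton of that loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Level-synchronous BFS skeleton (single-threaded stand-in; in the PGAS code
// each thread owns a vertex partition and a barrier separates the levels).
public class LevelSynchronousBfs {

    /** Returns the BFS level of every vertex, or -1 if unreachable. */
    static int[] bfs(int[][] adjacency, int source) {
        int[] level = new int[adjacency.length];
        Arrays.fill(level, -1);
        level[source] = 0;

        List<Integer> frontier = List.of(source);
        for (int depth = 1; !frontier.isEmpty(); depth++) {
            List<Integer> next = new ArrayList<>();
            for (int v : frontier) {                 // expand the current level
                for (int w : adjacency[v]) {
                    if (level[w] == -1) {            // first visit decides the level
                        level[w] = depth;
                        next.add(w);
                    }
                }
            }
            frontier = next;                         // barrier point in the PGAS version
        }
        return level;
    }

    public static void main(String[] args) {
        int[][] adjacency = {
                {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}
        };
        System.out.println(Arrays.toString(bfs(adjacency, 0)));  // [0, 1, 1, 2, 3]
    }
}
```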
Conference Paper
The most popular Big Data processing frameworks today are Hadoop MapReduce and Spark. The Hadoop Distributed File System (HDFS) is the primary storage for these frameworks. Big Data frameworks like Hadoop MapReduce and Spark launch tasks based on data locality. In the presence of heterogeneous storage devices, when different nodes have different storage characteristics, locality-aware data access alone cannot always guarantee optimal performance. Rather, the storage type becomes important, especially when high-performance SSD and in-memory storage devices along with high-performance interconnects are available. Therefore, in this paper, we propose efficient data access strategies (e.g. Greedy, which prioritizes storage type over locality, and Hybrid, which balances the load between locality and high-performance storage) for Hadoop and Spark, considering both data locality and storage types. We redesign HDFS to accommodate the enhanced access strategies. Our evaluations show that the proposed data access strategies can improve the read performance of HDFS by up to 33% compared to the default locality-aware data access. The execution times of Hadoop and Spark Sort are also reduced by up to 32% and 17%, respectively. The performance of Hadoop and Spark TeraSort is also improved by up to 11% through our design.
Conference Paper
Many high-performance computing (HPC) sites extend their clusters to support Hadoop MapReduce for a variety of applications. However, an HPC cluster differs from a Hadoop cluster in the configuration of storage resources. In the Hadoop Distributed File System (HDFS), data resides on the compute nodes, while in an HPC cluster, data is stored on separate nodes dedicated to storage. Dedicated storage offloads the I/O load from the compute nodes and provides more powerful storage. Local storage provides better locality and avoids contention for shared storage resources. To gain insight into the two platforms, in this paper we investigate the performance and resource utilization of different types (i.e., I/O-intensive, data-intensive and CPU-intensive) of applications on HPC-based Hadoop platforms with local storage and dedicated storage. We find that I/O-intensive and data-intensive applications with large input files benefit more from dedicated storage, while such applications with small input files benefit more from local storage. CPU-intensive applications with a large number of small input files benefit more from local storage, while those with large input files benefit approximately equally from the two platforms. We verify our findings by trace-driven experiments on different types of jobs from the Facebook synthesized trace. This work provides guidance on choosing the best platform to optimize the performance of different types of applications and reduce system overhead.
Article
Scale-up machines perform better for jobs with small and medium (KB, MB) data sizes, while scale-out machines perform better for jobs with large (GB, TB) data sizes. Since a workload usually consists of jobs with different data size levels, we propose building a hybrid Hadoop architecture that includes both scale-up and scale-out machines, which however is not trivial. The first challenge is workload data storage. Thousands of small jobs in a workload may overload the limited local disks of scale-up machines, and jobs from scale-up and scale-out machines may both request the same set of data, which leads to data transmission between the machines. The second challenge is to automatically schedule jobs to either the scale-up or the scale-out cluster to achieve the best performance. We conduct a thorough performance measurement of different applications on scale-up and scale-out clusters, configured with the Hadoop Distributed File System (HDFS) and a remote file system (i.e., OFS), respectively. We find that using OFS rather than HDFS solves the data storage challenge. We also identify the factors that determine the performance differences on the scale-up and scale-out clusters and their crossover points, which guide the choice. Accordingly, we design and implement the hybrid scale-up/out Hadoop architecture. Our trace-driven experimental results show that our hybrid architecture outperforms the traditional Hadoop architecture with both HDFS and OFS in terms of job completion time, throughput and job failure rate.
Article
With the arrival of the "big data" era, data processing platforms like Hadoop were born at the right moment. But Hadoop's storage layer, the Hadoop Distributed File System (HDFS), has a great weakness in the storage of numerous small files. Storing numerous small files increases the load on the entire cluster and reduces efficiency. However, datasets such as genomic and clinical data that enable researchers to perform analytics in healthcare are stored as large collections of small files. To address this defect, small files are generally merged and the large file is stored after merging. But previous methods have not exploited the size distribution of the files and thus have not further improved the effectiveness of merging. This article proposes a method for merging small files based on balancing data blocks, which optimizes the volume distribution of the merged files, effectively reduces the number of HDFS data blocks, and thereby reduces the memory overhead of the cluster's master nodes and reduces load, achieving high-efficiency data processing.
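The article's specific balancing algorithm is not detailed in this abstract; the general idea of packing small files into block-sized bundles can be illustrated with a first-fit-decreasing sketch. All names and the 128 MB block size are illustrative assumptions, not the article's exact method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative first-fit-decreasing packing of small files into
// HDFS-block-sized bundles; a generic sketch, not the article's algorithm.
public class SmallFilePacker {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // default HDFS block size

    record FileInfo(String name, long size) {}

    static List<List<FileInfo>> pack(List<FileInfo> files) {
        List<List<FileInfo>> bundles = new ArrayList<>();
        List<Long> free = new ArrayList<>();             // remaining space per bundle
        // Placing the largest files first keeps the bundle volumes balanced.
        files.sort(Comparator.comparingLong(FileInfo::size).reversed());
        for (FileInfo f : files) {
            int target = -1;
            for (int i = 0; i < bundles.size(); i++) {
                if (free.get(i) >= f.size()) { target = i; break; }
            }
            if (target < 0) {                            // open a new bundle
                bundles.add(new ArrayList<>());
                free.add(BLOCK_SIZE);
                target = bundles.size() - 1;
            }
            bundles.get(target).add(f);
            free.set(target, free.get(target) - f.size());
        }
        return bundles;                                  // each bundle -> one merged file
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>(List.of(
                new FileInfo("a.vcf", 90L << 20), new FileInfo("b.vcf", 60L << 20),
                new FileInfo("c.vcf", 40L << 20), new FileInfo("d.vcf", 30L << 20)));
        pack(files).forEach(System.out::println);        // two ~full bundles
    }
}
```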
Chapter
This paper presents the application of the PCJ library for the parallelization of selected HPC applications implemented in the Java language. The library is motivated by the partitioned global address space (PGAS) model represented by Co-Array Fortran, Unified Parallel C, X10 or Titanium. In PCJ, each task has its own local memory and stores and accesses variables locally. Variables can be shared between tasks and can be accessed, read and modified by other tasks. The library provides methods to perform basic operations like synchronization of tasks, and getting and putting values in an asynchronous, one-sided way. Additionally, the library offers methods for creating groups of tasks, broadcasting, and monitoring variables. PCJ has the ability to work on multinode, multicore systems, hiding the details of inter- and intranode communication. The PCJ library fully complies with Java standards, therefore the programmer does not have to use additional libraries which are not part of the standard Java distribution. In this paper, the PCJ library has been used to run example HPC applications on multicore nodes. In particular, we present performance results for parallel raytracing, matrix multiplication and map-reduce calculations. Detailed information on the performance of the reduction operation is also presented. The results show good performance and scalability compared to native implementations of the same algorithms; in particular, MPI C++ and Java 8 parallel streams have been used as a reference. It is noteworthy that the PCJ library, due to its performance and ability to create simple code, has great promise to be successful for the parallelization of HPC applications.
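As a concrete illustration of the operations listed above, here is a minimal complete PCJ program, sketched against the open-source PCJ 5 API; the class and variable names are our own. Thread 0 broadcasts a value, every thread reads a neighbour's copy with an asynchronous one-sided get, and a barrier closes the exchange.

```java
import org.pcj.PCJ;
import org.pcj.PcjFuture;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

// Minimal PCJ example sketched against the PCJ 5 API (illustrative names).
@RegisterStorage(PcjSketch.Shared.class)
public class PcjSketch implements StartPoint {

    @Storage(PcjSketch.class)
    enum Shared { value }          // names the shared field below
    int value;

    @Override
    public void main() {
        if (PCJ.myId() == 0) {
            PCJ.broadcast(42, Shared.value);        // one-sided broadcast from thread 0
        }
        PCJ.waitFor(Shared.value);                  // monitor-style wait for the update

        int neighbour = (PCJ.myId() + 1) % PCJ.threadCount();
        PcjFuture<Integer> future =
                PCJ.asyncGet(neighbour, Shared.value);  // asynchronous one-sided get
        System.out.println("Thread " + PCJ.myId()
                + "/" + PCJ.threadCount() + " read " + future.get());

        PCJ.barrier();                              // synchronize all tasks
    }

    public static void main(String[] args) {
        // Four PCJ threads on the local node; on a cluster the list would
        // name the real hosts (or come from a node file).
        PCJ.executionBuilder(PcjSketch.class)
           .addNodes(new String[]{"localhost", "localhost",
                                  "localhost", "localhost"})
           .start();
    }
}
```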
Article
In the Orange File System, large data files are striped across multiple servers to provide highly concurrent access; however, the contents of large directories are stored only on a single server, which is becoming a bottleneck in handling a large number of requests accessing the same directory concurrently. In this paper, a scalable distributed directory for the Orange File System is implemented and evaluated on a large-scale system. The throughput performance is measured with a modified version of the UCAR metarates benchmark. The results show great scalability in concurrently creating and removing large numbers of files under one directory by multiple clients. On a 64-server setup with 128 clients accessing the same directory concurrently, the scalable distributed directory can achieve more than 8,000 file creations per second and over 11,000 file removals per second on average.
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Article
As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters. In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both...
PCJ – Java library for high performance computing in PGAS model
  • M Nowicki
  • L Górski
  • P Grabarczyk
  • P Bała
M. Nowicki, L. Górski, P. Grabarczyk, P. Bała. PCJ – Java library for high performance computing in PGAS model. In: W. W. Smari and V. Zeljkovic (Eds.), 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014, pp. 202-209.
Parallel breadth-first search on distributed memory systems
  • A Buluc
  • K Madduri
A. Buluc, K. Madduri. Parallel breadth-first search on distributed memory systems. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.