ABSTRACT: The rising interest in Java for High Performance Computing (HPC) is based on the appealing features of this language for programming multi-core cluster architectures, particularly the built-in networking and multithreading support, and the continuous increase in Java Virtual Machine (JVM) performance. However, its adoption in this area is being delayed by the lack of analysis of the existing programming options in Java for HPC, the lack of thorough and up-to-date evaluations of their performance, and the lack of awareness of current research projects in this field, whose solutions are needed to boost the adoption of Java in HPC. This paper analyzes the current state of Java for HPC, both for shared and distributed memory programming, presents related research projects, and finally evaluates the performance of current Java HPC solutions and research developments on two shared memory environments and two InfiniBand multi-core clusters. The main conclusions are that: (1) the significant interest in Java for HPC has led to the development of numerous projects, although usually quite modest ones, which may have prevented a wider adoption of Java in this field; (2) Java can achieve performance close to that of natively compiled languages, both for sequential and parallel applications, making it an alternative for HPC programming; (3) the recent advances in the efficient support of Java communications on shared memory and low-latency networks are bridging the gap between Java and natively compiled applications in HPC. Thus, the good prospects of Java in this area are attracting the attention of both industry and academia, which can take significant advantage of Java adoption in HPC.
Highlights:
► Java is an emerging option for High Performance Computing on multi-core clusters.
► Java can achieve performance close to that of natively compiled languages.
► Current state of Java for HPC, both for shared and distributed memory programming.
► Performance evaluation of Java message-passing and Java threads on multi-core clusters.
► Java is an alternative for HPC programming as it obtains comparable scalability results.
Science of Computer Programming 05/2013; · 0.57 Impact Factor
ABSTRACT: The simulation of particle dynamics is among the most important mechanisms for studying the behavior of molecules in a medium under specific conditions of temperature and density. Several models can be used to efficiently compute the forces that act on each particle, as well as the interactions between them. This work presents the design and implementation of a parallel simulation code for the Brownian motion of particles in a fluid. Two different parallelization approaches have been followed: (1) traditional distributed memory message-passing programming with MPI, and (2) the Partitioned Global Address Space (PGAS) programming model, oriented towards hybrid shared/distributed memory systems, with the Unified Parallel C (UPC) language. Different techniques for domain decomposition and work distribution are analyzed in terms of efficiency and programmability in order to select the most suitable strategy. Performance results on a supercomputer using up to 2048 cores are also presented for both the MPI and UPC codes.
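The domain decomposition step mentioned above can be illustrated with a simple 1-D block distribution of particles among processes. This is only a hedged sketch of the general technique; the class and method names are hypothetical, and the paper's actual decomposition strategies may differ.

```java
// Sketch: 1-D block decomposition of N particles over P processes.
// All names are illustrative; real MPI/UPC codes may distribute work differently.
public class BlockDecomp {
    // Returns {first, count} for the slice of particles owned by `rank`.
    static int[] slice(int rank, int nprocs, int nParticles) {
        int base = nParticles / nprocs;   // minimum particles per rank
        int rem  = nParticles % nprocs;   // first `rem` ranks get one extra
        int count = base + (rank < rem ? 1 : 0);
        int first = rank * base + Math.min(rank, rem);
        return new int[] {first, count};
    }

    public static void main(String[] args) {
        int n = 10, p = 4;
        for (int r = 0; r < p; r++) {
            int[] s = slice(r, p, n);
            // prints half-open ranges [0,3) [3,6) [6,8) [8,10)
            System.out.println("rank " + r + ": [" + s[0] + ", " + (s[0] + s[1]) + ")");
        }
    }
}
```

Each rank then integrates only its own slice, exchanging boundary information with neighbors as needed.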
ABSTRACT: The scalability of High Performance Computing (HPC) applications depends heavily on the efficient support of network communications in virtualized environments. However, Infrastructure as a Service (IaaS) providers are more focused on deploying systems with higher computational power interconnected via high-speed networks than on improving the scalability of the communication middleware. This paper analyzes the main performance bottlenecks in HPC application scalability on the Amazon EC2 Cluster Compute platform: (1) evaluating the communication performance on shared memory and a virtualized 10 Gigabit Ethernet network; (2) assessing the scalability of representative HPC codes, the NAS Parallel Benchmarks, using a significant number of cores (up to 512); (3) analyzing the new cluster instances (CC2) in terms of single-instance performance, scalability, and cost-efficiency; (4) suggesting techniques for reducing the impact of the virtualization overhead on the scalability of communication-intensive HPC codes, such as giving the Virtual Machine direct access to the network and reducing the number of processes per instance; and (5) proposing the combination of message-passing with multithreading as the most scalable and cost-effective option for running HPC applications on the Amazon EC2 Cluster Compute platform.
Future Generation Computer Systems 01/2013; 29(1):218–229. · 1.86 Impact Factor
ABSTRACT: Land abandonment and the stagnation of rural markets have in recent years become major concerns of rural administrations. The use of Web and GIS (Geographic Information System) technologies can help to mitigate the effects of these problems. This paper proposes a novel Web-GIS tool with spatial capabilities for the dynamization of rural land markets by encouraging the transfer of land from owners to farmers through the leasing of plots. The system, based on open source software, offers information about the properties, their environment and their owners. It uses standards for handling the geographic information and for communicating with external data sources. This system was used as the basis for the development of SITEGAL, the tool for the management of the Land Bank of Galicia (www.bantegal.com/sitegal). SITEGAL has been operational since 2007, benefiting both the administration and its users (farmers and land owners) and promoting e-Government.
ABSTRACT: Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Until now, the tool considered network bandwidth to be constant and independent of the communication pattern. Nevertheless, on modern large supercomputers the inter-node communication bandwidth decreases depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer, along with some examples of how real parallel codes can be optimized by using the information about network degradation.
ABSTRACT: The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Rather, multicore processors arose due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of the domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments evaluating the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
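As a hedged illustration of the kernel-centric idea, the sketch below shows two syntactic variants of the same domain-independent reduction kernel: an explicit sequential loop, and the parallel form an automatic parallelizer could emit once the kernel is recognized. The class and method names are hypothetical and do not come from the paper.

```java
import java.util.stream.IntStream;

// Two surface forms of one "reduction" kernel. A kernel-centric IR would
// map both to the same parallel reduction, regardless of syntax.
public class ReductionKernel {
    // Variant 1: the sequential source as a programmer might write it.
    static int sumLoop(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // Variant 2: the same reduction as a parallel stream, the kind of
    // multicore-ready code a parallelizing tool could generate.
    static int sumParallel(int[] a) {
        return IntStream.of(a).parallel().sum();
    }

    public static void main(String[] args) {
        int[] a = IntStream.rangeClosed(1, 1000).toArray();
        // both variants compute the same value: 500500
        System.out.println(sumLoop(a) + " == " + sumParallel(a));
    }
}
```

Recognizing the kernel rather than the syntax is what lets such a tool handle pointer-based, array-based, or control-flow-obscured versions uniformly.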
ABSTRACT: To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping them with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show that communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
24th ACM/IEEE Conference on Supercomputing; 11/2012
ABSTRACT: Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proven to accurately detect cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, the impact of using this information on application performance optimization had not yet been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.
ABSTRACT: This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ remedies this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries by up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance by up to seven times.
Keywords: Message-Passing in Java (MPJ) – Scalable parallel systems – Communication middleware – Scalable collective communication – High-Performance Computing – Performance evaluation
The Journal of Supercomputing 01/2012; 60(1):117-140. · 0.92 Impact Factor
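The communication/computation overlap enabled by non-blocking primitives follows a common three-step pattern: initiate the transfer, compute on data that does not depend on it, then wait for completion. The sketch below illustrates that pattern with a pure-Java stand-in (CompletableFuture) rather than F-MPJ's actual primitives; all names and the simulated transfer are illustrative assumptions.

```java
import java.util.concurrent.CompletableFuture;

// Overlap pattern sketch: CompletableFuture plays the role that
// non-blocking Isend/Irecv + Wait primitives play in MPJ middleware.
public class OverlapSketch {
    // Stand-in for an asynchronous halo exchange (illustrative only).
    static int simulatedTransfer(int[] halo) {
        int sum = 0;
        for (int v : halo) sum += v;
        return sum;
    }

    static long run() {
        int[] halo = {1, 2, 3};
        // 1) initiate the "communication" asynchronously
        CompletableFuture<Integer> pending =
                CompletableFuture.supplyAsync(() -> simulatedTransfer(halo));
        // 2) overlap: compute on data independent of the transfer
        long local = 0;
        for (int i = 1; i <= 1000; i++) local += i;
        // 3) wait for completion (the Wait() analogue) and combine
        int remote = pending.join();
        return local + remote;
    }

    public static void main(String[] args) {
        System.out.println(run()); // 500506
    }
}
```

The scalability benefit comes from step 2: the longer the independent computation, the more of the transfer latency is hidden.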
ABSTRACT: This paper presents ibvdev, a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments are aggregating a significant number of cores through advanced network technologies, such as InfiniBand, increasing the complexity of communication protocols, especially when dealing with hybrid shared/distributed memory architectures such as clusters. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the performance gap between Java and compiled languages has been narrowing in recent years, Java is an emerging option for High Performance Computing (HPC).
The developed communication middleware, ibvdev, increases the performance of Java applications on clusters of multicore processors interconnected via InfiniBand by: (1) providing Java with direct access to InfiniBand using the InfiniBand Verbs API, so far mostly restricted to MPI libraries; (2) implementing an efficient and scalable communication protocol which obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express.
The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits: up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving up to one order of magnitude performance increases compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increases the scalability of communication-intensive Java HPC applications.
The Journal of Supercomputing 01/2012; · 0.92 Impact Factor
ABSTRACT: The need for task-adapted and complete information for the management of resources is a well-known issue in Grid computing. Globus Toolkit 4 (GT4) includes the Monitoring and Discovery System component (MDS4) to carry out resource management. The Common Information Model (CIM) provides a standard conceptual view of the managed environment. This work improves the MDS4 functionality through the use of CIM, with the aim of providing a unified, standard representation of the Grid resources. Since a practical CIM model may contain a large volume of information, a new Index Service that represents the CIM information through Java instances is presented. In addition, a solution that keeps data in persistent storage has also been implemented. The evaluation of the proposed solutions achieves encouraging results, with an important reduction in memory consumption, good scalability when the number of instances increases, and a reasonable response time.
IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2011, Busan, Korea, 26-28 May, 2011; 01/2011
ABSTRACT: MapReduce is a powerful tool for processing large data sets, used by many applications running in distributed environments. However, despite the increasing number of computationally intensive problems that require low-latency communications, the adoption of MapReduce in High Performance Computing (HPC) is still emerging. Here, languages based on the Partitioned Global Address Space (PGAS) programming model have been shown to be a good choice for implementing parallel applications, taking advantage of the increasing number of cores per node and the programmability benefits achieved by their global memory view, such as transparent access to remote data. This paper presents the first PGAS-based MapReduce implementation that uses the Unified Parallel C (UPC) language, which (1) obtains programmability benefits in parallel programming, (2) offers advanced configuration options to define a customized load distribution for different codes, and (3) overcomes performance penalties and bottlenecks that have traditionally prevented the deployment of MapReduce applications in HPC. The performance evaluation of representative applications on shared and distributed memory environments assesses the scalability of the presented MapReduce framework, confirming its suitability.
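The MapReduce model the paper builds on can be illustrated with a minimal word count: a map phase emits one key/value pair per input word, and a reduce phase combines the values for each key. The Java-streams sketch below shows only the programming model, not the UPC framework itself; class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal MapReduce illustration: word count over a string.
public class MapReduceSketch {
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.split("\\s+"))          // map: emit (word, 1)
                     .collect(Collectors.groupingBy(       // shuffle: group by key
                             w -> w,
                             Collectors.counting()));      // reduce: sum per key
    }

    public static void main(String[] args) {
        // prints a map such as {codes=1, runs=1, upc=2}
        System.out.println(wordCount("upc runs upc codes"));
    }
}
```

In an HPC setting, the interesting part is distributing the map and reduce phases across nodes while keeping the per-key grouping semantics, which is where the PGAS global memory view helps.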
ABSTRACT: The uptrend in the number of cores in cluster architectures underscores the need for scalable communication middleware on these systems. One of the strategies to take advantage of this increase in the available computational power is the use of efficient message-passing middleware for inter-node communications and thread-based shared memory transfers within each node. This paper presents a Java communication middleware that exploits hybrid shared/distributed memory architectures through the use of scalable Java NIO sockets for inter-node communications and multi-threading on shared memory. Thus, communication-intensive applications running on clusters of multi-core processors can take advantage of this middleware. The performance of these codes generally relies on collective operations, such as broadcasting, scattering or gathering data, which have been optimized to make the most of these architectures. The evaluation of this middleware when relying on multi-core aware communication patterns has shown significant performance improvements both in collective operations and in communication-intensive applications.
13th IEEE International Conference on High Performance Computing & Communication, HPCC 2011, Banff, Alberta, Canada, September 2-4, 2011; 01/2011
ABSTRACT: This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory architecture which increases the complexity of communication protocols. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the performance gap between Java and compiled languages has been narrowing in recent years, Java is an emerging option for High Performance Computing (HPC).
Our MPJ collective communication library increases the performance of Java HPC applications on multi-core clusters by: (1) providing multi-core aware collective primitives; (2) implementing several algorithms (up to six) per collective operation, whereas publicly available MPJ libraries are usually restricted to one algorithm; (3) analyzing the efficiency of thread-based collective operations; (4) selecting at runtime the most efficient algorithm depending on the specific multi-core system architecture, and on the number of cores and the message length involved in the collective operation; (5) supporting the automatic performance tuning of the collectives depending on the system and communication parameters; and (6) allowing its integration in any MPJ implementation, as it is based on MPJ point-to-point primitives. A performance evaluation on an InfiniBand and Gigabit Ethernet multi-core cluster has shown that the implemented collectives significantly outperform the original ones and yield higher speedups in collective-communication-intensive Java HPC applications. Finally, the presented library has been successfully integrated in MPJ Express (http://mpj-express.org), and will be distributed with the next release.
The Journal of Supercomputing 01/2011; 55:126-154. · 0.92 Impact Factor
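Runtime algorithm selection for a collective can be sketched as a decision function over the message length and core count: short messages favor latency-optimized tree algorithms, long messages favor bandwidth-optimized ones. The thresholds, names, and rules below are illustrative assumptions, not the library's actual tuning data.

```java
// Sketch of runtime algorithm selection for a broadcast collective.
// Thresholds and algorithm names are hypothetical, for illustration only.
public class BcastSelector {
    enum Algo { FLAT_TREE, BINOMIAL_TREE, SCATTER_ALLGATHER }

    static Algo select(int msgBytes, int nCores) {
        if (nCores <= 4)          return Algo.FLAT_TREE;         // few cores: simple fan-out
        if (msgBytes < 64 * 1024) return Algo.BINOMIAL_TREE;     // short messages: latency-bound
        return Algo.SCATTER_ALLGATHER;                           // long messages: bandwidth-bound
    }

    public static void main(String[] args) {
        System.out.println(select(1024, 128));     // BINOMIAL_TREE
        System.out.println(select(1 << 20, 128));  // SCATTER_ALLGATHER
    }
}
```

Automatic tuning then amounts to benchmarking each algorithm on the target system and replacing the hard-coded thresholds with measured crossover points.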
ABSTRACT: The popularity of Partitioned Global Address Space (PGAS) languages has increased in recent years thanks to their high programmability and to the performance achieved through an efficient exploitation of data locality. This paper describes the implementation of efficient parallel dense triangular solvers in the PGAS language Unified Parallel C (UPC). The solvers are built on top of sequential BLAS functions and exploit the particularities of the PGAS paradigm. Furthermore, the numerical routines developed implement an automatic process that adapts the algorithms to the characteristics of the system where they are executed. The triangular solvers have been experimentally evaluated on two different multicore clusters and compared to message-passing based counterparts, demonstrating good scalability and efficiency.
ABSTRACT: With the evolution of high-performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new fault tolerance needs. Checkpointing has become a widely used technique to obtain fault tolerance. Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (Controller/Precompiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library containing checkpointing routines and a compiler which automates the use of the library. This paper gives an overview of the CPPC tool. Experimental results using benchmarks and large-scale real applications are included, demonstrating usability, efficiency and portability.
Concurrency and Computation: Practice and Experience. 01/2010; 22:749-766.
ABSTRACT: The growing complexity of computer system hierarchies, due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. This paper presents Servet, a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These benchmarks are able to detect the cache hierarchy, including cache sizes and which caches are shared by each core, bandwidths and bottlenecks in memory accesses, as well as communication latencies among cores. These parameters can be used by auto-tuned codes to increase their performance on multicore clusters. Experimental results using different representative systems show that Servet provides very accurate estimates of the parameters of the machine architecture.
24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings; 01/2010
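Memory benchmarks of this kind typically rely on pointer chasing over a random cyclic permutation: the random order defeats hardware prefetching, so the time per hop reflects the latency of the cache level holding the working set. The sketch below shows the setup of such a chase ring as an illustration of the general technique, not Servet's actual implementation; all names are hypothetical.

```java
import java.util.Random;

// Pointer-chase ring: next[] forms one random cycle over all indices.
public class ChaseRing {
    // Build a single random cycle visiting every index exactly once.
    static int[] buildRing(int n, long seed) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {            // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        int[] next = new int[n];
        for (int i = 0; i < n; i++)
            next[order[i]] = order[(i + 1) % n];     // link consecutive visits
        return next;
    }

    // Traverse the ring; timing this loop for growing n reveals
    // latency jumps at each cache-capacity boundary.
    static int chase(int[] next, int steps) {
        int p = 0;
        for (int s = 0; s < steps; s++) p = next[p];
        return p;
    }

    public static void main(String[] args) {
        int n = 1 << 10;
        int[] ring = buildRing(n, 42L);
        System.out.println(chase(ring, n));  // a full cycle returns to index 0
    }
}
```

Running `chase` over working sets of increasing size and plotting nanoseconds per hop is the standard way such suites infer cache sizes and sharing.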
ABSTRACT: This paper describes our experience teaching Computer Architecture and Engineering in the Master's in Computer Science at the Universidade da Coruña, under the combined circumstances of a newly introduced EHEA (European Higher Education Area) degree and a small number of students. The professional orientation of the master's program motivated us to explore teaching innovation geared towards professional practice, mainly through project-based learning methodologies combined with the following actions: (1) replacing lecture-based teaching with academically supervised project work; (2) delivering professional seminars; (3) using role-playing techniques; and (4) developing communication skills. Our overall assessment is that this methodology and its associated actions have proved extremely positive in the teaching of this subject.