Juan Touriño

University of A Coruña, A Coruña, Galicia, Spain

Are you Juan Touriño?

Claim your profile

Publications (103)26.98 Total impact

  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a Java implementation of the recently published MPI 3.0 nonblocking message passing collectives in order to analyze and assess the feasibility of taking advantage of these operations in shared memory systems using Java. Nonblocking collectives aim to exploit the overlapping between computation and communication for collective operations to increase scalability of message passing codes, as it has been carried out for nonblocking point-to-point primitives. This scalability has become crucial not only for clusters but also for shared memory systems because of the current trend of increasing the number of cores per chip, which is leading to the generalization of multi-core and many-core processors. Message passing libraries based on remote direct memory access, thread-based progression, or implementing pure multi-threading shared memory support could potentially benefit from the lack of imposed synchronization by nonblocking collectives. But, although the distributed memory scenario has been well studied, the shared memory one has not been tackled yet. Hence, nonblocking collectives support has been included in FastMPJ, a Message Passing in Java (MPJ) implementation, and evaluated on a representative shared memory system, obtaining significant improvements because of overlapping and lack of implicit synchronization, and with barely any overhead imposed over common blocking operations. Copyright © 2014 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 04/2014; · 0.85 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Cloud computing is posing several challenges, such as security, fault tolerance, access interface singularity, and network constraints, both in terms of latency and bandwidth. In this scenario, the performance of communications depends both on the network fabric and its efficient support in virtualized environments, which ultimately determines the overall system performance. To solve the current network constraints in cloud services, their providers are deploying high-speed networks, such as 10 Gigabit Ethernet. This paper presents an evaluation of high-performance computing message-passing middleware on a cloud computing infrastructure, Amazon EC2 cluster compute instances, equipped with 10 Gigabit Ethernet. The analysis of the experimental results, confronted with a similar testbed, has shown the significant impact that virtualized environments still have on communication performance, which demands more efficient communication middleware support to get over the current cloud network limitations.
    Personal and Ubiquitous Computing 12/2013; · 1.13 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Cloud computing is currently being explored by the scientific community to assess its suitability for High Performance Computing (HPC) environments. In this novel paradigm, compute and storage resources, as well as applications, can be dynamically provisioned on a pay-per-use basis. This paper presents a thorough evaluation of the I/O storage subsystem using the Amazon EC2 Cluster Compute platform and the recent High I/O instance type, to determine its suitability for I/O-intensive applications. The evaluation has been carried out at different layers using representative benchmarks in order to evaluate the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, both on local and distributed file systems. In addition, several I/O interfaces (POSIX, MPI-IO and HDF5) commonly used by scientific workloads have also been assessed. Furthermore, the scalability of a representative parallel I/O code has also been analyzed at the application level, taking into account both performance and cost metrics. The analysis of the experimental results has shown that available cloud storage devices can have different performance characteristics and usage constraints. Our comprehensive evaluation can help scientists to increase significantly (up to several times) the performance of I/O-intensive applications in Amazon EC2 cloud. An example of optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 of 2 ephemeral disks, TCP with 9,000 bytes MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.
    Journal of Grid Computing 12/2013; · 1.60 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: With the evolution of high-performance computing, parallel applications have developed an increasing necessity for fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two different abstraction levels: at the system level or at the application level. The latter has become an interesting alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert checkpoints in order to ensure that certain requirements are met (e.g. forcing checkpoints to be taken at the user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure aiming to provide an easy to use and clean IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences of both approaches are analyzed from the structural, behavioral and performance perspectives.
    International Journal of Parallel Programming 12/2013; 41(6). · 0.40 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The simulation of particle dynamics is an essential method to analyze and predict the behavior of molecules in a given medium. This work presents the design and implementation of a parallel simulation of Brownian dynamics with hydrodynamic interactions for shared memory systems using two approaches: (1) OpenMP directives and (2) the Partitioned Global Address Space (PGAS) paradigm with the Unified Parallel C (UPC) language. The structure of the code is described, and different techniques for work distribution are analyzed in terms of efficiency, in order to select the most suitable strategy for each part of the simulation. Additionally, performance results have been collected from two representative NUMA systems, and they are studied and compared against the original sequential code.
    The Journal of Supercomputing 09/2013; 65(3). · 0.92 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: SUMMARY Cloud computing is offering new approaches for High Performance Computing (HPC) as it provides dynamically scalable resources as a service over the Internet. In addition, General-Purpose computation on Graphical Processing Units (GPGPU) has gained much attention from scientific computing in multiple domains, thus becoming an important programming model in HPC. Compute Unified Device Architecture (CUDA) has been established as a popular programming model for GPGPUs, removing the need for using the graphics APIs for computing applications. Open Computing Language (OpenCL) is an emerging alternative not only for GPGPU but also for any parallel architecture. GPU clusters, usually programmed with a hybrid parallel paradigm mixing Message Passing Interface (MPI) with CUDA/OpenCL, are currently gaining high popularity. Therefore, cloud providers are deploying clusters with multiple GPUs per node and high-speed network interconnects in order to make them a feasible option for HPC as a Service (HPCaaS). This paper evaluates GPGPU for high performance cloud computing on a public cloud computing infrastructure, Amazon EC2 Cluster GPU Instances (CGI), equipped with NVIDIA Tesla GPUs and a 10 Gigabit Ethernet network. The analysis of the results, obtained using up to 64 GPUs and 256-processor cores, has shown that GPGPU is a viable option for high performance cloud computing despite the significant impact that virtualized environments still have on network overhead, which still hampers the adoption of GPGPU communication-intensive applications. Copyright © 2012 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 08/2013; 25(12). · 0.85 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The rising interest in Java for High Performance Computing (HPC) is based on the appealing features of this language for programming multi-core cluster architectures, particularly the built-in networking and multithreading support, and the continuous increase in Java Virtual Machine (JVM) performance. However, its adoption in this area is being delayed by the lack of analysis of the existing programming options in Java for HPC and thorough and up-to-date evaluations of their performance, as well as the unawareness on current research projects in this field, whose solutions are needed in order to boost the embracement of Java in HPC.This paper analyzes the current state of Java for HPC, both for shared and distributed memory programming, presents related research projects, and finally, evaluates the performance of current Java HPC solutions and research developments on two shared memory environments and two InfiniBand multi-core clusters. The main conclusions are that: (1) the significant interest in Java for HPC has led to the development of numerous projects, although usually quite modest, which may have prevented a higher development of Java in this field; (2) Java can achieve almost similar performance to natively compiled languages, both for sequential and parallel applications, being an alternative for HPC programming; (3) the recent advances in the efficient support of Java communications on shared memory and low-latency networks are bridging the gap between Java and natively compiled applications in HPC. Thus, the good prospects of Java in this area are attracting the attention of both industry and academia, which can take significant advantage of Java adoption in HPC.Highlights► Java is an emerging option for High Performance Computing on multi-core clusters. ► Java can achieve almost similar performance to natively compiled languages. ► Current state of Java for HPC, both for shared and distributed memory programming. ► Performance evaluation of Java message-passing and Java threads on multi-core clusters. ► Java is an alternative to HPC programming as it obtains comparable scalability results.
    Science of Computer Programming 05/2013; · 0.57 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The simulation of particle dynamics is among the most important mechanisms to study the behavior of molecules in a medium under specific conditions of temperature and density. Several models can be used to compute efficiently the forces that act on each particle, and also the interactions between them. This work presents the design and implementation of a parallel simulation code for the Brownian motion of particles in a fluid. Two different parallelization approaches have been followed: (1) using traditional distributed memory message-passing programming with MPI, and (2) using the Partitioned Global Address Space (PGAS) programming model, oriented towards hybrid shared/distributed memory systems, with the Unified Parallel C (UPC) language. Different techniques for domain decomposition and work distribution are analyzed in terms of efficiency and programmability, in order to select the most suitable strategy. Performance results on a supercomputer using up to 2048 cores are also presented for both MPI and UPC codes.
    Computer Physics Communications 04/2013; 184(4):1191–1202. · 2.41 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Unified Parallel C (UPC) is a Partitioned Global Address Space (PGAS) language whose popularity has increased during the last years owing to its high programmability and reasonable performance through an efficient exploitation of data locality, especially on hierarchical architectures like multicore clusters. However, the performance issues that arise in this language due to the irregular structure of sparse matrix operations have not yet been studied. Among them, the selection of an adequate storage format for the sparse matrices can significantly improve the efficiency of the parallel codes. This paper presents an evaluation, using UPC, of the most common sparse storage formats with different implementations of the matrix-vector and matrix-matrix products, which are key kernels in many scientific applications.
    The Journal of Supercomputing 04/2013; 64(1). · 0.92 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Up to now the tool considered network bandwidth as constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth decreases on modern large supercomputers depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation.
    Computers & Electrical Engineering 01/2013; 39(8):2483–2493. · 0.93 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
    Parallel Computing 01/2013; · 1.21 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The scalability of High Performance Computing (HPC) applications depends heavily on the efficient support of network communications in virtualized environments. However, Infrastructure as a Service (IaaS) providers are more focused on deploying systems with higher computational power interconnected via high-speed networks rather than improving the scalability of the communication middleware. This paper analyzes the main performance bottlenecks in HPC application scalability on the Amazon EC2 Cluster Compute platform: (1) evaluating the communication performance on shared memory and a virtualized 10 Gigabit Ethernet network; (2) assessing the scalability of representative HPC codes, the NAS Parallel Benchmarks, using an important number of cores, up to 512; (3) analyzing the new cluster instances (CC2), both in terms of single instance performance, scalability and cost-efficiency of its use; (4) suggesting techniques for reducing the impact of the virtualization overhead in the scalability of communication-intensive HPC codes, such as the direct access of the Virtual Machine to the network and reducing the number of processes per instance; and (5) proposing the combination of message-passing with multithreading as the most scalable and cost-effective option for running HPC applications on the Amazon EC2 Cluster Compute platform.
    Future Generation Computer Systems 01/2013; 29(1):218–229. · 2.64 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Land abandonment and stagnation of rural markets in the last few years have become one of the main concerns of rural administrations. The use of Web and GIS (Geographic Information System) technologies can help to mitigate the effects of these problems. This paper pro-poses a novel Web-GIS tool with spatial capabilities for the dynamization of rural land markets by encouraging the transfer of land from owners to farmers through the leasing of plots. The system, based on open source software, offers information about the properties, their environment and their owners. It uses standards for handling the geographic information and for communicating with external data sources. This system was used as the basis for the development of SITEGAL, the tool for the management of the Land Bank of Galicia (www.bantegal.com/sitegal). SITEGAL has been operational since 2007 obtaining benefits for both administration and users (farmers and land owners), and promoting the e-Government.
    Earth Science Informatics 01/2013; · 0.69 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The presence of many-core units as accelerators has been increasing due to their ability to improve the performance of highly parallel workloads. General Purpose GPU(GPGPU) computing has allowed the graphical units to emerge as successful co-processors that can be employed to improve the performance of many different non-graphical applications with high parallel requirements, which make them suitable for many High Performance Computing workloads. While the main libraries developed to exploit the massive parallel capacity of GPUs are oriented to C/C++ programmers, there have been several efforts to extend this support to other languages. Among them, Java stands out for being one of the most extended languages and there are multiple projects that try to enable Java to take advantage of GPGPU computing. In this scenario, this paper presents an evaluation of the most relevant among the current solutions that exploit GPGPU computing in Java.
    Advanced Information Networking and Applications Workshops (WAINA), 2013 27th International Conference on; 01/2013
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping with computational work. We study the interaction and compatibility of these two techniques for two matrix mul- tiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
    24th ACM/IEEE Conference on Supercomputing; 11/2012
  • [Show abstract] [Hide abstract]
    ABSTRACT: The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential basic linear algebra subprograms functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. Furthermore, the routines implement other optimization techniques, several of them by automatically taking into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message-passing-based parallel numerical library, demonstrating good scalability and efficiency. Copyright © 2012 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 09/2012; 24(14):1645-1667. · 0.85 Impact Factor
  • RISTI - Revista Ibérica de Sistemas e Tecnologias de Informação. 06/2012;
  • [Show abstract] [Hide abstract]
    ABSTRACT: Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proved to detect accurately cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, up to now the impact of the use of this information on application performance optimization has not been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.
    Computers & Electrical Engineering 03/2012; 38(2):258–269. · 0.93 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents ibvdev a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments are aggregating a significant number of cores through advanced network technologies, such as InfiniBand, increasing the complexity of communication protocols, especially when dealing with hybrid shared/distributed memory architectures such as clusters. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the gap between Java and compiled languages performance has been narrowing for the last years, Java is an emerging option for High Performance Computing (HPC). The developed communication middleware ibvdev increases Java applications performance on clusters of multicore processors interconnected via InfiniBand through: (1) providing Java with direct access to InfiniBand using InfiniBand Verbs API, somewhat restricted so far to MPI libraries; (2) implementing an efficient and scalable communication protocol which obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express. The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits, up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving up to one order of magnitude performance increases compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increments the scalability of communications intensive Java HPC applications.
    The Journal of Supercomputing 01/2012; · 0.92 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Partitioned Global Address Space (PGAS) languages offer programmers a shared memory view that increases their productivity and allow locality exploitation to obtain good performance on current large-scale distributed memory systems. UPCBLAS is a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C (UPC) language. The interface of this library exploits the characteristics of the PGAS memory model and thus it is easier to use than MPI-based libraries. This paper addresses the implementation of solvers of systems of equations through Cholesky and LU factorizations in UPC using UPCBLAS. The developed codes are experimentally evaluated and compared to the MPI versions using ScaLAPACK. Parallel solvers of equations are present in many parallel numerical applications and they have been traditionally developed in MPI. This work shows that UPCBLAS can be considered as a good alternative to the MPI-based libraries for increasing the productivity of numerical application developers.
    Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on; 01/2012