Juan Touriño

University of A Coruña, A Coruña, Galicia, Spain

Publications (120) · 41.03 Total impact

  • Dataset: ICCS11-talk
  • ABSTRACT: On-chip power consumption is one of the fundamental challenges of current technology scaling. Cache memories consume a sizable part of this power, particularly due to leakage energy. STT-RAM is one of several new memory technologies that have been proposed in order to improve power while preserving performance. It features high density and low leakage, but at the expense of write energy and performance. This article explores the use of STT-RAM-based scratchpad memories that trade nonvolatility in exchange for faster and less energetically expensive accesses, making them feasible for on-chip implementation in embedded systems. A novel multiretention scratchpad partitioning is proposed, featuring multiple storage spaces with different retention, energy, and performance characteristics. A customized compiler-based allocation algorithm suitable for use with such a scratchpad organization is described. Our experiments indicate that a multiretention STT-RAM scratchpad can provide energy savings of 53% with respect to an iso-area, hardware-managed SRAM cache.
    ACM Transactions on Architecture and Code Optimization 12/2014; 11(4):1-26. DOI:10.1145/2669556 · 0.60 Impact Factor
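As a rough illustration of the allocation problem the article addresses, the sketch below greedily maps program objects onto scratchpad partitions with different retention and write-energy characteristics, charging extra writes to objects that outlive a partition's retention time. All names, cost figures, and the greedy policy itself are hypothetical and are not taken from the paper's algorithm.

```c
#include <stdio.h>

/* Hypothetical sketch: greedy allocation of program objects to
 * multiretention scratchpad partitions.  Objects whose values must
 * survive longer than a partition's retention time are modeled as
 * needing refresh writes.  All names and cost figures are illustrative. */

typedef struct { const char *name; int size; double writes; double lifetime; } Object;
typedef struct { const char *name; int capacity; double write_energy; double retention; } Partition;

int main(void) {
    Object objs[] = {
        { "lut",    2048, 1e3, 5.0 },   /* rarely written, long-lived   */
        { "buffer", 1024, 1e6, 0.1 },   /* write-intensive, short-lived */
        { "state",  4096, 1e5, 1.0 },
    };
    Partition parts[] = {               /* shorter retention => cheaper writes */
        { "low-retention",  4096, 0.3, 0.5 },
        { "high-retention", 8192, 1.0, 10.0 },
    };
    int nobj = 3, npart = 2;

    for (int i = 0; i < nobj; i++) {
        int best = -1; double best_cost = 0.0;
        for (int p = 0; p < npart; p++) {
            if (parts[p].capacity < objs[i].size) continue;
            /* refreshes needed if the object outlives the retention time */
            double refreshes = objs[i].lifetime > parts[p].retention
                             ? objs[i].lifetime / parts[p].retention : 0.0;
            double cost = (objs[i].writes + refreshes) * parts[p].write_energy;
            if (best < 0 || cost < best_cost) { best = p; best_cost = cost; }
        }
        if (best >= 0) {
            parts[best].capacity -= objs[i].size;
            printf("%-6s -> %-14s (cost %.1f)\n", objs[i].name, parts[best].name, best_cost);
        }
    }
    return 0;
}
```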
  • ABSTRACT: This paper examines four different strategies, each one with its own data distribution, for implementing the parallel Conjugate Gradient (CG) method and how they impact communication and overall performance. Firstly, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms through a set of matrices that exhibit distinct sparse patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform.
    The Journal of Supercomputing 09/2014; 70(2):816-829. DOI:10.1007/s11227-014-1300-0 · 0.84 Impact Factor
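For reference, this is the sequential Conjugate Gradient iteration that the four UPC strategies distribute; dense storage and the tiny system below are used only to keep the sketch self-contained, since the paper's codes are independent of the sparse storage scheme.

```c
#include <stdio.h>
#include <math.h>

#define N 3

/* Minimal sequential CG sketch for a small dense SPD system. */

static double dot(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

static void matvec(const double A[N][N], const double *x, double *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }
}

int main(void) {
    double A[N][N] = { {4,1,0}, {1,3,1}, {0,1,2} };   /* SPD matrix */
    double b[N] = {1, 2, 3}, x[N] = {0, 0, 0};
    double r[N], p[N], Ap[N];

    matvec(A, x, Ap);
    for (int i = 0; i < N; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 100 && sqrt(rr) > 1e-10; k++) {
        matvec(A, p, Ap);                       /* matrix-vector product: the dominant kernel */
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);
    return 0;
}
```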
  • ABSTRACT: The performance and scalability of communications are key for HPC applications in the current multi-core era. Despite the significant benefits (e.g., productivity, multithreading) of Java for parallel programming, its poor communications support has hindered its adoption in HPC. This paper presents FastMPJ (http://fastmpj.com), an efficient Message-Passing in Java (MPJ) library, boosting Java for HPC by: (1) providing high performance shared memory communications using threads; (2) taking full advantage of high-speed networks to provide low-latency and high bandwidth communications; (3) including a scalable collective library with topology aware primitives, automatically selected at runtime; (4) avoiding Java data buffering overheads through zero-copy protocols; and (5) implementing the most widely extended MPI-like Java binding for a highly productive development. The performance evaluation on representative testbeds (InfiniBand, 10 Gigabit Ethernet, Myrinet, and shared memory systems) has shown that FastMPJ rivals MPI libraries, significantly improving the efficiency and scalability of communication-intensive Java HPC applications.
    Cluster Computing 09/2014; 17(3). DOI:10.1007/s10586-014-0345-4 · 0.95 Impact Factor
  • ABSTRACT: The use of GPUs for general purpose computation has increased dramatically in the past years due to the rising demands of computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.
    7th International Symposium on High-level Parallel Programming and Applications (HLPP), Amsterdam, Netherlands; 07/2014
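The kind of loop nest such a rewriting technique targets can be illustrated with single-precision matrix multiplication; the OpenHMPP codelet/callsite directives below follow the general style of that model but are only an approximation, not the directives actually generated by the authors' tool.

```c
#include <stdio.h>

#define N 256

/* Sketch of a single-precision GEMM kernel of the kind the paper's
 * locality-aware rewriting targets.  The #pragma hmpp lines are an
 * illustration of the OpenHMPP codelet/callsite style; exact clause
 * syntax may differ. */

#pragma hmpp sgemm codelet, target=CUDA, args[C].io=inout
void sgemm(float alpha, float beta,
           float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = alpha * acc + beta * C[i][j];
        }
}

int main(void) {
    static float A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; C[i][j] = 0.0f; }

#pragma hmpp sgemm callsite
    sgemm(1.0f, 0.0f, A, B, C);

    printf("C[0][0] = %.1f\n", C[0][0]);   /* expect 512.0 */
    return 0;
}
```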
  • ABSTRACT: This manuscript summarizes the main ideas introduced in [1]. We propose a compiler that automatically transforms a sequential application into a parallel counterpart for multicore processors. It is based on an intermediate representation, named KIR, which exposes multiple levels of parallelism and hides the complexity of the implementation details thanks to domain-independent kernels (e.g., assignment, reduction). The effectiveness and performance of our approach, built on top of GCC, have been tested with a large variety of codes.
    17th International Workshop on Software and Compilers for Embedded Systems (SCOPES), Schloss Rheinfels, St. Goar, Germany; 06/2014
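A minimal example of a domain-independent kernel: a reduction written with pointer arithmetic (one of the syntactic variations a kernel-centric compiler must see through) next to the OpenMP form such a compiler could emit. This is an illustrative sketch, not the output of the authors' GCC-based implementation.

```c
#include <stdio.h>

#define N 1000000

/* Reduction kernel in pointer form and the OpenMP parallelization a
 * kernel-centric compiler could generate for it (illustrative only). */

double reduction_seq(const double *v, int n) {
    double sum = 0.0;
    const double *p = v;
    while (p < v + n)          /* pointer-based traversal */
        sum += *p++;
    return sum;
}

double reduction_par(const double *v, int n) {
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

int main(void) {
    static double v[N];
    for (int i = 0; i < N; i++) v[i] = 1.0;
    printf("seq = %.0f, par = %.0f\n", reduction_seq(v, N), reduction_par(v, N));
    return 0;
}
```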
  • ABSTRACT: This paper presents a Java implementation of the recently published MPI 3.0 nonblocking message passing collectives in order to analyze and assess the feasibility of taking advantage of these operations in shared memory systems using Java. Nonblocking collectives aim to exploit the overlap between computation and communication in collective operations to increase the scalability of message passing codes, as has already been done for nonblocking point-to-point primitives. This scalability has become crucial not only for clusters but also for shared memory systems because of the current trend of increasing the number of cores per chip, which is leading to the generalization of multi-core and many-core processors. Message passing libraries based on remote direct memory access, thread-based progression, or pure multithreaded shared memory support could potentially benefit from the lack of imposed synchronization in nonblocking collectives. However, although the distributed memory scenario has been well studied, the shared memory one has not yet been tackled. Hence, nonblocking collectives support has been included in FastMPJ, a Message Passing in Java (MPJ) implementation, and evaluated on a representative shared memory system, obtaining significant improvements thanks to overlapping and the lack of implicit synchronization, with barely any overhead imposed over common blocking operations.
    Concurrency and Computation Practice and Experience 04/2014; DOI:10.1002/cpe.3279 · 0.78 Impact Factor
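The overlap pattern enabled by nonblocking collectives can be sketched with the standard MPI 3.0 C API (the paper evaluates the equivalent MPJ binding in Java): the collective is started, independent work proceeds, and completion is forced only when the result is needed.

```c
#include <mpi.h>
#include <stdio.h>

/* Overlap of computation and communication with an MPI 3.0 nonblocking
 * collective.  The "independent work" is a placeholder for computation
 * that does not depend on the reduction result. */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... independent computation overlapped with the collective ... */
    double busy = 0.0;
    for (int i = 0; i < 1000000; i++) busy += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* collective completes here */

    if (rank == 0)
        printf("sum of ranks = %.0f (busy work = %.3f)\n", global, busy);
    MPI_Finalize();
    return 0;
}
```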
  • ABSTRACT: Land abandonment and the stagnation of rural markets have become major concerns of rural administrations in recent years. The use of Web and GIS (Geographic Information System) technologies can help to mitigate the effects of these problems. This paper proposes a novel Web-GIS tool with spatial capabilities for the dynamization of rural land markets by encouraging the transfer of land from owners to farmers through the leasing of plots. The system, based on open source software, offers information about the properties, their environment and their owners. It uses standards for handling the geographic information and for communicating with external data sources. This system was used as the basis for the development of SITEGAL, the tool for the management of the Land Bank of Galicia (www.bantegal.com/sitegal). SITEGAL has been operational since 2007, obtaining benefits for both the administration and users (farmers and land owners), and promoting e-Government.
    Earth Science Informatics 12/2013; 6(4):1-18. DOI:10.1007/s12145-013-0123-5 · 0.69 Impact Factor
  • ABSTRACT: With the evolution of high-performance computing, parallel applications have developed an increasing need for fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two different abstraction levels: the system level or the application level. The latter has become an interesting alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert checkpoints in order to ensure that certain requirements are met (e.g. forcing checkpoints to be taken at the user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure aiming to provide an easy-to-use, clean IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences between both approaches are analyzed from the structural, behavioral and performance perspectives.
    International Journal of Parallel Programming 12/2013; 41(6). DOI:10.1007/s10766-012-0231-8 · 0.50 Impact Factor
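A sketch of the manual instrumentation that these compiler transformations are meant to automate is shown below: registering the relevant state and placing a checkpoint call at a safe point in user code. The cppc_* names are hypothetical placeholders, not the actual CPPC API.

```c
#include <stdio.h>

/* Illustration of manual checkpoint insertion.  The cppc_* functions are
 * stand-ins for a checkpointing runtime; their names are hypothetical. */

void cppc_register(void *addr, unsigned long bytes) { (void)addr; (void)bytes; }
void cppc_checkpoint(int iteration) { printf("checkpoint at iteration %d\n", iteration); }

#define CKPT_FREQ 100

int main(void) {
    double state[1024] = {0};
    cppc_register(state, sizeof state);           /* mark live application state */

    for (int it = 0; it < 1000; it++) {
        for (int i = 0; i < 1024; i++)            /* computation step */
            state[i] += 1.0;

        if (it % CKPT_FREQ == 0)
            cppc_checkpoint(it);                  /* safe point: user code, not inside a kernel routine */
    }
    printf("state[0] = %.0f\n", state[0]);
    return 0;
}
```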
  • ABSTRACT: Cloud computing poses several challenges, such as security, fault tolerance, access interface singularity, and network constraints, both in terms of latency and bandwidth. In this scenario, the performance of communications depends both on the network fabric and on its efficient support in virtualized environments, which ultimately determines the overall system performance. To overcome the current network constraints in cloud services, providers are deploying high-speed networks such as 10 Gigabit Ethernet. This paper presents an evaluation of high-performance computing message-passing middleware on a cloud computing infrastructure, Amazon EC2 cluster compute instances, equipped with 10 Gigabit Ethernet. The analysis of the experimental results, compared against a similar testbed, has shown the significant impact that virtualized environments still have on communication performance, which demands more efficient communication middleware support to overcome the current cloud network limitations.
    Personal and Ubiquitous Computing 12/2013; DOI:10.1007/s00779-012-0605-3 · 1.62 Impact Factor
  • ABSTRACT: Cloud computing is currently being explored by the scientific community to assess its suitability for High Performance Computing (HPC) environments. In this novel paradigm, compute and storage resources, as well as applications, can be dynamically provisioned on a pay-per-use basis. This paper presents a thorough evaluation of the I/O storage subsystem using the Amazon EC2 Cluster Compute platform and the recent High I/O instance type, to determine its suitability for I/O-intensive applications. The evaluation has been carried out at different layers using representative benchmarks in order to evaluate the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, both on local and distributed file systems. In addition, several I/O interfaces (POSIX, MPI-IO and HDF5) commonly used by scientific workloads have also been assessed. Furthermore, the scalability of a representative parallel I/O code has been analyzed at the application level, taking into account both performance and cost metrics. The analysis of the experimental results has shown that the available cloud storage devices can have different performance characteristics and usage constraints. Our comprehensive evaluation can help scientists to increase significantly (by up to several times) the performance of I/O-intensive applications in the Amazon EC2 cloud. An example of an optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 array of 2 ephemeral disks, TCP with a 9,000-byte MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.
    Journal of Grid Computing 12/2013; 11(4). DOI:10.1007/s10723-013-9250-y · 1.67 Impact Factor
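As a small example of the MPI-IO interface assessed in the evaluation, the sketch below has every rank write a contiguous block into a shared file with a collective call; the file name and sizes are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024

/* Each rank writes COUNT doubles at its own offset of a shared file
 * using a collective MPI-IO call. */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[COUNT];
    for (int i = 0; i < COUNT; i++) buf[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    if (rank == 0) printf("collective write done\n");
    MPI_Finalize();
    return 0;
}
```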
  • ABSTRACT: Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Until now the tool considered network bandwidth to be constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth on modern large supercomputers decreases depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation.
    Computers & Electrical Engineering 11/2013; 39(8):2483–2493. DOI:10.1016/j.compeleceng.2013.08.012 · 0.99 Impact Factor
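The kind of measurement these benchmarks rely on can be sketched as follows: several process pairs exchange large messages simultaneously, so per-pair bandwidth can be observed to degrade as more cores share the network. This is an illustrative MPI sketch, not Servet's actual benchmark code; it assumes an even number of ranks with each pair split across two nodes, and the message size and repetition count are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)
#define REPS 20

/* All pairs (0-1, 2-3, ...) ping-pong at the same time so that the
 * per-pair bandwidth reflects contention on the shared network link.
 * Assumes an even number of ranks. */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    MPI_Barrier(MPI_COMM_WORLD);                        /* start all pairs together */
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        if (rank % 2 == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    double mb = 2.0 * REPS * MSG_BYTES / (1024.0 * 1024.0);
    if (rank % 2 == 0)
        printf("pair %d-%d: %.1f MB/s with %d concurrent pairs\n",
               rank, peer, mb / t, size / 2);
    free(buf);
    MPI_Finalize();
    return 0;
}
```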
  • ABSTRACT: The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Rather, multicore processors arose from the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of the domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
    Parallel Computing 09/2013; DOI:10.1016/j.parco.2013.04.003 · 1.89 Impact Factor
  • ABSTRACT: The simulation of particle dynamics is an essential method to analyze and predict the behavior of molecules in a given medium. This work presents the design and implementation of a parallel simulation of Brownian dynamics with hydrodynamic interactions for shared memory systems using two approaches: (1) OpenMP directives and (2) the Partitioned Global Address Space (PGAS) paradigm with the Unified Parallel C (UPC) language. The structure of the code is described, and different techniques for work distribution are analyzed in terms of efficiency, in order to select the most suitable strategy for each part of the simulation. Additionally, performance results have been collected from two representative NUMA systems, and they are studied and compared against the original sequential code.
    The Journal of Supercomputing 09/2013; 65(3). DOI:10.1007/s11227-012-0843-1 · 0.84 Impact Factor
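The work-distribution question studied for the OpenMP version can be illustrated with a particle-interaction loop whose iterations have uneven cost and whose scheduling clause therefore matters; the interaction law below is a simple inverse-square placeholder rather than the hydrodynamic model used in the actual simulation.

```c
#include <stdio.h>
#include <math.h>

#define N 2048

/* Each thread computes the interaction contribution on a subset of
 * particles; the schedule clause is the work-distribution knob under
 * study.  The force law is an illustrative placeholder. */

int main(void) {
    static double x[N], y[N], fx[N], fy[N];
    for (int i = 0; i < N; i++) { x[i] = i % 64; y[i] = i / 64; }

#pragma omp parallel for schedule(dynamic, 64)   /* vs. schedule(static) */
    for (int i = 0; i < N; i++) {
        double fxi = 0.0, fyi = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = x[i] - x[j], dy = y[i] - y[j];
            double r2 = dx * dx + dy * dy + 1e-9;
            double inv = 1.0 / (r2 * sqrt(r2));
            fxi += dx * inv;
            fyi += dy * inv;
        }
        fx[i] = fxi;
        fy[i] = fyi;
    }
    printf("f[0] = (%.3f, %.3f)\n", fx[0], fy[0]);
    return 0;
}
```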
  • ABSTRACT: Cloud computing is offering new approaches for High Performance Computing (HPC) as it provides dynamically scalable resources as a service over the Internet. In addition, General-Purpose computation on Graphics Processing Units (GPGPU) has gained much attention from scientific computing in multiple domains, thus becoming an important programming model in HPC. Compute Unified Device Architecture (CUDA) has been established as a popular programming model for GPGPUs, removing the need for using the graphics APIs for computing applications. Open Computing Language (OpenCL) is an emerging alternative not only for GPGPU but also for any parallel architecture. GPU clusters, usually programmed with a hybrid parallel paradigm mixing Message Passing Interface (MPI) with CUDA/OpenCL, are currently gaining high popularity. Therefore, cloud providers are deploying clusters with multiple GPUs per node and high-speed network interconnects in order to make them a feasible option for HPC as a Service (HPCaaS). This paper evaluates GPGPU for high performance cloud computing on a public cloud computing infrastructure, Amazon EC2 Cluster GPU Instances (CGI), equipped with NVIDIA Tesla GPUs and a 10 Gigabit Ethernet network. The analysis of the results, obtained using up to 64 GPUs and 256 processor cores, has shown that GPGPU is a viable option for high performance cloud computing despite the significant impact that virtualized environments still have on network overhead, which still hampers the adoption of GPGPU for communication-intensive applications.
    Concurrency and Computation Practice and Experience 08/2013; 25(12). DOI:10.1002/cpe.2845 · 0.78 Impact Factor
  • ABSTRACT: The rising interest in Java for High Performance Computing (HPC) is based on the appealing features of this language for programming multi-core cluster architectures, particularly the built-in networking and multithreading support, and the continuous increase in Java Virtual Machine (JVM) performance. However, its adoption in this area is being delayed by the lack of analysis of the existing programming options in Java for HPC and of thorough and up-to-date evaluations of their performance, as well as by the lack of awareness of current research projects in this field, whose solutions are needed in order to boost the adoption of Java in HPC. This paper analyzes the current state of Java for HPC, both for shared and distributed memory programming, presents related research projects, and finally evaluates the performance of current Java HPC solutions and research developments on two shared memory environments and two InfiniBand multi-core clusters. The main conclusions are that: (1) the significant interest in Java for HPC has led to the development of numerous projects, although usually quite modest ones, which may have prevented a wider development of Java in this field; (2) Java can achieve performance close to that of natively compiled languages, both for sequential and parallel applications, making it an alternative for HPC programming; (3) the recent advances in the efficient support of Java communications on shared memory and low-latency networks are bridging the gap between Java and natively compiled applications in HPC. Thus, the good prospects of Java in this area are attracting the attention of both industry and academia, which can take significant advantage of Java adoption in HPC.
    Science of Computer Programming 05/2013; 78(5). DOI:10.1016/j.scico.2011.06.002 · 0.55 Impact Factor
  • ABSTRACT: Unified Parallel C (UPC) is a Partitioned Global Address Space (PGAS) language whose popularity has increased in recent years owing to its high programmability and reasonable performance through an efficient exploitation of data locality, especially on hierarchical architectures like multicore clusters. However, the performance issues that arise in this language due to the irregular structure of sparse matrix operations have not yet been studied. Among them, the selection of an adequate storage format for the sparse matrices can significantly improve the efficiency of the parallel codes. This paper presents an evaluation, using UPC, of the most common sparse storage formats with different implementations of the matrix-vector and matrix-matrix products, which are key kernels in many scientific applications.
    The Journal of Supercomputing 04/2013; 64(1). DOI:10.1007/s11227-012-0796-4 · 0.84 Impact Factor
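One of the most common formats covered by such an evaluation is CSR (Compressed Sparse Row); the sketch below shows the corresponding sparse matrix-vector product in plain C (the paper's kernels are written in UPC) on an illustrative 4x4 matrix.

```c
#include <stdio.h>

/* Sparse matrix-vector product y = A*x with CSR storage.
 * Matrix: [10 0 0 2; 3 9 0 0; 0 7 8 7; 0 0 1 5] */

int main(void) {
    double val[]     = {10, 2, 3, 9, 7, 8, 7, 1, 5};
    int    col_idx[] = { 0, 3, 0, 1, 1, 2, 3, 2, 3};
    int    row_ptr[] = { 0, 2, 4, 7, 9};
    int n = 4;

    double x[] = {1, 2, 3, 4};
    double y[4];

    for (int i = 0; i < n; i++) {          /* one row per iteration */
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * x[col_idx[k]];
        y[i] = acc;
    }

    for (int i = 0; i < n; i++) printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}
```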
  • ABSTRACT: The simulation of particle dynamics is among the most important mechanisms to study the behavior of molecules in a medium under specific conditions of temperature and density. Several models can be used to efficiently compute the forces that act on each particle, as well as the interactions between particles. This work presents the design and implementation of a parallel simulation code for the Brownian motion of particles in a fluid. Two different parallelization approaches have been followed: (1) using traditional distributed memory message-passing programming with MPI, and (2) using the Partitioned Global Address Space (PGAS) programming model, oriented towards hybrid shared/distributed memory systems, with the Unified Parallel C (UPC) language. Different techniques for domain decomposition and work distribution are analyzed in terms of efficiency and programmability, in order to select the most suitable strategy. Performance results on a supercomputer using up to 2048 cores are also presented for both MPI and UPC codes.
    Computer Physics Communications 04/2013; 184(4):1191–1202. DOI:10.1016/j.cpc.2012.12.015 · 2.41 Impact Factor